Courses at Princeton as an AI

QCB508 / QCB408 / Foundations of Statistical Genomics

Course ID: QCB 508 / QCB 408 

Title: Foundations of Statistical Genomics

Semester: 2022 Spring

Instructor: Dr. John D. Storey

3.1 Course Website: Canvas

3.2 Office Hour Recap

Preface: This is a brief summary of the conversations during Yushi’s office hours in Spring 2022. Because of the pandemic, office hours were held virtually this semester, and I could not give detailed proofs or answers without a blackboard. I have written those details up here in the hope that they help our friends learn something.

Mar 17, 2022


If my note from Mar 15, 2022 is too much, please check this simplified explanation and lemma.

\(\underline{Lemma}\). Let \((\Omega,\mathcal{F},\mathbb{P})\) be a probability space, \((E,\mathcal{E})\) a measurable space, \(X:\Omega\mapsto E\) a random variable, and \(A\) a set in \(\mathcal{E}\). Define the indicator function \[\begin{equation}\mathbb{1}_{A}=\mathbb{1}_{A}(X) = \mathbb{1}_{A}\circ X = \begin{cases} 1, \,\,\,\text{if } X\in A,\\0, \,\,\,\text{if } X\notin A. \end{cases} \end{equation}\] Then \[\begin{equation} \mathbb{P}(X\in A)=\mathbb{E}[\mathbb{1}_{A}]. \end{equation}\]

Proof. Intuitively \[\begin{equation} \mathbb{P}(X\in A) = \mathbb{P}(X\in A)\cdot 1 + \mathbb{P}(X\notin A)\cdot 0 = \mathbb{P}(\mathbb{1}_{A}=1)\cdot 1 + \mathbb{P}(\mathbb{1}_A=0)\cdot 0 = \sum_i i \cdot \mathbb{P}(\mathbb{1}_A=i) = \mathbb{E}[\mathbb{1}_{A}]. \end{equation}\]

\(\underline{Example\,\,(Distribution\,\,of\,\,random\,\,genotype)}.\) Let \(X\) be the random genotype in the random-allele-frequency model. We want to characterize the distribution of \(X\), but the only information we have is the conditional distribution of \(X\) given the individual-allele-frequency \(\pi\), written as \(X|\pi \sim \text{Binomial}(2,\pi)\). Here \(\pi\) comes from some other distribution. A strong candidate for this is BN\((p,f)\). We can use the above lemma to derive the distribution of \(X\). \(X\) has three possible outcomes: 0, 1, or 2. For the set \(A_2=\{2\}\), i.e., the event \(\{X\in A_2\}=\{\omega\in \Omega: X(\omega)=2\}\), we have \[\begin{align} \mathbb{P}(X=2) &= \mathbb{P}(X\in A_2)= \mathbb{P}(X\in A_2)\cdot 1 + \mathbb{P}(X\notin A_2)\cdot 0 =\mathbb{E}[\mathbb{1}_{A_2}] = \mathbb{E}\big[\mathbb{E}[\mathbb{1}_{A_2}|\pi]\big] \nonumber \\ &= \mathbb{E}\big[\mathbb{E}[\mathbb{1}_{X=2}|\pi]\big] = \mathbb{E}[\pi^2] = (\mathbb{E}[\pi])^2 + \mathbb{V}(\pi). \end{align}\] The same strategy yields \(\mathbb{P}(X=0)\) and \(\mathbb{P}(X=1)\).
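The identity \(\mathbb{P}(X=2)=(\mathbb{E}[\pi])^2+\mathbb{V}(\pi)\) can be checked numerically. The following is a minimal simulation sketch, not part of the course material. It assumes that BN\((p,f)\) denotes the Balding-Nichols distribution, parameterized as a Beta distribution with mean \(p\) and variance \(fp(1-p)\); the function name, the parameter values, and the number of draws are illustrative choices.

```python
import random


def simulate_prob_genotype_two(p, f, n_draws, seed=0):
    """Monte Carlo estimate of P(X = 2) when X | pi ~ Binomial(2, pi).

    Assumption (not stated in the note): pi ~ Beta(a, b) with
    a = p(1-f)/f and b = (1-p)(1-f)/f, so E[pi] = p and V(pi) = f p (1-p).
    """
    rng = random.Random(seed)
    a = p * (1 - f) / f
    b = (1 - p) * (1 - f) / f
    hits = 0
    for _ in range(n_draws):
        pi = rng.betavariate(a, b)
        # Binomial(2, pi) as the sum of two independent Bernoulli(pi) alleles.
        x = int(rng.random() < pi) + int(rng.random() < pi)
        if x == 2:
            hits += 1
    return hits / n_draws


p, f = 0.3, 0.1
estimate = simulate_prob_genotype_two(p, f, n_draws=100_000)
theory = p ** 2 + f * p * (1 - p)  # (E[pi])^2 + V(pi) under the assumption above
print(f"simulated: {estimate:.4f}   formula: {theory:.4f}")
```

With these settings the two numbers should agree up to Monte Carlo error, which shrinks as `n_draws` grows.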

Mar 15, 2022


\(\underline{Notation}\). Let \((\Omega,\mathcal{F},\mathbb{P})\) be a probability space and \((E,\mathcal{E})\) a measurable space. Expectations of a random variable \(X:\Omega\mapsto E\) can be interpreted as integrals with respect to \(\mathbb{P}\). Let \(\mu\) be the distribution of \(X\). We can interpret \(\mu\) as the image of \(\mathbb{P}\) under \(X\). To make this precise, let \(\mu=\mathbb{P}\circ X^{-1}\), where the mapping \(\mathbb{P}\circ X^{-1}:\mathcal{E}\mapsto \bar{\mathbb{R}}_{+}\) is defined by \[\begin{equation} \mathbb{P}\circ X^{-1}(A) = \mathbb{P}\big(X^{-1}(A)\big),\,\,\,\,\,\,A\in \mathcal{E}, \end{equation}\] which is a measure on \((E,\mathcal{E})\). For a set \(A\in \mathcal{E}\), we write the probability that \(X\) is in \(A\) as \[\begin{equation} \mu(A) = \mathbb{P}\circ X^{-1}(A) = \mathbb{P}\big(X^{-1}(A)\big) = \mathbb{P}(X\in A) = \mathbb{P}\{\omega \in \Omega:X(\omega) \in A\}, \end{equation}\] which is the image of \(\mathbb{P}\) under \(X\). Note that \(X^{-1}(A)\) is an event, i.e., a set in \(\mathcal{F}\), for every \(A\) in \(\mathcal{E}\), where \(X^{-1}(A) = \{X\in A\} = \{\omega \in \Omega:X(\omega) \in A\}\). This is exactly the requirement that \(X\) is measurable relative to \(\mathcal{F}\) and \(\mathcal{E}\).

\(\underline{Theorem}\). Let the function \(f:E\mapsto \Omega\) be measurable relative to \(\mathcal{E}\) and \(\mathcal{F}\). We define the composition \(Y=f\circ X\), which is a random variable taking values in \((\Omega, \mathcal{F})\), by \[\begin{equation} Y(\omega) = f\circ X(\omega) = f\big(X(\omega)\big),\,\,\,\,\,\,\text{for}\,\,\,\omega \in \Omega. \end{equation}\] If \(f\) is positive \(\mathcal{E}\)-measurable, then \[\begin{equation} \mathbb{E}[f\circ X] = \int_{\Omega}\mathbb{P}(d\omega)f\big(X(\omega)\big) = \int_{\Omega} f\big(X(\omega)\big) d\mathbb{P} = \mathbb{P}(f\circ X) = (\mathbb{P}\circ X^{-1})(f) = \mu(f). \end{equation}\] Notice that \(\mathbb{P}(f\circ X) = (\mathbb{P}\circ X^{-1})(f)\) holds since the function \(g(f)=\mathbb{P}(f\circ X)\), which is a mapping \(g: \mathcal{E}_{+}\mapsto \bar{\mathbb{R}}_{+}\), satisfies \(g(f)=\mu(f)\) for a unique measure \(\mu\) on \((E,\mathcal{E})\) according to the integral characterization theorem. More specifically, this unique measure \(\mu\) is \(\mathbb{P}\circ X^{-1}\), because \(\mu(A)=g(\mathbb{1}_{A})=\mathbb{P}(\mathbb{1}_A\circ X)=\mathbb{P}\big(X^{-1}(A)\big)=\mathbb{P}\circ X^{-1}(A)\) for \(A\in \mathcal{E}\). See the corollary below for the detailed definition of \(\mathbb{1}_{A}\circ X\).
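To make the identity \(\mathbb{E}[f\circ X]=\mu(f)\) concrete, here is a small sketch on a finite probability space, where every integral reduces to a finite sum. The space (a fair six-sided die), the random variable `X`, the function `f`, and the set `A` are hypothetical choices made only for illustration; exact rational arithmetic is used so that both sides can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical finite probability space: a fair six-sided die.
omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}


def X(w):
    """Random variable X: Omega -> E = {0, 1, 2} (remainder modulo 3)."""
    return w % 3


def f(x):
    """A positive function on E."""
    return x * x + 1


# Distribution mu = P o X^{-1}: mu({x}) = P({w : X(w) = x}).
mu = {}
for w in omega:
    mu[X(w)] = mu.get(X(w), Fraction(0)) + P[w]

# E[f o X], computed as an integral over Omega with respect to P.
integral_over_omega = sum(P[w] * f(X(w)) for w in omega)
# mu(f), computed as an integral over E with respect to mu.
integral_over_E = sum(mu[x] * f(x) for x in mu)

# Special case f = 1_A: mu(A) = E[1_A o X].
A = {0, 2}
mu_A = sum(mu[x] for x in A)
expected_indicator = sum(P[w] * (1 if X(w) in A else 0) for w in omega)

print(integral_over_omega, integral_over_E)  # both sides of E[f o X] = mu(f)
print(mu_A, expected_indicator)              # both sides of mu(A) = E[1_A o X]
```

The last two lines correspond to taking \(f=\mathbb{1}_{A}\), which is the special case used to identify \(\mu\) with \(\mathbb{P}\circ X^{-1}\).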

\(\underline{Corollary}\). Define the indicator function \[\begin{equation}\mathbb{1}_{A}(X) = \mathbb{1}_{A}\circ X = \begin{cases} 1, \,\,\,\text{if } X\in A,\\0, \,\,\,\text{if } X\notin A. \end{cases} \end{equation}\] Notice that \(\mathbb{1}_{A}\) is positive \(\mathcal{E}\)-measurable. Let \(f=\mathbb{1}_{A}\), then we write \(\mu\), the distribution of \(X\), as \[\begin{equation} \mu(A) = \mu (\mathbb{1}_{A}) = \mathbb{E}[\mathbb{1}_{A}\circ X], \end{equation}\] which is the integral interpretation of the distribution of a random variable. A useful application of this corollary is to characterize the distribution of a random variable (for example, to compute its p.m.f. or p.d.f.) when we only know its conditional distribution given other variables. Let us take a look at the following example.

\(\underline{Example\,\,(Distribution\,\,of\,\,random\,\,genotype)}.\) Let \(X\) be the random genotype in the random-allele-frequency model. We want to characterize the distribution of \(X\), but the only information we have is the conditional distribution of \(X\) given the individual-allele-frequency \(\pi\), written as \(X|\pi \sim \text{Binomial}(2,\pi)\). Here \(\pi\) comes from some other distribution. A strong candidate for this is BN\((p,f)\). We can use the above corollary to derive the distribution of \(X\). \(X\) has three possible outcomes: 0, 1, or 2. For the set \(A_0=\{0\}\), i.e., the event \(\{X\in A_0\}=\{\omega\in \Omega: X(\omega)=0\}\), we have \[\begin{align} \mathbb{P}(X=0) &= \mathbb{P}(X\in A_0)= \mathbb{P}(X\in A_0)\cdot 1 + \mathbb{P}(X\notin A_0)\cdot 0 =\mathbb{E}[\mathbb{1}_{A_0} \circ X] = \mathbb{E}\big[\mathbb{E}[\mathbb{1}_{A_0}\circ X|\pi]\big] \nonumber \\ &= \mathbb{E}\big[\mathbb{E}[\mathbb{1}_{X=0}|\pi]\big] = \mathbb{E}\big[(1-\pi)^2\big] = \cdots \end{align}\] The remaining steps follow from \(\mathbb{E}[Z^2]=(\mathbb{E}[Z])^2+\mathbb{V}(Z)\) with \(Z=1-\pi\). The same strategy yields \(\mathbb{P}(X=1)\) and \(\mathbb{P}(X=2)\).
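For reference, here is a sketch of where the three calculations end up, using only \(X|\pi \sim \text{Binomial}(2,\pi)\), the identity \(\mathbb{E}[Z^2]=(\mathbb{E}[Z])^2+\mathbb{V}(Z)\), and \(\mathbb{V}(1-\pi)=\mathbb{V}(\pi)\): \[\begin{align} \mathbb{P}(X=0) &= \mathbb{E}\big[(1-\pi)^2\big] = (1-\mathbb{E}[\pi])^2 + \mathbb{V}(\pi), \nonumber \\ \mathbb{P}(X=1) &= \mathbb{E}\big[2\pi(1-\pi)\big] = 2\mathbb{E}[\pi] - 2\mathbb{E}[\pi^2] = 2\mathbb{E}[\pi](1-\mathbb{E}[\pi]) - 2\mathbb{V}(\pi), \nonumber \\ \mathbb{P}(X=2) &= \mathbb{E}[\pi^2] = (\mathbb{E}[\pi])^2 + \mathbb{V}(\pi). \end{align}\] The three probabilities sum to 1, as they must. If one additionally assumes \(\mathbb{E}[\pi]=p\) and \(\mathbb{V}(\pi)=fp(1-p)\) (the moments of the Balding-Nichols distribution, which is an assumption about what BN\((p,f)\) denotes here), these become \((1-p)^2+fp(1-p)\), \(2p(1-p)(1-f)\), and \(p^2+fp(1-p)\), respectively.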

Feb 24, 2022


Feb 22, 2022


Let \(X\) and \(Y\) be two genotypes at the same locus drawn randomly from a population under the identity-by-descent (IBD) model with allele frequency \(p\) and inbreeding coefficient \(f\). We are interested in evaluating the covariance between genotypes \(X\) and \(Y\). Recall that IBD is defined in terms of alleles, i.e., through the probability that two alleles are IBD. Hence a basic setup is to decompose \(X\) and \(Y\) at the allele level. Suppose \(X\) consists of two alleles \(X_1\) and \(X_2\), and \(Y\) consists of two alleles \(Y_1\) and \(Y_2\). Suppose \(X_1\) (and \(Y_1\)) comes from the maternal side and \(X_2\) (and \(Y_2\)) comes from the paternal side. We write \[X=X_1+X_2,\,\,\,\,\,\,Y=Y_1+Y_2,\] thereby deriving their covariance \[\begin{align} \text{Cov}(X,Y) &= \text{Cov}(X_1+X_2, Y_1+Y_2) \nonumber \\ &= \mathbb{E}[(X_1+X_2)(Y_1+Y_2)]-\mathbb{E}[X_1+X_2]\mathbb{E}[Y_1+Y_2] \nonumber \\ &= \mathbb{E}[X_1Y_1]+\mathbb{E}[X_1Y_2]+\mathbb{E}[X_2Y_1]+\mathbb{E}[X_2Y_2] - \mathbb{E}[X_1]\mathbb{E}[Y_1]-\mathbb{E}[X_1]\mathbb{E}[Y_2]-\mathbb{E}[X_2]\mathbb{E}[Y_1]-\mathbb{E}[X_2]\mathbb{E}[Y_2] \nonumber \\ &= (\mathbb{E}[X_1Y_1]-\mathbb{E}[X_1]\mathbb{E}[Y_1]) + (\mathbb{E}[X_1Y_2]-\mathbb{E}[X_1]\mathbb{E}[Y_2]) + (\mathbb{E}[X_2Y_1]-\mathbb{E}[X_2]\mathbb{E}[Y_1]) + (\mathbb{E}[X_2Y_2]-\mathbb{E}[X_2]\mathbb{E}[Y_2]) \nonumber \\ &= \text{Cov}(X_1,Y_1) + \text{Cov}(X_1,Y_2) + \text{Cov}(X_2,Y_1) + \text{Cov}(X_2,Y_2) = \sum_{j=1}^2 \sum_{i=1}^2 \text{Cov}(X_i,Y_j) \nonumber \\ &= 4\text{Cov}(X_1,Y_1). \end{align}\]

The last line holds because, under the model, all four covariance terms in line 5 are equal.
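The bilinearity step (lines 1 to 5 of the derivation) does not depend on the model and also holds exactly for sample covariances, so it can be checked on simulated data. The sketch below uses a hypothetical toy sampling scheme in which each allele copies a shared ancestral allele with probability `share`; this scheme is only a device for producing correlated 0/1 alleles and is not the course's IBD model. The final step, \(4\text{Cov}(X_1,Y_1)\), relies on the model's symmetry and is not tested here.

```python
import random


def sample_cov(u, v):
    """Sample covariance (1/n normalization) of two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n


def draw_allele(rng, ancestor, p, share):
    """Copy the ancestral allele with probability `share`, else draw a fresh one."""
    if rng.random() < share:
        return ancestor
    return int(rng.random() < p)


rng = random.Random(1)
p, share, n = 0.4, 0.3, 1000  # illustrative values only
x1, x2, y1, y2 = [], [], [], []
for _ in range(n):
    ancestor = int(rng.random() < p)
    x1.append(draw_allele(rng, ancestor, p, share))
    x2.append(draw_allele(rng, ancestor, p, share))
    y1.append(draw_allele(rng, ancestor, p, share))
    y2.append(draw_allele(rng, ancestor, p, share))

# Genotypes as sums of the two allele indicators.
x = [a + b for a, b in zip(x1, x2)]
y = [a + b for a, b in zip(y1, y2)]

cov_genotypes = sample_cov(x, y)
cov_sum_of_alleles = sum(sample_cov(xi, yj) for xi in (x1, x2) for yj in (y1, y2))
print(cov_genotypes, cov_sum_of_alleles)  # equal up to floating-point rounding
```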

Feb 15, 2022


Let us begin by proving a simple version of the law of total variance:

\[\begin{align} \mathbb{V}(X) &= \mathbb{E}\big[(X-\mathbb{E}[X])^2\big] \nonumber \\ &= \mathbb{E}\big[(X-\mathbb{E}[X|Y]+\mathbb{E}[X|Y]-\mathbb{E}[X])^2\big] \nonumber \\ &= \Big\{\mathbb{E}\big[(X-\mathbb{E}[X|Y])^2\big] + 2\mathbb{E}\big[(X-\mathbb{E}[X|Y])(\mathbb{E}[X|Y]-\mathbb{E}[X])\big] \Big\}+ \mathbb{E}\big[(\mathbb{E}[X|Y]-\mathbb{E}[X])^2\big] \nonumber \\ &= \Big\{\mathbb{E}[X^2]-\mathbb{E}[(\mathbb{E}[X|Y])^2]\Big\} + \mathbb{E}\big[(\mathbb{E}[X|Y]-\mathbb{E}[\mathbb{E}[X|Y]])^2\big] \nonumber \\ &= \mathbb{E}\big[\mathbb{E}[X^2|Y]-(\mathbb{E}[X|Y])^2\big] + \mathbb{V}(\mathbb{E}[X|Y]) \nonumber \\ &= \mathbb{E}[\mathbb{V}(X|Y)] + \mathbb{V}(\mathbb{E}[X|Y]). \end{align}\]

Line 1 (expanding the variance/covariance term by definition) and line 2 (a small trick to introduce the conditional term) may provide hints for proving the following equation:

\[\begin{equation} \text{Cov}(X,Y)=\mathbb{E}[\text{Cov}(X,Y|Z)]+\text{Cov}(\mathbb{E}[X|Z],\mathbb{E}[Y|Z]). \end{equation}\]

More specifically, try to think about two questions: 1) what is the mathematical definition of \(\text{Cov}(X,Y)\); 2) how can the conditional terms be introduced into the expanded version of \(\text{Cov}(X,Y)\)? Here are more details for deriving the first term in line 4 from line 3:

\[\begin{align} \mathbb{E}\big[(X-\mathbb{E}[X|Y])^2\big]&=\mathbb{E}[X^2]-2\mathbb{E}\big[X\mathbb{E}[X|Y]\big]+\mathbb{E}\big[(\mathbb{E}[X|Y])^2\big] \nonumber \\ 2\mathbb{E}\big[(X-\mathbb{E}[X|Y])(\mathbb{E}[X|Y]-\mathbb{E}[X])\big] &= 2\mathbb{E}\big[X\mathbb{E}[X|Y]\big]-2\mathbb{E}\big[X\mathbb{E}[X]\big]-2\mathbb{E}\big[(\mathbb{E}[X|Y])^2\big]+2\mathbb{E}\big[\mathbb{E}[X|Y]\mathbb{E}[X]\big] \nonumber \\ &= 2\mathbb{E}\big[X\mathbb{E}[X|Y]\big]-2(\mathbb{E}[X])^2-2\mathbb{E}\big[(\mathbb{E}[X|Y])^2\big]+2(\mathbb{E}[X])^2 \nonumber \\ &= 2\mathbb{E}\big[X\mathbb{E}[X|Y]\big]-2\mathbb{E}\big[(\mathbb{E}[X|Y])^2\big] \nonumber \\ \Rightarrow \mathbb{E}\big[(X-\mathbb{E}[X|Y])^2\big]+ 2\mathbb{E}\big[(X-\mathbb{E}[X|Y])(\mathbb{E}[X|Y]-\mathbb{E}[X])\big] &= \mathbb{E}[X^2]-\mathbb{E}\big[(\mathbb{E}[X|Y])^2\big]. \end{align}\]
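The law of total variance can also be verified exactly on a small discrete example. The sketch below uses a hypothetical joint p.m.f. of \((X,Y)\), chosen only for illustration, and exact rational arithmetic, so the two sides are compared without rounding. It also checks the identity \(\mathbb{E}\big[X\mathbb{E}[X|Y]\big]=\mathbb{E}\big[(\mathbb{E}[X|Y])^2\big]\), which is consistent with the simplification above.

```python
from fractions import Fraction

# Hypothetical joint p.m.f. of (X, Y); the weights sum to 1.
joint = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(1, 4), (2, 0): Fraction(1, 8),
    (0, 1): Fraction(1, 16), (1, 1): Fraction(1, 8), (2, 1): Fraction(5, 16),
}


def expect(g):
    """E[g(X, Y)] under the joint p.m.f."""
    return sum(w * g(x, y) for (x, y), w in joint.items())


# Marginal p.m.f. of Y.
p_y = {}
for (x, y), w in joint.items():
    p_y[y] = p_y.get(y, Fraction(0)) + w


def cond_moment(y0, k):
    """E[X^k | Y = y0]."""
    return sum(w * x ** k for (x, y), w in joint.items() if y == y0) / p_y[y0]


cond_mean = {y0: cond_moment(y0, 1) for y0 in p_y}                      # E[X | Y]
cond_var = {y0: cond_moment(y0, 2) - cond_mean[y0] ** 2 for y0 in p_y}  # V(X | Y)

# Left-hand side: V(X).
var_x = expect(lambda x, y: x * x) - expect(lambda x, y: x) ** 2

# Right-hand side: E[V(X | Y)] + V(E[X | Y]).
mean_of_cond_var = sum(p_y[y0] * cond_var[y0] for y0 in p_y)
mean_x = sum(p_y[y0] * cond_mean[y0] for y0 in p_y)
var_of_cond_mean = sum(p_y[y0] * (cond_mean[y0] - mean_x) ** 2 for y0 in p_y)

# Identity behind the cross term: E[X E[X|Y]] = E[(E[X|Y])^2].
e_x_cond = expect(lambda x, y: x * cond_mean[y])
e_cond_sq = expect(lambda x, y: cond_mean[y] ** 2)

print(var_x, mean_of_cond_var + var_of_cond_mean)
print(e_x_cond, e_cond_sq)
```

The same setup can be extended to a joint p.m.f. of \((X,Y,Z)\) to check the covariance decomposition numerically.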

Courses at Harvard as a TA

STAT215 / STAT115 / BST282 / Introduction to Computational Biology and Bioinformatics

Course ID: STAT 215 / STAT 115 / BST 282

Title: Introduction to Computational Biology and Bioinformatics

Semester: 2019 Spring

Instructor: Dr. X. Shirley Liu

2.1 Course Website: Canvas

2.2 Evaluation: Students’ comments about Prof. Liu and me

2.3 Chapters Instructed by Me:

Lab 5: Bioinformatic Tools for RNA-seq Analysis, DESeq2, and Cluster Computing (video and material)
Lab 6: Single Cell RNA-seq Analysis and Optimizing I/O Performance on the Cluster (video 1, video 2 and material)
Lab 7: ChIP-seq Data Analysis, Bedtools, UCSC Genome Browser, Cistrome and Odyssey (video and material)
Lab 8: BETA, VCF, ChIP-seq Motifs Analysis (video and material)
Lab 9: Hidden Markov Model and TCGA Data (video 1, video 2 and material)
Lab 10: GWAS, HaploReg, RegulomeDB, Regression, and Feature Selection (video and material)

BST215 / Linear and Longitudinal Regression

Course ID: BST 215

Title: Linear and Longitudinal Regression

Semester: 2018 Summer

Instructor: Dr. Garrett Fitzmaurice

1.1 Course Website: Canvas

1.2 Evaluation: Students’ comments about Prof. Fitzmaurice and me