January 16, 2025
Upcoming lecture series
Lecture 4: Review of Probability and Statistics
Lecture 5: Statistical Inference - two group comparisons
Lecture 6: Statistical Inference - linear regression and ANOVA
Lecture 7: Statistical Inference - multiple linear regression
Lecture 8: Statistical Inference - continuous regression + limma
Lecture 9: Statistical Inference - multiple testing
Our (statistical) goals for this course:
gain a rigorous understanding of core principles of common analyses of high-dimensional biological data
build solid foundation to follow up on specific topics
Learning objectives:
Be familiar with the terminology: describe data as random variables with various types of sampling distributions
Gain intuition for the central concepts: understand how statistical inference and modeling can help us learn about the properties of a population
A framework for generating conclusions about a population from a sample of noisy data
Definition
Variable: an element, feature, or factor that is liable to vary or change
In statistical terminology, a variable is an unknown quantity that we’d like to study
Most research questions can be formulated as: “What’s the relationship between two or more variables?”
Definition
Random Variable (RV): A variable whose value results from the measurement of a quantity that is subject to variation (e.g. the outcome of an experiment)
Examples: a coin flip, a die roll, the expression level of gene X
An RV has a probability distribution
Definition
Probability: A number assigned to an outcome/event that describes the extent to which it is likely to occur
Probability must satisfy certain rules (e.g. be between 0 and 1)
Probability represents the (long-term) relative frequency of an event
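The long-run frequency interpretation can be illustrated with a small simulation. This is a minimal sketch, assuming a fair coin (probability of heads 0.5); the function name `relative_frequency_of_heads` and the seed are illustrative choices, not part of the course material.

```python
import random


def relative_frequency_of_heads(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin and return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# The relative frequency should settle near 0.5 as the number of tosses grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With few tosses the relative frequency can be far from 0.5; with many tosses it tends to stabilize close to the underlying probability.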
Definition
Probability distribution: A mathematical function that maps outcomes/events to probabilities
Experiment: Toss two coins
Sample space: set of all possible outcomes \(S=\{TT, HT, TH, HH\}\)
Random Variable of interest: number of heads
Outcome | Number of Heads
---|---
TT | 0
HT | 1
TH | 1
HH | 2
Let:
\(\omega=\) an outcome
\(X(\omega)=\) number of heads in \(\omega\) (RV)
Each possible outcome is associated with a probability
Event: A set of outcomes that satisfy some condition
Each realization of the RV corresponds to an event (e.g. \(X(\omega)=1\) corresponds to the outcomes \(TH\) and \(HT\) )
\(\omega\) | \(X(\omega)\) | Probability (assuming fair, independent coins)
---|---|---
TT | 0 | \(1/4\)
HT | 1 | \(1/4\)
TH | 1 | \(1/4\)
HH | 2 | \(1/4\)
The probability distribution of the Random Variable \(X\) tells us how likely each event (number of heads) is to occur in the experiment
Event | \(x\) | \(P(X=x)\) (assuming fair coins)
---|---|---
\(\{TT\}\) | 0 | \(1/4\)
\(\{HT, TH\}\) | 1 | \(1/2\)
\(\{HH\}\) | 2 | \(1/4\)
Note on notation: \(P(X=x)\) can also be written as \(P_X(x)\)
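The mapping from outcomes to events to probabilities can be worked through in code. The sketch below assumes two fair, independent coins, so each of the four outcomes has probability 1/4; the names `sample_space`, `number_of_heads`, and `pmf` are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space for two coin tosses: HH, HT, TH, TT
sample_space = ["".join(outcome) for outcome in product("HT", repeat=2)]

# Assumption: fair, independent coins, so all outcomes are equally likely.
p_outcome = Fraction(1, len(sample_space))


def number_of_heads(outcome):
    """The random variable X: maps an outcome to its number of heads."""
    return outcome.count("H")


# P(X = x) is the sum of the probabilities of all outcomes with X(outcome) = x.
pmf = defaultdict(Fraction)
for outcome in sample_space:
    pmf[number_of_heads(outcome)] += p_outcome

for x in sorted(pmf):
    print(f"P(X={x}) = {pmf[x]}")
```

Using `Fraction` keeps the probabilities exact, which makes it easy to confirm that they sum to 1.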
A discrete RV has a countable number of possible values
A continuous RV takes on values in an interval of numbers
survival time
number of chromosomes
mRNA expression level
probability density function (pdf): \[f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
Parameters: quantities that summarize a population
For convenience, we write \(N(\mu, \sigma^2)\)
When \(\mu=0\) and \(\sigma=1\), this is the Standard Normal distribution \(N(0,1)\)
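\[\text{pdf: }f(x|0,1) = \phi(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\]
In the coin flip example, we could sum up discrete probabilities to get the probability of an event. This is not the case for continuous RVs: the probability of any single value is 0, so probabilities are computed as areas under the pdf over an interval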
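For a continuous RV, a probability is an area under the pdf rather than a sum. The following sketch uses `statistics.NormalDist` from the Python standard library; the helper `interval_probability` is a hypothetical name introduced only for this illustration.

```python
from statistics import NormalDist

# Standard Normal distribution N(0, 1)
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def interval_probability(dist, a, b):
    """P(a < X < b), computed as the difference of the cdf at b and at a."""
    return dist.cdf(b) - dist.cdf(a)


# The pdf value at a point is a density, not a probability.
print(standard_normal.pdf(0.0))

# The probability of an interval is the area under the pdf over that interval.
print(interval_probability(standard_normal, -1.96, 1.96))

# A single point has zero width, so its probability is 0.
print(interval_probability(standard_normal, 1.0, 1.0))
```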
The parameter space is the set of all possible values of a parameter
One major goal: to “figure out” (i.e. estimate) the parameter values
Note
A model is a representation that (we hope) approximates the data and (more importantly) the population that the data were sampled from
Caution
If our analysis relies on the assumption that our observations are independent and they are not, our conclusions might be misleading
Experimental design is in part about trying to avoid unwanted dependence
Example of a design with unwanted dependence:
\[f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
Mean \(=\mu\)
Standard Deviation \(=\sigma\)
Population parameters are unknown (underlying properties of the population)
Estimator: A function (or rule) used to estimate a parameter of interest
Estimate: A particular realization (value) of an estimator
If we are given a sample of \(n\) observations from a normally distributed population, how do we estimate the parameter values \(\mu\) and \(\sigma\)?
Recall \(\mu\) is the mean and \(\sigma\) the standard deviation of the distribution
\[\hat{\mu} = \bar{x} = \frac{x_1 + x_2 + ... + x_n}{n} = \frac{1}{n} \sum_{i=1}^n x_i\]
\[\hat{\sigma} = s = \sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}}\]
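The two estimators above translate directly into code. This is a minimal sketch on a small made-up sample (the values in `sample` are illustrative, not real data); the function names are hypothetical.

```python
import math
import statistics


def estimate_mean(xs):
    """Sample mean: the sum of the observations divided by n."""
    return sum(xs) / len(xs)


def estimate_sd(xs):
    """Sample standard deviation with the n - 1 denominator."""
    x_bar = estimate_mean(xs)
    sum_of_squares = sum((x - x_bar) ** 2 for x in xs)
    return math.sqrt(sum_of_squares / (len(xs) - 1))


# Made-up sample for illustration only
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(estimate_mean(sample))
print(estimate_sd(sample))

# statistics.stdev also uses the n - 1 denominator, so the two should agree.
print(statistics.stdev(sample))
```

The functions are the estimators (rules); the numbers they return for this particular sample are the estimates.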
| | Estimators | Parameters |
|---|---|---|
| Summarize | Sample | Population (ground truth) |
| Value | Computed from data | Unknown |
| Notation | \(\hat{\theta}\) | \(\theta\) |
| | Estimator | Parameter |
|---|---|---|
| Summarizes | Sample/data | Population (ground truth) |
| Value | \(\bar{x}=\frac{1}{n} \sum_{i=1}^n x_i\) | Unknown |
| Notation | \(\hat{\mu}\) | \(\mu\) |
| | Estimator | Parameter |
|---|---|---|
| Summarizes | Sample/data | Population (ground truth) |
| Value | \(s=\sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}}\) | Unknown |
| Notation | \(\hat{\sigma}\) | \(\sigma\) |
Let’s say we collected a sample from a population we assume to be normal
We estimate the mean \(\large \hat{\mu}=\bar{x}\)
How good is the estimate?
Statistic: any quantity computed from values in a sample
Any function (or statistic) of a sample (data) is a random variable
Thus, any statistic (because it is random) has its own probability distribution function \(\rightarrow\) specifically, we call this the sampling distribution
Example: the sampling distribution of the mean
The sample mean \(\bar{x}\) is an RV, so it has a probability distribution (its sampling distribution)
By the Central Limit Theorem (CLT), for a sufficiently large sample the sampling distribution of the mean (of \(n\) observations) is approximately Normal with mean \(\mu_{\bar{X}} = \mu\) and standard deviation \(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\)
Warning
The standard deviation is not the same as the standard error
Standard error describes the variability of a statistic (e.g. the sample mean) across repeated samples from a population
Standard deviation describes the variability of individual observations (within a single sample, or in the population)
The sampling distribution of the mean of \(n\) observations (by CLT): \[\bar{X} \sim N(\mu, \frac{\sigma^2}{n})\]
The standard error of the mean is \(\frac{\sigma}{\sqrt{n}}\)
The standard deviation of \(X\) is \(\sigma\)
Just as we estimated \(\mu\) and \(\sigma\) for our sample of \(n\) observations from a normally distributed population before, we can also estimate \(\mu_{\bar{X}}\) and \(\sigma_{\bar{X}}\):
\(\hat{\mu}_{\bar{X}} = \hat{\mu} = \bar{x}\)
\(\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{s}{\sqrt{n}}\)
The standard error (SE) of the mean reflects uncertainty about our estimate of the population mean \(\large\hat{\mu}\)
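To make the distinction between the standard deviation and the standard error concrete, here is a small sketch that computes both for one made-up sample. The data values and the function name `standard_error_of_mean` are illustrative assumptions.

```python
import math
import statistics


def standard_error_of_mean(xs):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(xs) / math.sqrt(len(xs))


# Made-up sample for illustration only
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# s describes the spread of the individual observations.
print("s  =", statistics.stdev(sample))

# The SE describes the uncertainty in the sample mean; it is smaller by sqrt(n).
print("SE =", standard_error_of_mean(sample))
```

For a fixed spread of observations, the SE shrinks as the sample size grows, while the standard deviation does not.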
For the distributional assumptions to hold, the CLT assumes a ‘large enough’ sample:
Rule of thumb: when the sample size is ~30 or more, the normal distribution is a good approximation for the sampling distribution of the mean
for smaller samples, the estimated SE \(\frac{s}{\sqrt{n}}\) tends to underestimate the true standard error
…regardless of distribution
Let \(\normalsize X_1, X_2, ..., X_n\) be a random sample from a population with a non-normal distribution
If the sample size \(\normalsize n\) is sufficiently large, then the sampling distribution of the mean will be approximately normal: \(\normalsize \bar{X} \sim N(\mu, \frac{\sigma^2}{n})\)
On right: dashed pink line is \(N(\mu, \sigma^2/n)\)
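The CLT statement can be checked by simulation. The sketch below is an assumption-laden illustration rather than the code behind the figures: it draws samples from an exponential distribution with rate 1 (clearly non-normal, with \(\mu = 1\) and \(\sigma = 1\)), and the sample size, number of replicates, and seed are arbitrary choices.

```python
import math
import random
import statistics


def simulate_sample_means(n, n_replicates, rate=1.0, seed=1):
    """Draw n_replicates samples of size n and return the mean of each."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(n_replicates)
    ]


n = 30
means = simulate_sample_means(n=n, n_replicates=5000)

# For an exponential distribution with rate 1: mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0

# The mean of the sample means should be close to mu.
print(statistics.fmean(means), "vs", mu)

# The standard deviation of the sample means should be close to sigma / sqrt(n).
print(statistics.stdev(means), "vs", sigma / math.sqrt(n))
```

A histogram of `means` would look roughly bell-shaped even though the individual observations are strongly skewed.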
Hypothesis: A testable (falsifiable) idea for explaining a phenomenon
Statistical hypothesis: A hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables
Hypothesis Testing: A formal procedure for determining whether to reject or fail to reject a statistical hypothesis
Requires comparing two hypotheses:
\(H_0\): null hypothesis
\(H_A\) or \(H_1\): alternative hypothesis
Formulate your hypothesis as a statistical hypothesis
Define a test statistic \(t\) (RV) that corresponds to the question. You need to know the expected distribution of the test statistic under the null
Compute the p-value associated with the observed test statistic under the null distribution \(\normalsize p(t | H_0)\)
Which panel looks most significant?
Which looks least significant?
Mean difference needs to be put into context of the __________________ and __________________. Recall the formula for the sampling distribution of the mean:
\[\normalsize t=\frac{\bar{z}-\bar{y}}{SE_{\bar{z}-\bar{y}}}\] e.g. for \(z_1, z_2, ..., z_n\) expression measurements in healthy samples and \(y_1, y_2, ..., y_m\) cancer samples
If we assume:
\(\bar{Z}\) and \(\bar{Y}\) are normally distributed
\(Z\) and \(Y\) have equal variance
Then the standard error estimate for the difference in means is:
\[SE_{\bar{z}-\bar{y}} = s_p \sqrt{\frac{1}{n} + \frac{1}{m}} \text{ , where } s_p^2 = \frac{(n-1)s^2_z + (m-1)s^2_y}{(n-1) + (m-1)}\]
And, under the null hypothesis, our t-statistic follows a t distribution with \(n+m-2\) degrees of freedom: \[t \sim t_{n+m-2}\]
(Alternative formulations for unequal variance setting)
The value of the test statistic tells us how extreme our observed data are relative to the null distribution
Obtain the p-value by computing the area under the null distribution to the left and/or right of the observed t statistic (one-sided vs two-sided)
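The equal-variance two-sample t-test described above can be sketched end to end with the standard library. The data are made up for illustration, all function names are hypothetical, and the p-value is obtained by numerically integrating the t density with Simpson's rule; in practice a statistics package would normally be used instead.

```python
import math
import statistics


def pooled_t_statistic(z, y):
    """Return (t, df) for the equal-variance two-sample t-test."""
    n, m = len(z), len(y)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n - 1) * statistics.variance(z) + (m - 1) * statistics.variance(y)) / (
        (n - 1) + (m - 1)
    )
    se = math.sqrt(sp2) * math.sqrt(1 / n + 1 / m)
    t = (statistics.fmean(z) - statistics.fmean(y)) / se
    return t, n + m - 2


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=20000):
    """Two-sided p-value: 1 minus the area between -|t| and |t|.

    The area is approximated with Simpson's rule (steps must be even).
    math.gamma overflows for very large df, so this is for small samples only.
    """
    a = abs(t)
    if a == 0:
        return 1.0
    h = 2 * a / steps
    total = t_pdf(-a, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(-a + i * h, df)
    central_area = total * h / 3
    return max(0.0, 1.0 - central_area)


# Made-up expression values for illustration only
healthy = [5.1, 4.9, 5.6, 5.8, 6.0]
cancer = [4.2, 4.5, 3.9, 4.8, 4.1]

t, df = pooled_t_statistic(healthy, cancer)
print("t =", t, "df =", df)
print("two-sided p-value =", two_sided_p_value(t, df))
```

A one-sided p-value would instead use the area in a single tail, which for a symmetric distribution is half of the two-sided value when the observed statistic falls in the direction of the alternative.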
Random variables are variables that have a probability distribution
Any statistic of sampled data is a RV, and hence has an associated probability distribution
The CLT gives us the approximate sampling distribution of the mean of any RV with finite variance (regardless of its distribution), provided the sample is large enough
We can use statistical inference to estimate population parameters from a sample
Hypothesis testing gives us a framework to assess a statistical hypothesis