Continuous models and intro to limma

Keegan Korthauer

February 6, 2024

Laying the foundation of linear models

  • t-tests can be used to test the equality of 2 population means

  • ANOVA can be used to test the equality of more than 2 population means

  • Linear regression provides a general framework for modeling the relationship between a response variable and different types of explanatory variables

    • t-tests can be used to test the significance of individual coefficients

    • F-tests can be used to test the simultaneous significance of multiple coefficients (e.g. multiple levels of a single categorical factor, or multiple factors at once)

    • F-tests are used to compare nested models (overall effects or goodness of fit)

Learning objectives for today

  1. Understand how linear regression represents continuous variables:

    1. Be familiar with the intuition behind how the regression line is estimated (Ordinary Least Squares)

    2. Interpret parameters in a multiple linear regression model with continuous and factor variables

  2. Explain the motivation behind specialized regression models in high-dimensional settings

    1. Specifically, the Empirical Bayes techniques in limma

What if we treat age as a continuous variable?

Code
library(GEOquery)
library(tidyverse)   # dplyr, tidyr, tibble, forcats, ggplot2 (used throughout)
library(gridExtra)   # grid.arrange
library(broom)       # tidy

# read in our dataset
eset <- getGEO("GSE4051", getGPL = FALSE)[[1]]

# recode time points
pData(eset) <- pData(eset) %>%
  mutate(sample_id = geo_accession) %>%
  mutate(dev_stage =  case_when(
    grepl("E16", title) ~ "E16",
    grepl("P2", title) ~ "P2",
    grepl("P6", title) ~ "P6",
    grepl("P10", title) ~ "P10",
    grepl("4 weeks", title) ~ "P28"
  )) %>%
  mutate(genotype = case_when(
    grepl("Nrl-ko", title) ~ "NrlKO",
    grepl("wt", title) ~ "WT"
  ))

# reorder factor levels, add continuous age variable
pData(eset) <- pData(eset) %>%
  mutate(dev_stage = fct_relevel(dev_stage, "E16", "P2", "P6", "P10", "P28")) %>%
  mutate(genotype = as.factor(genotype)) %>%
  mutate(genotype = fct_relevel(genotype, "WT", "NrlKO")) %>%
  mutate(age = ifelse(dev_stage == "E16", -4,
                            ifelse(dev_stage == "P2", 2, 
                                   ifelse(dev_stage == "P6", 6, 
                                          ifelse(dev_stage == "P10", 10, 28)))))

# function to return tidy data that merges expression matrix and metadata
toLongerMeta <- function(expset) {
    stopifnot(class(expset) == "ExpressionSet")
    
    expressionMatrix <- exprs(expset) %>% 
    as.data.frame() %>%
    rownames_to_column("gene") %>%
    pivot_longer(cols = !gene, 
                 values_to = "expression",
                 names_to = "sample_id") %>%
    left_join(pData(expset) %>% select(sample_id, dev_stage, age, genotype),
            by = "sample_id")
  return(expressionMatrix)
}

# pull out two genes of interest
twoGenes <- toLongerMeta(eset) %>% 
  filter(gene %in% c("1456341_a_at", "1441811_x_at")) %>%
  mutate(gene = ifelse(gene == "1456341_a_at", "Klf9", "Tmem176a")) 

# make some plots - first with age continuous
Klf9_C <- ggplot(filter(twoGenes, gene == "Klf9"), 
                  aes(x = age, y = expression)) + 
             geom_point(alpha = 0.7) +
             labs(title = "Klf9") +
             theme(legend.position = "none") +
             ylim(5, 11) +
             xlab("age (days)") 

Tmem176a_C <- ggplot(filter(twoGenes, gene == "Tmem176a"), 
                 aes(x = age, y = expression)) + 
             geom_point(alpha = 0.7) +
             labs(title = "Tmem176a") +
             ylim(5, 11) +
             xlab("age (days)") 

# next with age categorical
Klf9 <- ggplot(filter(twoGenes, gene == "Klf9"), 
                  aes(x = dev_stage, y = expression)) + 
             geom_jitter(width = 0, alpha = 0.7) +
             theme(legend.position = "none") +
             labs(title = "Klf9") +
             ylim(5, 11) +
             xlab("Developmental Stage")

Tmem176a <- ggplot(filter(twoGenes, gene == "Tmem176a"), 
                 aes(x = dev_stage, y = expression)) + 
             geom_jitter(width = 0, alpha = 0.7) +
             labs(title = "Tmem176a") +
             ylim(5, 11) +
             xlab("Developmental Stage") 

grid.arrange(Klf9 + stat_summary(aes(group=1), fun = mean, geom="line", colour="grey"), 
             Tmem176a + stat_summary(aes(group=1), fun = mean, geom="line", colour="grey") + ylab(""), 
             Klf9_C + stat_summary(aes(group=1), fun = mean, geom="line", colour="grey"), 
             Tmem176a_C + stat_summary(aes(group=1), fun = mean, geom="line", colour="grey") + ylab(""), 
             nrow = 2)

Linear model with age as continuous covariate

Code
grid.arrange(Klf9_C + stat_summary(aes(group=1), fun = mean, geom="line", colour="grey"), 
             Tmem176a_C + stat_summary(aes(group=1), fun = mean, geom="line", colour="grey") + ylab(""), 
             nrow = 1)
  • A linear fit looks reasonable for gene Tmem176a, but not so much for Klf9

  • For now, assume linear is reasonable

Simple Linear Regression (Matrix form)

\[ \mathbf{Y = X \boldsymbol\alpha + \boldsymbol\varepsilon}\]

For 1 continuous/quantitative covariate:

\[\mathbf{Y} = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \\ \end{bmatrix}, \hspace{1em} \mathbf{X} = \begin{bmatrix} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{n} \\ \end{bmatrix}, \hspace{1em} \boldsymbol\alpha = \begin{bmatrix} \alpha_{0} \\ \alpha_{1} \\ \end{bmatrix}, \hspace{1em} \boldsymbol\varepsilon=\begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \\ \end{bmatrix}\]

  • \(\alpha_0=\) the intercept (expected value of \(y\) when \(x\) is equal to zero)

  • \(\alpha_1=\) the slope (expected change in \(y\) for every one-unit increase in \(x\))

Simple Linear Regression (Matrix form)

\[ \mathbf{Y = X \boldsymbol\alpha + \boldsymbol\varepsilon}\]

Remember / convince yourself that the matrix algebra yields simple linear equations: \[\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \\ \end{bmatrix}=\begin{bmatrix} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{n} \\ \end{bmatrix}\begin{bmatrix} \alpha_{0} \\ \alpha_{1} \\ \end{bmatrix}+\begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \\ \end{bmatrix}=\begin{bmatrix} 1*\alpha_0 + x_{1}\alpha_{1} \\ 1*\alpha_0 + x_{2}\alpha_{1} \\ \vdots \\ 1*\alpha_0 + x_{n}\alpha_{1} \\ \end{bmatrix}+\begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \\ \end{bmatrix}\]

\[=\begin{bmatrix} \alpha_0 + x_{1}\alpha_{1} + \varepsilon_{1} \\ \alpha_0 + x_{2}\alpha_{1} + \varepsilon_{2}\\ \vdots \\ \alpha_0 + x_{n}\alpha_{1} + \varepsilon_{n} \\ \end{bmatrix}\] \[\Rightarrow y_i = \alpha_0 + x_i\alpha_1 + \varepsilon_i\]

SLR with continuous age covariate

Tmem_fit <- filter(twoGenes, gene == "Tmem176a") %>%
  lm(expression ~ age, data = .)
tidy(Tmem_fit)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   8.10     0.114       71.1  3.58e-41
2 age          -0.0614   0.00821     -7.47 6.74e- 9

Interpretation of intercept:

\(H_0: \alpha_0 = 0\) tests the null hypothesis that the intercept is zero - usually not of interest

SLR with continuous age covariate

tidy(Tmem_fit)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   8.10     0.114       71.1  3.58e-41
2 age          -0.0614   0.00821     -7.47 6.74e- 9

Interpretation of slope:

\(H_0: \alpha_1 = 0\) tests the null hypothesis that there is no association between gene expression and age - usually of interest

How do we estimate the intercept and slope?

Why is this the optimal line?

tidy(Tmem_fit)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   8.10     0.114       71.1  3.58e-41
2 age          -0.0614   0.00821     -7.47 6.74e- 9

Which one is the best line?

Ordinary Least Squares

  • Ordinary Least Squares (OLS) regression: parameter estimates minimize the sum of squared errors

  • Error: vertical \((y)\) distance between the (unobserved) true regression line and an observed data point

  • Residual: vertical \((y)\) distance between the fitted regression line and an observed data point (the estimated error)

OLS Estimator for Simple Linear Regression (1 covariate)

  • Mathematically: \(\varepsilon_i\) represents the error: \(\varepsilon_i = y_i - \alpha_0 - \alpha_1x_i, i = 1, ..., n\)

  • We want to find the line (i.e. intercept and slope) that minimizes the sum of squared errors: \[S(\alpha_0, \alpha_1)= \sum_{i=1}^n (y_i - \alpha_0 - \alpha_1 x_i)^2\]

    • \(S(\alpha_0, \alpha_1)\) is called an objective function

    • the observed sum of squared errors is referred to as Residual Sum of Squares (RSS)

Error vs Residual

  • Error refers to the deviation of the observed value to the (underlying) true quantity of interest

  • Residual refers to the difference between the observed and estimated quantity of interest

  • How to obtain estimates \((\hat{\alpha}_0, \hat{\alpha}_1)\)? Let’s look at a more general case

OLS for Multiple Linear Regression (p covariates)

\[S(\alpha_0, \alpha_1, \alpha_2, ..., \alpha_p) = \sum_{i=1}^n (y_i - \alpha_0 - \alpha_1 x_{1i} - \alpha_2 x_{2i} - ... - \alpha_p x_{pi})^2\] \[=(\mathbf{y}-\mathbf{X}\boldsymbol\alpha)^T(\mathbf{y}-\mathbf{X}\boldsymbol\alpha)\]

  • We need to find values of \(\boldsymbol\alpha=(\alpha_0, \alpha_1, ..., \alpha_p)\) that minimize the sum of squares \(S\)
  • Take partial derivatives with respect to each coeff, set to zero, solve system of equations:

\[\small \frac{\partial S}{\partial\boldsymbol\alpha}=\begin{bmatrix} \frac{\partial S}{\partial\alpha_0} \\ \frac{\partial S}{\partial\alpha_1} \\ \vdots\\ \frac{\partial S}{\partial\alpha_p} \\ \end{bmatrix}=\begin{bmatrix} 0 \\ 0 \\ \vdots\\ 0 \\ \end{bmatrix}\]

Sums of squares for Tmem176a

Fixing the intercept at its OLS estimate (8.10), let’s plot the RSS values for a range of possible slope values.

Code
rss <- function(intercept, slope, x, y, data){
  resid <- pull(data, y) - (intercept + slope*pull(data, x))
  return(sum(resid^2))
}

slopes = seq(-0.2, 0.2, len = 400)
results <- data.frame(slope = slopes,
                      RSS = sapply(slopes, 
                                   rss, 
                                   intercept = Tmem_fit$coefficients[1],
                                   x = "age", 
                                   y = "expression",
                                   data = filter(twoGenes, gene == "Tmem176a")))
best.i <- which.min(results$RSS)
ggplot(results, aes(slope, RSS)) + 
  geom_line() +
  geom_point(data = results[best.i,], colour = "red", size = 2)

OLS interactive demo

Launch demo

Explore

  1. In the first plot, drag individual points around and observe what changes in second plot

  2. In the second plot, adjust the slope and intercept dials - what happens to the total area of the squares in the second plot when you modify the slope and intercept from the default values?

    • Note that you can reset to default values by refreshing the page

Properties of OLS regression

Regression model: \(\mathbf{Y = X \boldsymbol\alpha + \boldsymbol\varepsilon}\)

OLS estimator: \(\hat{\boldsymbol\alpha} =(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

Fitted/predicted values: \(\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol\alpha} = \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}\)

where \(\mathbf{H}=\mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) is called the “hat matrix”
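As a quick check (a sketch assuming the tidyverse is loaded and reusing twoGenes and Tmem_fit from earlier), these matrix formulas reproduce what lm() computes:

# design matrix and response for the Tmem176a fit
tmem <- filter(twoGenes, gene == "Tmem176a")
X <- model.matrix(~ age, data = tmem)      # columns: intercept, age
y <- tmem$expression

# OLS estimator: (X'X)^{-1} X'y
alpha_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(alpha_hat, coef(Tmem_fit))           # identical to the lm() coefficients

# hat matrix and fitted values
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(as.vector(H %*% y), unname(fitted(Tmem_fit)))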

Assumptions of OLS Regression

  1. \(\boldsymbol\varepsilon\) have mean zero

  2. \(\boldsymbol\varepsilon\) are iid (implies constant variance)

  3. (only required for hypothesis testing in small sample settings) \(\boldsymbol\varepsilon\) are Normally distributed

Connection to other estimators

If \(\boldsymbol\varepsilon\) are iid Normal, then OLS estimator is also MLE (Maximum Likelihood Estimator)

Properties of OLS regression (cont’d)

Residuals: (note NOT the same as errors \(\boldsymbol\varepsilon\)) \[\hat{\boldsymbol\varepsilon} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol\alpha}\]

Estimated error variance: \[\hat{\sigma}^2 = \frac{1}{n-p} \hat{\boldsymbol\varepsilon}^T \hat{\boldsymbol\varepsilon}\]

Estimated covariance matrix of \(\hat{\boldsymbol\alpha}\): \[\hat{Var}(\hat{\boldsymbol\alpha}) = \hat{\sigma}^2(\mathbf{X}^T\mathbf{X})^{-1}\]

Estimated standard errors for estimated regression coefficients: \(\hat{se}(\hat{\alpha}_j)\), obtained by taking the square root of the diagonal elements of \(\hat{Var}(\hat{\boldsymbol\alpha})\)
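Continuing the sketch above (reusing X, y, and alpha_hat), these quantities can be computed by hand and compared against the std.error column of tidy(Tmem_fit):

n <- nrow(X); p <- ncol(X)

# residuals and estimated error variance
resid_hat <- y - X %*% alpha_hat
sigma2_hat <- as.numeric(t(resid_hat) %*% resid_hat) / (n - p)

# estimated covariance matrix of alpha-hat; SEs are the square roots of its diagonal
var_alpha_hat <- sigma2_hat * solve(t(X) %*% X)
sqrt(diag(var_alpha_hat))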

Inference in Regression (normal iid errors)

How to test \(H_0: \alpha_j = 0\)?

With a t-test!

Under \(H_0\),

\[\frac{\hat{\alpha}_j}{\hat{se}(\hat{\alpha}_j)} \sim t_{n-p}\]

So a p-value is obtained by computing a tail probability for the observed value of this t statistic from a \(t_{n-p}\) distribution
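For example, the p-value for the age coefficient in tidy(Tmem_fit) can be reproduced as a two-sided tail probability (a sketch reusing the quantities computed above):

# t statistic for the age coefficient and its two-sided p-value
t_age <- alpha_hat["age", 1] / sqrt(diag(var_alpha_hat))[2]
2 * pt(abs(t_age), df = n - p, lower.tail = FALSE)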

Inference - what if we don’t assume Normal errors?

How to test \(H_0: \alpha_j = 0\)?

Assuming large enough sample size, with a t-test!

Under \(H_0\), asymptotically (by CLT)

\[\frac{\hat{\alpha}_j}{\hat{se}(\hat{\alpha}_j)} \sim t_{n-p}\]

So with a large enough sample size, a p-value for this hypothesis test is obtained by computing a tail probability for the observed value of the t statistic from a \(t_{n-p}\) distribution

Diagnostics plots

Do our assumptions hold?

  • Constant variance
  • iid errors
  • Normality of errors
# requires package 'ggResidpanel'
ggResidpanel::resid_panel(Tmem_fit)

Linear regression

The nature of the regression function \(y=f(x|\boldsymbol\alpha)\) is one of the defining characteristics of a regression model

  1. If \(f\) is not linear in \(\boldsymbol\alpha \Rightarrow\) nonlinear model

    • For example, consider nonlinear parametric regression: \[ y_i = \frac{1}{1 + e^{\alpha_0 + \alpha_1 x_i}} + \varepsilon_i\]
  2. If \(f\) is linear in \(\boldsymbol\alpha \Rightarrow\) linear model

    • We just examined simple linear regression (a linear model): \(y_i = \alpha_0 + \alpha_1x_i + \varepsilon_i\)

    • What we could do instead: polynomial regression (also a linear model) \[y_i = \alpha_0 + \alpha_1x_i + \alpha_2x_i^2 + \varepsilon_i\]

Polynomial regression

oneGene <- toLongerMeta(eset) %>% 
  filter(gene %in% c("1427275_at")) %>%
  mutate(gene = "Smc4")
lm(expression ~ age + I(age^2), data = oneGene) %>% tidy()
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  8.48      0.161       52.7  1.11e-35
2 age         -0.147     0.0326      -4.52 6.52e- 5
3 I(age^2)     0.00501   0.00116      4.30 1.23e- 4
Code
Smc4 <- ggplot(oneGene, aes(x = age, y = expression)) + 
  geom_point(alpha = 0.7) +
  labs(title = "Smc4") +
  theme(legend.position = "none") +
  ylim(6, 11) +
  xlab("age (days)") + 
  stat_smooth(method="lm", se=FALSE, fill=NA, 
              formula=y ~ poly(x, 2),colour="blue")

Smc4

Important

Note that this is still a linear model, because it is linear in the \(\alpha_j\).

Hint: think about what the design matrix looks like in this case.
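For instance (a sketch using the oneGene data above), the design matrix simply gains a column for the squared ages, so the model is still linear in the \(\alpha_j\):

# columns: intercept, age, age^2 -- each coefficient still enters linearly
model.matrix(~ age + I(age^2), data = oneGene) %>% head()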

Putting it all together (continuous + categorical variables)

Code
Tmem176a_B <- ggplot(filter(twoGenes, gene=="Tmem176a"), 
                  aes(x = age, y = expression, colour = genotype)) + 
             geom_point(alpha = 0.7) +
             labs(title = "Tmem176a") +
             ylim(5, 10) +
             xlab("age (days)") + 
             stat_smooth(method="lm", se=FALSE)
Tmem176a_B 

Interaction between continuous and categorical variables

lm(expression ~ age*genotype, data = filter(twoGenes, gene=="Tmem176a")) %>% 
  tidy()
# A tibble: 4 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        8.03       0.157     51.3   1.57e-34
2 age               -0.0665     0.0114    -5.82  1.33e- 6
3 genotypeNrlKO      0.142      0.228      0.623 5.37e- 1
4 age:genotypeNrlKO  0.00987    0.0164     0.600 5.52e- 1

(Intercept): Intercept of WT (reference) line

age: slope of WT (reference) line

genotypeNrlKO: difference in intercepts (KO vs WT)

age:genotypeNrlKO: difference in slopes (KO vs WT)

Interpreting the Intercept

Important

Intercept terms refer to the estimates when the continuous covariate is equal to zero. This is not usually very interesting on its own

Interaction between continuous and categorical variables

\[y_{ij} = \alpha_{0} + \tau_{KO}x_{ij, KO} + \tau_{Age} x_{ij, Age} + \tau_{KO:Age} x_{ij, KO}x_{ij, Age} + \varepsilon_{ij}\]

where

  • \(j \in \{ WT, NrlKO\}\), \(i = 1,2,...,n_j\)
  • \(x_{ij, KO}\) is the indicator variable for WT vs KO ( \(x_{ij, KO}=1\) for \(j=NrlKO\) and 0 for \(j=WT\) )
  • \(x_{ij, Age}\) is the continuous age covariate

Interpretation of parameters:

  • \(\alpha_0\) is the expected expression in WT for age = 0
  • The “intercept” for the knockouts is: \(\alpha_0 + \tau_{KO}\)
  • \(\tau_{Age}\) is the expected increase in expression in WT for every 1 day increase in age
  • The slope for the knockouts is: \(\tau_{Age} + \tau_{KO:Age}\)
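For example, the genotype-specific intercepts and slopes can be recovered directly from the fitted coefficients shown above (a sketch):

int_fit <- lm(expression ~ age*genotype, data = filter(twoGenes, gene == "Tmem176a"))
cf <- coef(int_fit)

cf["(Intercept)"]                          # WT intercept
cf["(Intercept)"] + cf["genotypeNrlKO"]    # NrlKO intercept
cf["age"]                                  # WT slope
cf["age"] + cf["age:genotypeNrlKO"]        # NrlKO slope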

Nested models

As always, you can assess the relevance of several terms at once (e.g. everything involving genotype) with an F-test:

Klf9dat <- filter(twoGenes, gene=="Klf9")
anova(lm(expression ~ age*genotype, data = Klf9dat),
      lm(expression ~ age, data = Klf9dat))
Analysis of Variance Table

Model 1: expression ~ age * genotype
Model 2: expression ~ age
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     35 45.948                           
2     37 45.984 -2 -0.036415 0.0139 0.9862

Conclusion

We don’t have evidence that genotype affects the intercept or the slope

F-tests in regression

  • Reduced model: expression ~ age, with \(p_{Red}=2\) parameters (df) and residual sum of squares \(RSS_{Red}\)

  • Full model: expression ~ age * genotype, with \(p_{Full}=4\) parameters (df) and residual sum of squares \(RSS_{Full}\)

Full: \(y_{ij} = \alpha_{0} + \tau_{KO}x_{ij, KO} + \tau_{Age} x_{ij, Age} + \tau_{KO:Age} x_{ij, KO}x_{ij, Age} + \varepsilon_{ij}\)

Reduced: \(y_{ij} = \alpha_{0} + \tau_{Age} x_{ij, Age} + \varepsilon_{ij}\)

Under \(H_0\): the reduced model explains the same amount of variation in the outcome as the full model,

\[F = \frac{\frac{RSS_{Red}-RSS_{Full}}{p_{Full}-p_{Red}}}{\frac{RSS_{Full}}{n-p_{Full}}} \sim \text{ } F_{p_{Full}-p_{Red},\text{ } n-{p_{Full}}}\]

A significant F-test means we reject the null; we have evidence that the full model explains significantly more variation in the outcome than the reduced model.
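To make this concrete, the F statistic and p-value from the anova() output on the previous slide can be recomputed by hand (a sketch using the rounded RSS values shown there):

# full model (age*genotype): p_Full = 4, so n - p_Full = 39 - 4 = 35
# reduced model (age):       p_Red  = 2, so n - p_Red  = 39 - 2 = 37
RSS_full <- 45.948; df_full <- 35
RSS_red  <- 45.984; df_red  <- 37

Fstat <- ((RSS_red - RSS_full) / (df_red - df_full)) / (RSS_full / df_full)
Fstat                                                 # ~0.014 (0.0139 before rounding)
pf(Fstat, df1 = df_red - df_full, df2 = df_full,
   lower.tail = FALSE)                                # ~0.986, matching Pr(>F)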

Collinearity & confounding

  • If there are problems in the experimental design, we may not be able to estimate effects of interest

Caution

The technical definition of collinearity is that a column of the design matrix can be obtained (or accurately approximated) as a linear combination of other columns.

You can think of this as one column (or variable) not containing unique information

  • This phenomenon can occur if our design is confounded

  • As an example, let’s pretend we know that all the E16 and P2 mice are female and the rest are male

Example: E16/P2 mice are female and the rest are male

Klf9dat_conf <- Klf9dat %>%
  mutate(sex = ifelse(dev_stage %in% c("E16", "P2"), "female", "male"))
table(Klf9dat_conf$sex, Klf9dat_conf$dev_stage)
        
         E16 P2 P6 P10 P28
  female   7  8  0   0   0
  male     0  0  8   8   8
(mm <- model.matrix( ~ sex + dev_stage, data = Klf9dat_conf))
   (Intercept) sexmale dev_stageP2 dev_stageP6 dev_stageP10 dev_stageP28
1            1       1           0           0            0            1
2            1       1           0           0            0            1
3            1       1           0           0            0            1
4            1       1           0           0            0            1
5            1       0           0           0            0            0
6            1       0           0           0            0            0
7            1       0           0           0            0            0
8            1       1           0           0            1            0
9            1       1           0           0            1            0
10           1       1           0           0            1            0
11           1       1           0           0            1            0
12           1       0           1           0            0            0
13           1       0           1           0            0            0
14           1       0           1           0            0            0
15           1       0           1           0            0            0
16           1       1           0           1            0            0
17           1       1           0           1            0            0
18           1       1           0           1            0            0
19           1       1           0           1            0            0
20           1       1           0           0            0            1
21           1       1           0           0            0            1
22           1       1           0           0            0            1
23           1       1           0           0            0            1
24           1       0           0           0            0            0
25           1       0           0           0            0            0
26           1       0           0           0            0            0
27           1       0           0           0            0            0
28           1       1           0           0            1            0
29           1       1           0           0            1            0
30           1       1           0           0            1            0
31           1       1           0           0            1            0
32           1       0           1           0            0            0
33           1       0           1           0            0            0
34           1       0           1           0            0            0
35           1       0           1           0            0            0
36           1       1           0           1            0            0
37           1       1           0           1            0            0
38           1       1           0           1            0            0
39           1       1           0           1            0            0
attr(,"assign")
[1] 0 1 2 2 2 2
attr(,"contrasts")
attr(,"contrasts")$sex
[1] "contr.treatment"

attr(,"contrasts")$dev_stage
[1] "contr.treatment"
# pull out column corresponding to male sex indicator variable
mm[,"sexmale"] 
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 1  1  1  1  0  0  0  1  1  1  1  0  0  0  0  1  1  1  1  1  1  1  1  0  0  0 
27 28 29 30 31 32 33 34 35 36 37 38 39 
 0  1  1  1  1  0  0  0  0  1  1  1  1 
# construct a linear combination of the dev_stage columns for the male-only timepoints (P6, P10, P28)
mm[,"dev_stageP6"] + mm[,"dev_stageP10"] + mm[,"dev_stageP28"]
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 1  1  1  1  0  0  0  1  1  1  1  0  0  0  0  1  1  1  1  1  1  1  1  0  0  0 
27 28 29 30 31 32 33 34 35 36 37 38 39 
 0  1  1  1  1  0  0  0  0  1  1  1  1 
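The collinearity can be verified directly: the sexmale column equals the sum of the three male-only dev_stage columns, so the design matrix is rank deficient (a quick sketch):

# 6 columns, but only 5 are linearly independent
ncol(mm)
qr(mm)$rank

# the male indicator is exactly a linear combination of other columns
all(mm[, "sexmale"] == mm[, "dev_stageP6"] + mm[, "dev_stageP10"] + mm[, "dev_stageP28"])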

lm can’t estimate all the parameters in this collinear/confounded design.

lm(expression ~ sex + dev_stage, data = Klf9dat_conf) %>%
  tidy()
# A tibble: 6 × 5
  term         estimate std.error statistic   p.value
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     7.14      0.240    29.7    6.66e-26
2 sexmale         2.85      0.329     8.64   4.25e-10
3 dev_stageP2    -0.815     0.329    -2.48   1.84e- 2
4 dev_stageP6    -1.20      0.318    -3.76   6.44e- 4
5 dev_stageP10   -0.225     0.318    -0.707  4.84e- 1
6 dev_stageP28   NA        NA        NA     NA       

Hidden confounding

Caution

Hidden confounders are variables with associations to both the outcome and predictor variable(s), but are not measured (or not included in the model).

They can cause spurious correlations, and illustrate why correlation does not imply causation.

Extreme example: trend reverses with/without confounder

image source: Simply Statistics

Linear regression summary

  • linear model framework is extremely general

  • one extreme (simple): two-sample common variance t-test

  • another extreme (flexible): a polynomial, potentially different for each level of some factor

    • dichotomous predictor? 👍

    • categorical predictor? 👍

    • quantitative predictor? 👍

    • various combinations of the above? 👍

  • Don’t be afraid to build models with more than 1 covariate

  • later, we’ll talk about extensions to discrete outcomes (e.g. dichotomous or counts) via generalized linear models

What about the other 45 thousand probesets??

eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 45101 features, 39 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM92610 GSM92611 ... GSM92648 (39 total)
  varLabels: title geo_accession ... age (40 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 16505381 
Annotation: GPL1261 

Linear regression of many genes

\[\Large \mathbf{Y}_g = \mathbf{X}_g \boldsymbol\alpha_g + \boldsymbol\varepsilon_g\]

  • The g in the subscript reminds us that we’ll be fitting a model like this for each gene g that we have measured for all samples

  • Most of the time, the design matrices \(\mathbf{X}_g\) are, in fact, the same for all \(g\). This means we can just use \(\mathbf{X}\)

  • Note this means the residual degrees of freedom are also the same for all \(g\) \[d_g = d = n - \text{dimension of } \boldsymbol\alpha = n-p\]

Linear regression of many genes (cont’d)

  • Data model: \(\large\mathbf{Y}_g = \mathbf{X} \boldsymbol\alpha_g + \boldsymbol\varepsilon_g\)

  • Unknown error variance: \(\large Var(\boldsymbol\varepsilon_g) = \sigma^2_g\)

  • Estimated error variance: \(\large\hat{\sigma}^2_g = s^2_g = \frac{1}{n-p}\hat{\boldsymbol\varepsilon_g}^T\hat{\boldsymbol\varepsilon_g}\)

  • Estimated variance of parameter estimates: \(\large\hat{Var}(\hat{\boldsymbol\alpha_g}) = s^2_g (\mathbf{X}^T\mathbf{X})^{-1} = s^2_g \mathbf{V}\)

    • \(\bf{V}\) is the “unscaled covariance” matrix, and is the same for all genes!

    • Estimated standard errors for estimated regression coefficients: \(\large\hat{se}(\hat{\alpha}_{jg})\) obtained by taking the square root of the \(j^{th}\) diagonal element of \(\hat{Var}(\hat{\boldsymbol\alpha}_g)\), which is \(s_g\sqrt{v_{jj}}\)

What’s the big deal?

So far, nothing is new - these are the “regular” t statistics for gene g and parameter j:

\[t_{gj} = \frac{\hat{\alpha}_{gj}}{s_g \sqrt{v_{jj}}} \sim t_{d} \text{ under } H_0\]

But there are so many of them!! 😲

Observed (i.e. empirical) issues with the “standard” t-test approach for assessing differential expression

Important

Some genes with very small p-values (i.e. large -log10 p-values) are not biologically meaningful (small effect size, e.g. fold change)

How do we end up with small p-values but subtle effects?

\[\large t_{gj} = \frac{\hat{\alpha}_{gj}}{\hat{se}(\hat{\alpha}_{gj})} = \frac{\hat{\alpha}_{gj}}{s_g \sqrt{v_{jj}}} \sim t_d \text{ under } H_0\]

  • Small variance estimate \(s_g\) leads to large t statistic \(\rightarrow\) small p-value
  • Recall: variance estimates from small sample sizes are very imprecise; by chance, some genes will badly under-estimate the true variance!
  • This has led to the development of specialized methodology for assessing genome-wide differential expression

Empirical Bayesian techniques: limma

Why use limma instead of regular t-tests?

  • Borrows information from all genes to get a better estimate of the variance (especially in smaller sample size settings)

  • Efficiently fits many regression models without replicating unnecessary calculations!

  • Arranges output in a convenient way to ease further analysis, visualization, and interpretation

How does Empirical Bayes work?

  • Empirical: observed

  • Bayesian: incorporate ‘prior’ information

  • Intuition: estimate prior information from data; shrink (nudge) all estimates toward the consensus

Shrinkage = borrowing information across all genes

Genome-wide OLS fits

  • Gene by gene:

    • lm(y ~ x, data = gene) for each gene

    • For example, using dplyr::group_modify and broom::tidy

  • All genes at once, using limma:

    • lmFit(object, design)

    • object matrix-like object with expression values for all genes

    • design is a specially formatted design matrix (more on this later)

    • Note that object can be a Bioconductor container, such as an ExpressionSet object

‘Industrial scale’ model fitting is good, because computations involving just the design matrix \(\mathbf{X}\) are not repeated tens of thousands of times unnecessarily:

  • OLS estimator: \[\hat{\boldsymbol\alpha}_g =(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}_g\]

  • Fitted/predicted values: \[\hat{\mathbf{y}}_g = \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}_g = \mathbf{H}\mathbf{y}_g\]

OLS of first 2000 genes, using lm gene by gene

allGenes <- toLongerMeta(eset) 
allGenes %>% head(10)
# A tibble: 10 × 6
   gene       sample_id expression dev_stage   age genotype
   <chr>      <chr>          <dbl> <fct>     <dbl> <fct>   
 1 1415670_at GSM92610        7.11 P28          28 NrlKO   
 2 1415670_at GSM92611        7.32 P28          28 NrlKO   
 3 1415670_at GSM92612        7.42 P28          28 NrlKO   
 4 1415670_at GSM92613        7.35 P28          28 NrlKO   
 5 1415670_at GSM92614        7.24 E16          -4 NrlKO   
 6 1415670_at GSM92615        7.34 E16          -4 NrlKO   
 7 1415670_at GSM92616        7.38 E16          -4 NrlKO   
 8 1415670_at GSM92617        7.22 P10          10 NrlKO   
 9 1415670_at GSM92618        7.22 P10          10 NrlKO   
10 1415670_at GSM92619        7.12 P10          10 NrlKO   
(t.limma <- system.time(lmfits <- allGenes %>%
    filter(gene %in% unique(allGenes$gene)[1:2000]) %>%
    group_by(gene) %>% 
    group_modify(~ tidy(lm(expression ~ age + genotype, 
                           data = .x))) %>%
    select(gene, term, estimate) %>%
    pivot_wider(names_from = term, 
                values_from = estimate)))
   user  system elapsed 
  6.369   0.062   6.447 
lmfits %>% head() 
# A tibble: 6 × 4
# Groups:   gene [6]
  gene         `(Intercept)`       age genotypeNrlKO
  <chr>                <dbl>     <dbl>         <dbl>
1 1415670_at            7.22  0.000623     -0.000286
2 1415671_at            9.32 -0.00184       0.145   
3 1415672_at            9.76 -0.00393      -0.0422  
4 1415673_at            8.40  0.00398      -0.0436  
5 1415674_a_at          8.52 -0.00598       0.0192  
6 1415675_at            9.67 -0.00642       0.133   

OLS of all genes at once, using limma:

(t.ols <- system.time( limmafits <- 
  lmFit(eset, model.matrix(~ age + genotype, data = pData(eset)))))
   user  system elapsed 
  0.152   0.039   0.190 
limmafits$coefficients %>% head()
             (Intercept)           age genotypeNrlKO
1415670_at      7.217851  0.0006225228 -0.0002861005
1415671_at      9.320083 -0.0018405479  0.1446053811
1415672_at      9.759959 -0.0039281143 -0.0421705559
1415673_at      8.404053  0.0039777804 -0.0436443351
1415674_a_at    8.517675 -0.0059840405  0.0192159017
1415675_at      9.665691 -0.0064185855  0.1330272055

So far, no shrinkage.

Tip

Avoiding repetitive calculations involving \(\mathbf{X}\) leads to ~930X faster computation per gene in this example (lmFit processed all 45,101 genes; the gene-by-gene loop covered only the first 2,000)

How can we better estimate the SE?

\[\large t_{gj} = \frac{\hat{\alpha}_{gj}}{\hat{se}(\hat{\alpha}_{gj})} = \frac{\hat{\alpha}_{gj}}{s_g \sqrt{v_{jj}}} \sim t_d \text{ under } H_0\]

Important

Under-estimated variance leads to overly large t statistic, which leads to artificially small p-value

Modeling in limma

limma assumes that for each gene \(g\)

\[\hat{\alpha}_{gj} \,|\,\alpha_{gj}, \sigma_g^2 \sim N(\alpha_{gj}, \sigma_g^2 v_{jj})\] \[s^2_g \,|\, \sigma_g^2 \sim \frac{\sigma_g^2}{d}\chi^2_d\] which are the same as the usual assumptions about ordinary \(t\)-statistics:

\[\large t_{gj} = \frac{\hat{\alpha}_{gj}}{\hat{se}(\hat{\alpha}_{gj})} = \frac{\hat{\alpha}_{gj}}{s_g \sqrt{v_{jj}}} \sim t_d \text{ under } H_0\]

So far, nothing new…

Modeling in limma - shrinkage

  • limma imposes a hierarchical model for how the gene-wise \(\alpha_{gj}\) and \(\sigma^2_g\) vary across genes

Important

Under the limma framework, we are no longer considering genes in isolation. We will leverage information across genes to obtain improved estimates.

  • this is done by assuming a prior distribution for those quantities

  • Prior distribution for gene-specific variances \(\sigma^2_g\): inverse \(\chi^2\) with mean \(s_0^2\), \(d_0\) df:

\[\frac{1}{\sigma^2_g} \sim \frac{1}{d_0s_0^2} \chi^2_{d_0}\]

  • this should feel a bit funny compared to previous lectures - \(\sigma^2_g\) is no longer a fixed quantity! (i.e. this is Bayesian)

How does this help us better estimate the variance?

  • The posterior distribution is an updated version of the observed likelihood based on incorporating the prior information

  • The posterior mean for gene-specific variance:

\[\tilde{s}^2_g = \frac{d_0s_0^2 + ds^2_g}{d_0 + d}\]

  • A weighted mean of the prior and the observed gene-specific variances:

\[\tilde{s}^2_g = \frac{d_0}{d_0 + d}s_0^2 + \frac{d}{d_0 + d}s^2_g\]

  • More simply: “shrinking” the observed gene-specific variance towards the “typical” variance implied by the prior
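A toy numeric sketch of this shrinkage (with made-up prior values for d0 and s0^2, not the ones limma would estimate from our data):

# hypothetical prior: typical variance s0^2 = 0.05 with d0 = 3 prior df
# residual df for the ~ age + genotype design: d = n - p = 39 - 3 = 36
d0 <- 3; s0_sq <- 0.05; d <- 36

# observed gene-wise variances: one suspiciously small, one typical, one large
s2_g <- c(0.001, 0.05, 0.4)

# posterior mean (moderated) variances: weighted average of prior and observed
(s2_tilde <- (d0 * s0_sq + d * s2_g) / (d0 + d))
# small variances are pulled up toward 0.05; large ones pulled down (modestly, since d >> d0)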

Illustration

Figure 5 from https://doi.org/10.1186/s12859-019-3248-9

Moderated t statistic

  • plug in this posterior mean estimate to obtain a ‘moderated’ t statistic:

\[\large \tilde{t}_{gj} = \frac{\hat{\alpha}_{gj}}{\tilde{s}_g \sqrt{v_{jj}}}\]

  • Under limma’s assumptions, we know the null distribution for \(\tilde{t}_{gj}\) under \(H_0\): \[\tilde{t}_{gj} \sim t_{d_0+d}\]

  • parameters from the prior \(d_0\) and \(s_0^2\) are estimated from the data

  • This is how limma is a hybrid of frequentist (t-statistic) and Bayesian (hierarchical model) approaches (i.e. empirical Bayes)

Side-by-side comparison of key quantities and results

  • Estimated gene-wise residual variance: OLS \({s}_g^2 = \frac{1}{n-p} \hat{\boldsymbol\varepsilon}_g^T \hat{\boldsymbol\varepsilon}_g\), limma \(\tilde{s}^2_g = \frac{d_0s_0^2 + ds^2_g}{d_0 + d}\)

  • t-statistic for \(H_0: \alpha_{gj} = 0\): OLS \({t}_{gj} = \frac{\hat{\alpha}_{gj}}{s_g \sqrt{v_{jj}}}\), limma \(\tilde{t}_{gj} = \frac{\hat{\alpha}_{gj}}{\tilde{s}_g \sqrt{v_{jj}}}\)

  • Distribution of t-statistic under \(H_0\): OLS \({t}_{gj} \sim t_{d}\), limma \(\tilde{t}_{gj} \sim t_{d_0+d}\)

Moderated vs traditional tests

  • moderated variances will be “shrunk” toward the typical gene-wise variance, relative to the raw sample residual variances
  • degrees of freedom for null distribution increase relative to default \((d \text{ vs } d_0 + d)\)
    • \(\rightarrow\) makes it closer to a standard normal
    • \(\rightarrow\) makes tail probabilities (p-values) smaller
    • \(\rightarrow\) easier to reject the null
  • overall, when all is well, limma will deliver statistical results that are more stable and more powerful

Preview: limma workflow
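As a preview, the standard limma workflow chains these pieces together: lmFit() for the gene-wise OLS fits, eBayes() for the variance shrinkage and moderated t statistics, and topTable() to rank genes. A hedged sketch (the coefficient name depends on the design):

library(limma)

design <- model.matrix(~ age + genotype, data = pData(eset))

fit  <- lmFit(eset, design)    # gene-wise OLS fits (no shrinkage yet)
efit <- eBayes(fit)            # shrink gene-wise variances; compute moderated t statistics

# estimated prior hyperparameters: d0 and s0^2
efit$df.prior
efit$s2.prior

# top genes for the genotype effect, ranked using moderated statistics
topTable(efit, coef = "genotypeNrlKO", number = 5)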