t-tests can be used to test the equality of 2 population means
ANOVA can be used to test the equality of more than 2 population means
Linear regression provides a general framework for modeling the relationship between a response variable and different types of explanatory variables
t-tests can be used to test the significance of individual coefficients
F-tests can be used to test the simultaneous significance of multiple coefficients (e.g. multiple levels of a single categorical factor, or multiple factors at once)
F-tests are used to compare nested models (overall effects or goodness of fit)
Learning objectives for today
Understand how linear regression represents continuous variables:
Be familiar with the intuition behind how the regression line is estimated (Ordinary Least Squares)
Interpret parameters in a multiple linear regression model with continuous and factor variables
Explain the motivation behind specialized regression models in high-dimensional settings
Specifically, the Empirical Bayes techniques in limma
What if we treat age as a continuous variable?
library(GEOquery)
library(tidyverse)   # dplyr/tidyr/forcats/ggplot2/tibble (assumed loaded; not shown in the original chunk)
library(gridExtra)   # for grid.arrange (assumed loaded; not shown in the original chunk)
library(broom)       # for tidy(), used in later chunks (assumed loaded; not shown in the original chunk)

# read in our dataset
eset <- getGEO("GSE4051", getGPL = FALSE)[[1]]

# recode time points
pData(eset) <- pData(eset) %>%
  mutate(sample_id = geo_accession) %>%
  mutate(dev_stage = case_when(
    grepl("E16", title) ~ "E16",
    grepl("P2", title) ~ "P2",
    grepl("P6", title) ~ "P6",
    grepl("P10", title) ~ "P10",
    grepl("4 weeks", title) ~ "P28"
  )) %>%
  mutate(genotype = case_when(
    grepl("Nrl-ko", title) ~ "NrlKO",
    grepl("wt", title) ~ "WT"
  ))

# reorder factor levels, add continuous age variable
pData(eset) <- pData(eset) %>%
  mutate(dev_stage = fct_relevel(dev_stage, "E16", "P2", "P6", "P10", "P28")) %>%
  mutate(genotype = as.factor(genotype)) %>%
  mutate(genotype = fct_relevel(genotype, "WT", "NrlKO")) %>%
  mutate(age = ifelse(dev_stage == "E16", -4,
               ifelse(dev_stage == "P2", 2,
               ifelse(dev_stage == "P6", 6,
               ifelse(dev_stage == "P10", 10, 28)))))

# function to return tidy data that merges expression matrix and metadata
toLongerMeta <- function(expset) {
  stopifnot(class(expset) == "ExpressionSet")
  expressionMatrix <- exprs(expset) %>%
    as.data.frame() %>%
    rownames_to_column("gene") %>%
    pivot_longer(cols = !gene, values_to = "expression", names_to = "sample_id") %>%
    left_join(pData(expset) %>% select(sample_id, dev_stage, age, genotype),
              by = "sample_id")
  return(expressionMatrix)
}

# pull out two genes of interest
twoGenes <- toLongerMeta(eset) %>%
  filter(gene %in% c("1456341_a_at", "1441811_x_at")) %>%
  mutate(gene = ifelse(gene == "1456341_a_at", "Klf9", "Tmem176a"))

# make some plots - first with age continuous
Klf9_C <- ggplot(filter(twoGenes, gene == "Klf9"), aes(x = age, y = expression)) +
  geom_point(alpha = 0.7) +
  labs(title = "Klf9") +
  theme(legend.position = "none") +
  ylim(5, 11) +
  xlab("age (days)")

Tmem176a_C <- ggplot(filter(twoGenes, gene == "Tmem176a"), aes(x = age, y = expression)) +
  geom_point(alpha = 0.7) +
  labs(title = "Tmem176a") +
  ylim(5, 11) +
  xlab("age (days)")

# next with age categorical
Klf9 <- ggplot(filter(twoGenes, gene == "Klf9"), aes(x = dev_stage, y = expression)) +
  geom_jitter(width = 0, alpha = 0.7) +
  theme(legend.position = "none") +
  labs(title = "Klf9") +
  ylim(5, 11) +
  xlab("Developmental Stage")

Tmem176a <- ggplot(filter(twoGenes, gene == "Tmem176a"), aes(x = dev_stage, y = expression)) +
  geom_jitter(width = 0, alpha = 0.7) +
  labs(title = "Tmem176a") +
  ylim(5, 11) +
  xlab("Developmental Stage")

grid.arrange(
  Klf9 + stat_summary(aes(group = 1), fun = mean, geom = "line", colour = "grey"),
  Tmem176a + stat_summary(aes(group = 1), fun = mean, geom = "line", colour = "grey") + ylab(""),
  Klf9_C + stat_summary(aes(group = 1), fun = mean, geom = "line", colour = "grey"),
  Tmem176a_C + stat_summary(aes(group = 1), fun = mean, geom = "line", colour = "grey") + ylab(""),
  nrow = 2
)
Linear model with age as continuous covariate
grid.arrange(
  Klf9_C + stat_summary(aes(group = 1), fun = mean, geom = "line", colour = "grey"),
  Tmem176a_C + stat_summary(aes(group = 1), fun = mean, geom = "line", colour = "grey") + ylab(""),
  nrow = 1
)
A linear fit looks reasonable for Tmem176a, but not so much for Klf9
For now, assume linear is reasonable
Simple Linear Regression (Matrix form)
\[ \mathbf{Y = X \boldsymbol\alpha + \boldsymbol\varepsilon}\]
Tmem_fit <- filter(twoGenes, gene == "Tmem176a") %>%
  lm(expression ~ age, data = .)
tidy(Tmem_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 8.10 0.114 71.1 3.58e-41
2 age -0.0614 0.00821 -7.47 6.74e- 9
Interpretation of intercept:
\(H_0: \alpha_0 = 0\) tests the null hypothesis that the intercept is zero - usually not of interest
SLR with continuous age covariate
tidy(Tmem_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 8.10 0.114 71.1 3.58e-41
2 age -0.0614 0.00821 -7.47 6.74e- 9
Interpretation of slope:
\(H_0: \alpha_1 = 0\) tests the null hypothesis that there is no association between gene expression and age - usually of interest
How do we estimate the intercept and slope?
Why is this the optimal line?
tidy(Tmem_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 8.10 0.114 71.1 3.58e-41
2 age -0.0614 0.00821 -7.47 6.74e- 9
Which one is the best line?
Ordinary Least Squares
Ordinary Least Squares (OLS) regression: parameter estimates minimize the sum of squared errors
Error: vertical \((y)\) distance between the true regression line (unobserved) and the real observation
Residual: vertical \((y)\) distance between the fitted regression line and the real observation (estimated error)
OLS Estimator for Simple Linear Regression (1 covariate)
Mathematically: \(\varepsilon_i\) represents the error: \(\varepsilon_i = y_i - \alpha_0 - \alpha_1x_i, i = 1, ..., n\)
We want to find the line (i.e. intercept and slope) that minimizes the sum of squared errors: \[S(\alpha_0, \alpha_1)= \sum_{i=1}^n (y_i - \alpha_0 - \alpha_1 x_i)^2\]
\(S(\alpha_0, \alpha_1)\) is called an objective function
the minimized value of \(S\), i.e. the sum of squared residuals from the fitted line, is referred to as the Residual Sum of Squares (RSS)
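To make the objective function concrete, here is a minimal sketch (assuming the twoGenes data frame created in the earlier chunk) that evaluates the standard closed-form minimizers of \(S(\alpha_0, \alpha_1)\) and checks them against lm():

# closed-form OLS estimates for expression ~ age, Tmem176a
tm <- filter(twoGenes, gene == "Tmem176a")
a1_hat <- with(tm, sum((age - mean(age)) * (expression - mean(expression))) /
                   sum((age - mean(age))^2))               # slope
a0_hat <- with(tm, mean(expression) - a1_hat * mean(age))  # intercept
c(intercept = a0_hat, slope = a1_hat)
coef(lm(expression ~ age, data = tm))   # should agree (about 8.10 and -0.061)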
Error vs Residual
Error refers to the deviation of the observed value to the (underlying) true quantity of interest
Residual refers to the difference between the observed and estimated quantity of interest
How to obtain estimates \((\hat{\alpha}_0, \hat{\alpha}_1)\) ? Let’s look at a more general case
In the first plot, drag individual points around and observe what changes in second plot
In the second plot, adjust the slope and intercept dials - what happens to the total area of the squares in the second plot when you modify the slope and intercept from the default values?
Note that you can reset to default values by refreshing the page
Properties of OLS regression
Regression model: \(\mathbf{Y = X \boldsymbol\alpha + \boldsymbol\varepsilon}\)
OLS estimator: \(\hat{\boldsymbol\alpha} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\); fitted values: \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\alpha} = \mathbf{H}\mathbf{y}\), where \(\mathbf{H}=\mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) is called the “hat matrix”
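As a minimal sketch (again assuming twoGenes and Tmem_fit from the earlier chunks), the matrix-form estimator and the hat matrix can be computed directly and checked against lm():

# matrix-form OLS: alpha-hat = (X'X)^{-1} X'y, fitted values y-hat = H y
X <- model.matrix(~ age, data = filter(twoGenes, gene == "Tmem176a"))
y <- filter(twoGenes, gene == "Tmem176a")$expression
alpha_hat <- solve(t(X) %*% X) %*% t(X) %*% y
H <- X %*% solve(t(X) %*% X) %*% t(X)       # hat matrix: "puts the hat on y"
y_hat <- H %*% y
all.equal(as.vector(alpha_hat), unname(coef(Tmem_fit)))   # should be TRUE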
Assumptions of OLS Regression
\(\boldsymbol\varepsilon\) have mean zero
\(\boldsymbol\varepsilon\) are iid (implies constant variance)
\(\boldsymbol\varepsilon\) are Normally distributed (only required for hypothesis testing in small sample settings)
Connection to other estimators
If \(\boldsymbol\varepsilon\) are iid Normal, then OLS estimator is also MLE (Maximum Likelihood Estimator)
Properties of OLS regression (cont’d)
Residuals: (note NOT the same as errors \(\boldsymbol\varepsilon\)) \[\hat{\boldsymbol\varepsilon} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol\alpha}\]
Estimated standard errors for estimated regression coefficients: \(\hat{se}(\hat{\alpha}_j)\), obtained by taking the square root of the diagonal elements of \(\hat{Var}(\hat{\boldsymbol\alpha})\)
A p-value for this hypothesis test is obtained by computing a tail probability for the observed value of the t-statistic \(\hat{\alpha}_j / \hat{se}(\hat{\alpha}_j)\) from a \(t_{n-p}\) distribution (exactly under Normal errors, and approximately with a large enough sample size)
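A minimal sketch of these quantities for the Tmem176a fit (assuming Tmem_fit from the earlier chunk); the hand computation should reproduce the std.error, statistic, and p.value columns of tidy(Tmem_fit):

# residual variance s^2, standard errors, t statistics, and p-values by hand
X <- model.matrix(Tmem_fit)
n <- nrow(X); p <- ncol(X)
res <- residuals(Tmem_fit)                 # residuals (estimated errors)
s2 <- sum(res^2) / (n - p)                 # estimate of the error variance
var_alpha <- s2 * solve(t(X) %*% X)        # estimated Var(alpha-hat)
se <- sqrt(diag(var_alpha))                # estimated standard errors
t_stat <- coef(Tmem_fit) / se              # t statistics
p_val <- 2 * pt(abs(t_stat), df = n - p, lower.tail = FALSE)
cbind(se, t_stat, p_val)                   # compare with tidy(Tmem_fit)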
\(x_{ij, KO}\) is the indicator variable for WT vs KO ( \(x_{ij, KO}=1\) for \(j=NrlKO\) and 0 for \(j=WT\) )
\(x_{ij, Age}\) is the continuous age covariate
Interpretation of parameters:
\(\alpha_0\) is the expected expression in WT for age = 0
The “intercept” for the knockouts is: \(\alpha_0 + \tau_{KO}\)
\(\tau_{Age}\) is the expected increase in expression in WT for every 1 day increase in age
The slope for the knockouts is: \(\tau_{Age} + \tau_{KO:Age}\)
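As a sketch of how these parameters map onto an R model fit (using the Klf9 data from the earlier chunk; the coefficient labels are the ones R produces given the factor coding above):

# interaction model: genotype-specific intercept and slope
int_fit <- lm(expression ~ genotype * age, data = filter(twoGenes, gene == "Klf9"))
tidy(int_fit)
# (Intercept)       -> alpha_0      (WT expression at age = 0)
# genotypeNrlKO     -> tau_KO       (intercept offset for knockouts)
# age               -> tau_Age      (WT slope per day of age)
# genotypeNrlKO:age -> tau_KO:Age   (slope offset for knockouts)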
Nested models
As always, you can assess the relevance of several terms at once (e.g. everything involving genotype) with an F-test:
Klf9dat <- filter(twoGenes, gene == "Klf9")
anova(lm(expression ~ age * genotype, data = Klf9dat),
      lm(expression ~ age, data = Klf9dat))
Analysis of Variance Table
Model 1: expression ~ age * genotype
Model 2: expression ~ age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 35 45.948
2 37 45.984 -2 -0.036415 0.0139 0.9862
Conclusion
We don’t have evidence that genotype affects the intercept or the slope
Under \(H_0\), the reduced model explains the same amount of variation in the outcome as the full model, and
\[F = \frac{\frac{RSS_{Red}-RSS_{Full}}{p_{Full}-p_{Red}}}{\frac{RSS_{Full}}{n-p_{Full}}} \sim \text{ } F_{p_{Full}-p_{Red},\text{ } n-p_{Full}}\] A significant F-test means we reject the null; we have evidence that the full model explains significantly more variation in the outcome than the reduced model.
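To connect this formula to the anova() output above, here is a minimal sketch that computes the F statistic by hand from the two fits (assuming Klf9dat from the previous chunk):

# F statistic for the nested comparison, computed from the RSS of each fit
full <- lm(expression ~ age * genotype, data = Klf9dat)   # p_Full = 4 coefficients
red  <- lm(expression ~ age, data = Klf9dat)              # p_Red  = 2 coefficients
rss_full <- sum(residuals(full)^2)
rss_red  <- sum(residuals(red)^2)
df1 <- length(coef(full)) - length(coef(red))   # p_Full - p_Red
df2 <- nrow(Klf9dat) - length(coef(full))       # n - p_Full
F_stat <- ((rss_red - rss_full) / df1) / (rss_full / df2)
c(F = F_stat, p = pf(F_stat, df1, df2, lower.tail = FALSE))  # matches anova: ~0.014, ~0.99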
Collinearity & confounding
If there are problems in the experimental design, we may not be able to estimate effects of interest
Caution
The technical definition of collinearity is that a column of the design matrix can be obtained (or accurately approximated) as a linear combination of other columns.
You can think of this as one column (or variable) not containing unique information
This phenomenon can occur if our design is confounded
As an example, let’s pretend we know that all the E16 and P2 mice are female and the rest are male
Example: E16/P2 mice are female and the rest are male
# construct linear combination from columns corresponding to male timepoint indicator variables
mm[, "dev_stageP6"] + mm[, "dev_stageP10"] + mm[, "dev_stageP28"]
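The design matrix mm is not constructed in this excerpt; one way it could be built for this hypothetical example (the sex variable here is invented purely for illustration) is sketched below, along with a check that the resulting design is rank deficient:

# hypothetical confounded design: sex is completely determined by dev_stage
meta <- pData(eset) %>%
  mutate(sex = ifelse(dev_stage %in% c("E16", "P2"), "female", "male"))
mm <- model.matrix(~ sex + dev_stage, data = meta)
# the male indicator equals the sum of the P6/P10/P28 indicators ...
all(mm[, "sexmale"] == mm[, "dev_stageP6"] + mm[, "dev_stageP10"] + mm[, "dev_stageP28"])
# ... so one column is redundant and the design matrix is not full rank
qr(mm)$rank < ncol(mm)   # TRUE indicates collinearity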
\(\mathbf{V} = (\mathbf{X}^T\mathbf{X})^{-1}\) is the “unscaled covariance” matrix, and it is the same for all genes!
Estimated standard errors for estimated regression coefficients: \(\large\hat{se}(\hat{\alpha}_{jg})\) obtained by taking the square root of the \(j^{th}\) diagonal element of \(\hat{Var}(\hat{\boldsymbol\alpha}_g)\), which is \(s_g\sqrt{v_{jj}}\)
What’s the big deal?
So far, nothing is new - these are the “regular” t statistics for gene \(g\) and parameter \(j\):
\[t_{gj} = \frac{\hat{\alpha}_{gj}}{s_g \sqrt{v_{jj}}} \sim t_{d} \text{ under } H_0\]
But there are so many of them!! 😲
Observed (i.e. empirical) issues with the “standard” t-test approach for assessing differential expression
Important
Some genes with very small p-values (i.e. large -log10 p-values) are not biologically meaningful (small effect size, e.g. fold change)
How do we end up with small p-values but subtle effects?
Borrows information from all genes to get a better estimate of the variance (especially in smaller sample size settings)
Efficiently fits many regression models without replicating unnecessary calculations!
Arranges output in a convenient way to ease further analysis, visualization, and interpretation
How does Empirical Bayes work?
Empirical: observed
Bayesian: incorporate ‘prior’ information
Intuition: estimate prior information from data; shrink (nudge) all estimates toward the consensus
Shrinkage = borrowing information across all genes
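As a minimal illustration of this shrinkage idea (using limma's squeezeVar() on toy numbers invented here, not the course data): gene-wise sample variances are squeezed toward a common prior value estimated from all genes.

library(limma)
set.seed(1)
s2 <- rchisq(1000, df = 4) / 4      # toy gene-wise sample variances, each with 4 df
sq <- squeezeVar(s2, df = 4)        # estimate the prior, then shrink each variance
sq$var.prior                        # the consensus ("prior") variance
head(cbind(raw = s2, moderated = sq$var.post))   # extreme variances pulled toward the prior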
Genome-wide OLS fits
Gene by gene:
lm(y ~ x, data = gene) for each gene
For example, using dplyr::group_modify and broom::tidy
All genes at once, using limma:
lmFit(object, design)
object matrix-like object with expression values for all genes
design is a specially formatted design matrix (more on this later)
Note that object can be a Bioconductor container, such as an ExpressionSet object
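A sketch of the two approaches (assuming eset and toLongerMeta() from the earlier chunk; the gene-by-gene loop is shown for just two genes because looping over every probe is slow):

# gene-by-gene with dplyr + broom
toLongerMeta(eset) %>%
  filter(gene %in% c("1456341_a_at", "1441811_x_at")) %>%
  group_by(gene) %>%
  group_modify(~ tidy(lm(expression ~ age, data = .x)))

# all genes at once with limma
library(limma)
design <- model.matrix(~ age, data = pData(eset))
fit <- lmFit(eset, design)   # eset is an ExpressionSet, which lmFit accepts directly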
‘Industrial scale’ model fitting is good, because computations involving just the design matrix \(\mathbf{X}\) are not repeated tens of thousands of times unnecessarily; for example, the decomposition of \(\mathbf{X}\) used to solve the least squares problem is computed only once and reused for every gene.
An under-estimated variance leads to an overly large t statistic, which in turn leads to an artificially small p-value
Modeling in limma
limma assumes that for each gene \(g\)
\[\hat{\alpha}_{gj} \,|\,\alpha_{gj}, \sigma_g^2 \sim N(\alpha_{gj}, \sigma_g^2 v_{jj})\]
\[s^2_g \,|\, \sigma_g^2 \sim \frac{\sigma_g^2}{d}\chi^2_d\]
which are the same as the usual assumptions about ordinary \(t\)-statistics.
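Under those assumptions, a minimal sketch of the downstream limma workflow (continuing from the hypothetical lmFit call sketched above; the coefficient name "age" comes from that design):

fit <- eBayes(fit)                        # empirical Bayes shrinkage of the gene-wise variances
topTable(fit, coef = "age", number = 5)   # moderated t statistics and adjusted p-values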