More limma & multiple testing

Keegan Korthauer

February 6, 2025

Learning objectives

  1. Use limma to perform genome-wide differential expression testing on microarray data

  2. Understand the key advantages of limma over lm (standard linear regression)

  3. Explain why multiple testing increases the number of errors we make by chance

  4. Be able to adjust for multiple comparisons by controlling the False Discovery Rate

    • e.g. using Benjamini-Hochberg or Storey’s q-value

Recall: The hybrid estimator in limma

\[\tilde{s}^2_g = \frac{d_0}{d_0 + d}s_0^2 + \frac{d}{d_0 + d}s^2_g\]

Recall that \((s_0, d_0)\) are the prior parameters for \(\sigma^2_g\) (random variable):

\[\frac{1}{\sigma^2_g} \sim \frac{1}{d_0s_0^2} \chi^2_{d_0}\]

The prior parameters incorporate information from all genes which allows us to shrink/nudge the gene-specific variances toward a common consensus

  • they are estimated from the data - the formulas for \((s_0, d_0)\) and their derivations are beyond the scope of the course, but limma takes care of the details for us

  • note that \((s_0, d_0)\) do not depend on \(g\)
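
To make the weighting concrete, here is a minimal R sketch of the shrinkage formula (the numbers below are hypothetical; limma estimates \((s_0^2, d_0)\) for you):

d0   <- 3          # hypothetical prior degrees of freedom
s0_2 <- 0.05       # hypothetical prior variance
d    <- 4          # residual degrees of freedom (n - p - 1)
s2_g <- c(0.002, 0.05, 0.8)   # hypothetical gene-wise residual variances

# hybrid estimator: weighted average of prior and gene-specific variances
s2_tilde <- (d0 * s0_2 + d * s2_g) / (d0 + d)
s2_tilde   # each estimate is pulled toward s0^2 = 0.05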

Illustration

Figure 5 from https://doi.org/10.1186/s12859-019-3248-9

Moderated t statistic

  • plug in this posterior mean estimate to obtain a ‘moderated’ t statistic:

\[\large \tilde{t}_{gj} = \frac{\hat{\alpha}_{gj}}{\tilde{s}_g \sqrt{v_{jj}}}\]

  • Under limma’s assumptions, we know the null distribution for \(\tilde{t}_{gj}\) under \(H_0\): \[\tilde{t}_{gj} \sim t_{d_0+d}\]

  • parameters from the prior \(d_0\) and \(s_0^2\) are estimated from the data

  • This is how limma is a hybrid of frequentist (t-statistic) and Bayesian (hierarchical model) approaches (i.e. empirical Bayes)
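
As a quick sketch (all numbers hypothetical), the moderated t statistic and its p-value can be computed directly once the moderated variance is in hand:

alpha_hat <- 1.2            # hypothetical estimated coefficient for gene g, term j
v_jj      <- 0.25           # hypothetical diagonal element of (X^T X)^{-1}
s_tilde   <- sqrt(0.06)     # moderated residual SD from the shrinkage formula
d0 <- 3; d <- 4

t_mod <- alpha_hat / (s_tilde * sqrt(v_jj))
p_mod <- 2 * pt(-abs(t_mod), df = d0 + d)   # note the extra d0 degrees of freedom
c(t = t_mod, p = p_mod)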

Side-by-side comparison of key quantities and results

Quantity OLS limma
Estimated gene-wise residual variance: \({s}_g^2 = \frac{1}{d} \hat{\boldsymbol\varepsilon}^T \hat{\boldsymbol\varepsilon}\) \(\tilde{s}^2_g = \frac{d_0s_0^2 + ds^2_g}{d_0 + d}\)
t-statistic for \(H_0: \alpha_{gj} = 0\): \({t}_{gj} = \frac{\hat{\alpha}_{gj}}{s_g \sqrt{v_{jj}}}\) \(\tilde{t}_{gj} = \frac{\hat{\alpha}_{gj}}{\tilde{s}_g \sqrt{v_{jj}}}\)
Distribution of t-statistic under \(H_0\): \({t}_{gj} \sim t_{d}\) \(\tilde{t}_{gj} \sim t_{d_0+d}\)

Moderated (limma) vs traditional (lm) tests

  • moderated variances are “shrunk” toward the typical gene-wise variance, relative to the raw sample residual variances
  • degrees of freedom for the null distribution increase relative to lm \((d_0 + d \text{ vs } d)\)
    • \(\rightarrow\) makes the null distribution closer to a standard normal
    • \(\rightarrow\) makes tail probabilities (p-values) smaller
    • \(\rightarrow\) easier to reject the null
  • overall, when all is well, limma delivers statistical results that are more stable and more powerful

Overview: limma workflow

Functions that make your life easier

Function Description
model.matrix Takes in your data frame and makes a design matrix
limma::lmFit Fits the linear model to each gene separately – replace gene with “feature” depending on your data (‘industrial scale’ lm)
limma::eBayes Use output of linear regression to compute moderated t statistics
limma::topTable Query your results; sort your p-values; sort genes; Adjust for multiple comparisons
limma::decideTests Identify which genes are significantly differentially expressed for each contrast
limma::makeContrasts Create the contrast matrix \(C\) that you desire
limma::contrasts.fit Apply a contrast to your estimates
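
For instance, a hedged sketch of the contrast workflow, using the dataset loaded later in these slides (the cell-means design and the P28-vs-E16 contrast are chosen just for illustration):

design <- model.matrix(~ 0 + dev_stage, data = pData(eset))  # cell-means coding
colnames(design) <- levels(pData(eset)$dev_stage)

cont <- makeContrasts(P28vsE16 = P28 - E16, levels = design)
fit  <- lmFit(eset, design)
fit2 <- eBayes(contrasts.fit(fit, cont))
topTable(fit2, coef = "P28vsE16")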

Getting help

limma step one: lmFit

\[\mathbf{Y}_g = \mathbf{X} \boldsymbol\alpha_g + \boldsymbol\varepsilon_g,\]

  • Within each gene, observations are assumed independent and identically distributed, with constant variance

  • lmFit() carries out multiple linear regression on each gene

  • Usage: lmFit(object, design)

    • object is a data.frame or matrix with features in rows and samples in columns (G genes by N samples), or other Bioconductor object such as ExpressionSet

    • design is a design matrix (output of model.matrix(y ~ x); N samples by \(p+1\) parameters)
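
A minimal toy call on simulated data (all names here are made up) shows the expected shapes:

set.seed(1)
mat <- matrix(rnorm(600), nrow = 100, ncol = 6)   # 100 "genes" x 6 samples
grp <- factor(rep(c("ctl", "trt"), each = 3))     # hypothetical grouping variable
fit <- limma::lmFit(mat, model.matrix(~ grp))
head(fit$coefficients)   # one (Intercept) and one grptrt estimate per gene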

Bioconductor’s ExpressionSet class

Let’s run limma for the interactive model with age (continuous) and genotype (factor)

\[\Large y_{ig} = \theta + \tau_{KO}x_{ig,KO} + \tau_{Age}x_{ig,Age} + \tau_{KO:Age}x_{ig,KO}x_{ig,Age} + \varepsilon_{ig}\]

  • \(i\) indexes mouse, \(g\) indexes genes

  • \(x_{ig,KO}\) is the indicator variable for the NrlKO group

  • \(x_{ig,Age}\) is the continuous age variable
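
To see the coding concretely, here is the design matrix model.matrix would build for a toy version of this metadata (two mice per genotype, ages chosen arbitrarily):

meta <- data.frame(
  genotype = factor(c("WT", "WT", "NrlKO", "NrlKO"), levels = c("WT", "NrlKO")),
  age      = c(-4, 28, -4, 28)
)
model.matrix(~ genotype * age, data = meta)
# columns: (Intercept), genotypeNrlKO, age, genotypeNrlKO:age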

Interactive model with age and genotype

Example gene (but we want to fit this model on all genes):

Code
# load required packages (GEOquery for getGEO; Biobase for pData/exprs;
# tidyverse for the wrangling and plotting below)
library(GEOquery)
library(Biobase)
library(tidyverse)

# read in our dataset
eset <- getGEO("GSE4051", getGPL = FALSE)[[1]]

# recode time points
pData(eset) <- pData(eset) %>%
  mutate(sample_id = geo_accession) %>%
  mutate(dev_stage =  case_when(
    grepl("E16", title) ~ "E16",
    grepl("P2", title) ~ "P2",
    grepl("P6", title) ~ "P6",
    grepl("P10", title) ~ "P10",
    grepl("4 weeks", title) ~ "P28"
  )) %>%
  mutate(genotype = case_when(
    grepl("Nrl-ko", title) ~ "NrlKO",
    grepl("wt", title) ~ "WT"
  ))

# reorder factor levels, add continuous age variable
pData(eset) <- pData(eset) %>%
  mutate(dev_stage = fct_relevel(dev_stage, "E16", "P2", "P6", "P10", "P28")) %>%
  mutate(genotype = as.factor(genotype)) %>%
  mutate(genotype = fct_relevel(genotype, "WT", "NrlKO")) %>%
  mutate(age = ifelse(dev_stage == "E16", -4,
                            ifelse(dev_stage == "P2", 2, 
                                   ifelse(dev_stage == "P6", 6, 
                                          ifelse(dev_stage == "P10", 10, 28)))))

# function to return tidy data that merges expression matrix and metadata
toLongerMeta <- function(expset) {
    stopifnot(class(expset) == "ExpressionSet")
    
    expressionMatrix <- exprs(expset) %>% 
      as.data.frame() %>%
      rownames_to_column("gene") %>%
      pivot_longer(cols = !gene, 
                   values_to = "expression",
                   names_to = "sample_id") %>%
      left_join(pData(expset) %>% select(sample_id, dev_stage, age, genotype),
                by = "sample_id")
    return(expressionMatrix)
}

# pull out gene of interest
Tmem_dat <- toLongerMeta(eset) %>% 
  filter(gene == "1441811_x_at") %>%
  mutate(gene = "Tmem176a") 

Tmem_plot <- ggplot(Tmem_dat, aes(x = age, y = expression, colour = genotype)) + 
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Tmem176a") +
  ylim(5, 11) +
  xlab("age (days)") 
Tmem_plot

Computation is fast

  • Bioconductor way:
system.time(gFit <- lmFit(eset, 
                          model.matrix(~ genotype*age, data = pData(eset))))
   user  system elapsed 
  0.040   0.003   0.176 
  • Equivalently, using the ‘separate’ way (where myDat is the expression matrix and myMeta is the sample metadata):
gFit <- lmFit(myDat, 
              model.matrix(~ genotype*age, data = myMeta))
  • Using lmFit to fit an interactive model on 45K probesets takes a fraction of a second
  • The time-intensive parts of an analysis: data wrangling, selecting a model, choosing how to parameterize it, and interpreting it

Output of Step 1 (lmFit)

summary(gFit)
                 Length Class  Mode     
coefficients     180404 -none- numeric  
rank                  1 -none- numeric  
assign                4 -none- numeric  
qr                    5 qr     list     
df.residual       45101 -none- numeric  
sigma             45101 -none- numeric  
cov.coefficients     16 -none- numeric  
stdev.unscaled   180404 -none- numeric  
pivot                 4 -none- numeric  
Amean             45101 -none- numeric  
method                1 -none- character
design              156 -none- numeric  
nrow(eset)
Features 
   45101 
nrow(eset)*4
Features 
  180404 
ncol(eset)*4
Samples 
    156 
  • OK… but where are the shrunken variance estimates?? How do I pull out p-values??

  • Actually, we haven’t carried out the empirical Bayesian computation yet – still need to run eBayes()!

Step 2: Moderated t-tests using eBayes

summary(gFit)
                 Length Class  Mode     
coefficients     180404 -none- numeric  
rank                  1 -none- numeric  
assign                4 -none- numeric  
qr                    5 qr     list     
df.residual       45101 -none- numeric  
sigma             45101 -none- numeric  
cov.coefficients     16 -none- numeric  
stdev.unscaled   180404 -none- numeric  
pivot                 4 -none- numeric  
Amean             45101 -none- numeric  
method                1 -none- character
design              156 -none- numeric  
summary(ebFit <- eBayes(gFit))
                 Length Class  Mode     
coefficients     180404 -none- numeric  
rank                  1 -none- numeric  
assign                4 -none- numeric  
qr                    5 qr     list     
df.residual       45101 -none- numeric  
sigma             45101 -none- numeric  
cov.coefficients     16 -none- numeric  
stdev.unscaled   180404 -none- numeric  
pivot                 4 -none- numeric  
Amean             45101 -none- numeric  
method                1 -none- character
design              156 -none- numeric  
df.prior              1 -none- numeric  
s2.prior              1 -none- numeric  
var.prior             4 -none- numeric  
proportion            1 -none- numeric  
s2.post           45101 -none- numeric  
t                180404 -none- numeric  
df.total          45101 -none- numeric  
p.value          180404 -none- numeric  
lods             180404 -none- numeric  
F                 45101 -none- numeric  
F.p.value         45101 -none- numeric  

Components of the empirical Bayes estimators

math description limma::eBayes dimension in lm?
\(s_g^2\) gene-specific residual variance sigma^2 45K \(\checkmark\)
\(d\) residual degrees of freedom \((n-p-1)\) df.residual Scalar \(\checkmark\)
\(s_0^2\) mean of inverse \(\chi^2\) prior for \(s_g^2\) s2.prior Scalar
\(d_0\) degrees of freedom for the prior df.prior Scalar
\(\tilde{s}_g^2\) posterior mean of \(s_g^2\) (moderated gene-specific residual variance) s2.post 45K
\(\tilde{t}_g\) moderated t statistics t 45K\(\times\)p

Step 3: topTable() to extract test output

topTable(fit, coef = NULL, number = 10, genelist = fit$genes, adjust.method = "BH",
         sort.by = "B", resort.by = NULL, p.value = 1, lfc = 0, confint = FALSE)
  • Refer to the help page (?topTable) for full details of each argument

  • Summary of key topTable arguments:

    • coef is the argument where you specify the coefficient(s) you want to test for equality with zero (default is NULL; must be specified)
    • number lets you control size of hit list (default is 10)
    • p.value lets you specify a minimum adjusted p-value cutoff (default is 1)
    • lfc lets you specify a minimum observed effect size - log2 fold change (default is 0)
    • sort.by and resort.by give control over the ordering (default is by “B”: log-odds that the gene is differentially expressed)
    • adjust.method specifies how/if to adjust p-values for multiple testing (default is BH)
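
For example, one plausible call combining these arguments (the thresholds here are arbitrary):

hits <- topTable(ebFit, coef = "genotypeNrlKO",
                 number = Inf,     # return all probes passing the filters
                 p.value = 0.05,   # cutoff applied to the BH-adjusted p-value
                 lfc = 1)          # require |log2 fold change| >= 1 (2-fold)
nrow(hits)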

topTable in action: genotypeNrlKO

topTable(ebFit, coef = "genotypeNrlKO")
                  logFC  AveExpr          t      P.Value    adj.P.Val        B
1450946_at   -4.7134891 8.733103 -15.583683 5.390931e-18 2.431364e-13 28.40938
1442070_at   -0.9669400 8.268122  -9.787407 6.896220e-12 1.555132e-07 16.46086
1443457_at   -1.1530057 6.049428  -9.091412 4.947029e-11 6.029428e-07 14.68843
1422679_s_at -2.1187394 9.495435  -9.002625 6.388954e-11 6.029428e-07 14.45703
1431708_a_at -2.1610625 8.591450  -8.967950 7.062007e-11 6.029428e-07 14.36634
1433050_at   -1.5083887 8.468891  -8.923932 8.021234e-11 6.029428e-07 14.25096
1426288_at   -4.2100657 9.323796  -8.772209 1.246491e-10 8.031144e-07 13.85106
1418108_at    0.8662091 8.260647   8.537984 2.475294e-10 1.395478e-06 13.22713
1450770_at    1.3103576 7.357033   8.128847 8.332369e-10 3.747976e-06 12.11870
1457802_at   -0.8362030 7.778189  -8.117112 8.629984e-10 3.747976e-06 12.08657
  • topTable(ebFit, coef = 2) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for genotypeNrlKO

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = "genotypeNrlKO", number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixGenotype <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixGenotype, 
       aes(x = age, y = expression, color = genotype)) + 
  geom_point() +
  xlab("age (days)") +
  facet_wrap( ~ gene) + 
  stat_smooth(method="lm", se=FALSE, cex=0.5)

topTable in action: age

topTable(ebFit, coef = "age")
                  logFC  AveExpr        t      P.Value    adj.P.Val        B
1451635_at   0.19294061 7.791237 20.78935 3.204717e-22 1.445359e-17 40.60768
1425222_x_at 0.17604111 6.514475 19.72633 1.962954e-21 4.426561e-17 38.79502
1433699_at   0.14875031 6.939973 17.80467 6.442530e-20 7.957605e-16 35.29412
1452243_at   0.15889179 6.840975 17.75658 7.057586e-20 7.957605e-16 35.20253
1435436_at   0.13067119 8.274676 17.61808 9.187120e-20 8.286966e-16 34.93763
1458418_at   0.21277027 6.270187 16.43686 9.326171e-19 7.006477e-15 32.60718
1424977_at   0.07814988 6.505464 16.36079 1.087456e-18 7.006477e-15 32.45260
1431174_at   0.16158579 7.418000 16.01901 2.183055e-18 1.230725e-14 31.75106
1421818_at   0.11250919 7.982278 15.66666 4.531133e-18 2.270651e-14 31.01562
1419069_at   0.09325234 8.222096 15.54775 5.813459e-18 2.621928e-14 30.76457
  • topTable(ebFit, coef = 3) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for age

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = "age", number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixAge <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixAge, 
       aes( x = age, y = expression, color = genotype)) + 
             geom_point() +
             xlab("age (days)") +
             facet_wrap( ~ gene) + 
             stat_smooth(method="lm", se=FALSE, cex=0.5)

topTable in action: genotypeNrlKO:age

topTable(ebFit, coef = "genotypeNrlKO:age")
                  logFC  AveExpr         t      P.Value    adj.P.Val        B
1451635_at   -0.1871800 7.791237 -14.00463 1.684302e-16 7.596369e-12 27.52010
1425222_x_at -0.1700908 6.514475 -13.23451 9.961393e-16 2.246344e-11 25.74877
1441809_at    0.1336356 6.649440  12.55702 5.037221e-15 7.572790e-11 24.13110
1452243_at   -0.1531326 6.840975 -11.88284 2.670693e-14 3.011273e-10 22.46414
1431174_at   -0.1655894 7.418000 -11.39882 9.161231e-14 8.263614e-10 21.23123
1458418_at   -0.2037314 6.270187 -10.92850 3.122990e-13 2.347500e-09 20.00384
1435436_at   -0.1154500 8.274676 -10.80855 4.289363e-13 2.763637e-09 19.68613
1446715_at   -0.1409774 5.613122 -10.57909 7.912260e-13 4.325169e-09 19.07307
1442190_at   -0.1323038 5.896077 -10.54672 8.630965e-13 4.325169e-09 18.98601
1416306_at    0.1702059 6.560050  10.41285 1.238250e-12 5.584630e-09 18.62456
  • topTable(ebFit, coef = 4) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for genotypeNrlKO:age

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = "genotypeNrlKO:age", number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixItx <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixItx, 
       aes( x = age, y = expression, color = genotype)) + 
             geom_point() +
             xlab("age (days)") +
             facet_wrap( ~ gene) + 
             stat_smooth(method="lm", se=FALSE, cex=0.5)

topTable in action: any effect of genotype

topTable(ebFit, coef = c("genotypeNrlKO", "genotypeNrlKO:age")) 
             genotypeNrlKO genotypeNrlKO.age  AveExpr        F      P.Value
1450946_at      -4.7134891       -0.05411298 8.733103 246.5762 2.403638e-22
1451635_at       0.4868705       -0.18718005 7.791237 129.8112 1.307185e-17
1451617_at      -0.9885684       -0.22329233 7.419244 128.9935 1.450111e-17
1425222_x_at     0.3698846       -0.17009077 6.514475 120.0041 4.719613e-17
1441809_at      -0.1576064        0.13363563 6.649440 117.5571 6.594308e-17
1416306_at       0.4278776        0.17020589 6.560050 113.2125 1.212332e-16
1439143_at       0.6570269        0.06170447 5.856507 111.2102 1.616023e-16
1449526_a_at     0.9935801        0.04232958 7.372946 110.5598 1.775851e-16
1452243_at       0.1639373       -0.15313265 6.840975 106.2784 3.344140e-16
1451763_at      -0.6819899       -0.14010359 8.160065 102.7010 5.771358e-16
                adj.P.Val
1450946_at   1.084065e-17
1451635_at   2.180049e-13
1451617_at   2.180049e-13
1425222_x_at 5.321482e-13
1441809_at   5.948198e-13
1416306_at   9.112895e-13
1439143_at   1.001158e-12
1449526_a_at 1.001158e-12
1452243_at   1.675823e-12
1451763_at   2.442992e-12
  • topTable(ebFit, coef = c(2,4)) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for any effect of genotype

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = c("genotypeNrlKO", "genotypeNrlKO:age"), number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixGenotypeMarginal <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixGenotypeMarginal, 
       aes( x = age, y = expression, color = genotype)) + 
             geom_point() +
             xlab("age (days)") +
             facet_wrap( ~ gene) + 
             stat_smooth(method="lm", se=FALSE, cex=0.5)

Comparison of \(s_g^2\) and \(\tilde{s}_g^2\) (shrinkage!)

Fill in the blank (increases or decreases):

  1. For large variances, limma ___________ the gene-specific variance estimates.

  2. For small variances, limma ___________ the gene-specific variance estimates.

Code
df <- data.frame(limma = ebFit$s2.post, 
                 lm = gFit$sigma^2)
shrinkplot <- ggplot(df, aes(y = limma, x = lm)) +
  geom_point(alpha=0.2) +
  geom_abline(intercept=0, slope=1, color="red") +
  ggtitle("Ests of gene-specific variance")
shrinkplot

Code
shrinkplot +
  scale_x_log10() +
  scale_y_log10() 

This plot is on the log-scale to ‘zoom in’ on the low range

P-value comparison for standard vs moderated t-tests

Example from PH525x online book:

Multiple testing

What do we do with our lists of thousands of p-values?

Recall the two types of errors

Error rates

\[\alpha = P(\text{Type I Error}), \text{ } \beta = P(\text{Type II Error}), \text{ Power} = 1- \beta\]

Type I Error rate for \(m\) tests

  • \(P(\text{incorrect decision} | H_0) = \alpha\)

    • let \(\alpha=0.05\)
  • \(P(\text{correct decision} | H_0) = 1-\alpha = 0.95\)

  • \(P(\text{correct decision on }m\text{ independent tests} | H_0) =\) \[(1-\alpha)^m = 0.95^m\]

  • \(P(\text{at least one error on }m\text{ tests} | H_0) =\) \[1 - (1-\alpha)^m =\] \[1-0.95^m = \alpha_{FWER}\]

  • FWER stands for “Family-wise error rate”

Code
df <- data.frame(m = seq(1,200)) %>%
  mutate(y = 1-0.95^m)
ggplot(df, aes(x = m, y = y)) +
  geom_line(size = 1.5, alpha = 0.7, colour = "dodgerblue") +
  xlab("m (number of tests)") +
  ylab(expression(paste(1-0.95^m, " (FWER)")))

Multiple comparisons in genomics

Controlling FWER with the Bonferroni correction

  • FWER is the probability of making at least one error when testing \(m\) hypotheses
  • Control the FWER: limit the probability of making at least one incorrect decision

  • One example: the Bonferroni correction for \(\alpha=0.05\):

    \[\text{We want: } P(\text{at least one error on }m \text{ tests}) \lt \alpha\]

    \[\text{We know: } P(\text{at least one error on }m \text{ tests}) \le \sum_{i=1}^m P(\text{error on test }i)\]

    \[\text{We let: } P(\text{error on test }i) = \alpha_{Bon} \Rightarrow \sum_{i=1}^m P(\text{error on test }i) = m \alpha_{Bon} \]

    \[\Rightarrow \alpha_{Bon} = \frac{\alpha}{m} = \frac{0.05}{m} \text{ (Reject if p-value } p_i \lt \frac{\alpha}{m} \text{)}\]

Can we do better?

  • Bonferroni correction is very conservative (i.e. controls the FWER even lower than \(\alpha\) in many settings)

  • Several other options are better

  • For example, the Holm procedure: the multiplier used for the p-value correction decreases with rank (it is not the same for all genes), making it more powerful than Bonferroni

\[p_{Holm(1)} = mp_{(1)}\]

\[p_{Holm(2)} = (m-1)p_{(2)}\] \[p_{Holm(3)} = (m-2)p_{(3)}\] \[\vdots\] \[\Rightarrow FWER \le \alpha\]
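
Both corrections are one-liners with p.adjust; a small sketch on made-up p-values:

p <- c(0.0008, 0.009, 0.127, 0.205, 0.396)   # hypothetical p-values
p.adjust(p, method = "bonferroni")   # multiply every p-value by m (capped at 1)
p.adjust(p, method = "holm")         # rank-dependent multipliers m, m-1, ...
# Holm-adjusted values are never larger than Bonferroni's,
# so Holm rejects at least as much while still controlling FWER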

How practical is the FWER in high-throughput biology?

  • Why do we care so much about making one single error??

  • One easy way to ensure no Type I errors: reject no hypotheses! 😃

    • However, then our power is zero… 😭

Important

Being overly strict about Type I error leads to greater Type II error (loss of power)

Radical idea

FWER is not that relevant in high-throughput studies

It’s OK to make multiple errors, as long as you also have a comparatively large number of true positives!

Enter: the False Discovery Rate (FDR)

False Discovery Rate

FDR is designed to control the expected proportion of false positives (V) among all hypotheses where the null has been rejected (R)

\[FDR = E \Big[ \frac{V}{R} \Big]\]

FDR vs FPR vs FWER

  • False Discovery Rate (FDR) is the rate that significant features \((R)\) are truly null

\[FDR = E \Big[ \frac{V}{R} \Big]\]

  • False Positive Rate (FPR) is the rate that truly null features \((m_0)\) are called significant

\[FPR = E\Big[\frac{V}{m_0}\Big]\]

  • Family-Wise Error Rate (FWER) is the probability that the number of truly null features rejected \((V)\) is at least 1

\[\text{FWER } = P(V\ge1) \]
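
A toy numeric contrast of the first two quantities (all counts hypothetical):

m  <- 1000   # total tests
m0 <- 900    # truly null tests
R  <- 100    # rejections ("discoveries")
V  <- 10     # false positives among the rejections

V / R    # realized false discovery proportion: 10% of discoveries are false
V / m0   # realized false positive rate: ~1.1% of true nulls were rejected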

Benjamini-Hochberg FDR (BH procedure)

  • Benjamini and Hochberg proposed the idea of controlling FDR instead of FWER

  • They also proposed a procedure for doing so

    • note that we know \(R\), but we don’t know \(V\)
  • Procedure: control FDR at level \(q\)

    1. order the raw p-values \(p_{(1)} \le p_{(2)} \le ...\le p_{(m)}\)

    2. find test with highest rank \(j\) such that \(p_{(j)} < \frac{jq}{m}\)

    3. declare all tests with ranks \(1, 2, \ldots, j\) significant
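
The procedure is easy to carry out by hand in R; a sketch using the toy p-values from the following slides, checked against p.adjust:

p <- c(0.0008, 0.009, 0.127, 0.205, 0.396,
       0.450, 0.641, 0.781, 0.900, 0.993)
m <- length(p); q <- 0.05
j <- seq_len(m)

pass <- sort(p) < j * q / m   # compare ordered p-values to the line jq/m
max(which(pass))              # highest passing rank: 2 -> reject ranks 1 and 2

p.adjust(p, method = "BH")    # equivalent BH-adjusted p-values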

Controlling FDR at level \(q\) = 0.05

Rank \((j)\) P-value
1 0.0008
2 0.009
3 0.127
4 0.205
5 0.396
6 0.450
7 0.641
8 0.781
9 0.900
10 0.993

\(\text{}\)

Rank \((j)\) P-value BH-threshold: jq/m
1 0.0008 0.005
2 0.009 0.010
3 0.127 0.015
4 0.205 0.020
5 0.396 0.025
6 0.450 0.030
7 0.641 0.035
8 0.781 0.040
9 0.900 0.045
10 0.993 0.050

\(\text{}\)

Rank \((j)\) P-value BH-threshold: jq/m BH-adjusted p-value: \(p_{(i)}^{BH}= \text{min}(\text{min}_{j\ge i}(mp_{(j)}/j), 1)\)
1 0.0008 0.005 0.0080
2 0.009 0.010 0.0450
3 0.127 0.015 0.4233
4 0.205 0.020 0.5125
5 0.396 0.025 0.7500
6 0.450 0.030 0.7500
7 0.641 0.035 0.9157
8 0.781 0.040 0.9763
9 0.900 0.045 0.9930
10 0.993 0.050 0.9930

\(\text{}\)

Rank \((j)\) P-value BH-threshold: jq/m Reject \(H_0\)?
1 0.0008 0.005 \(\checkmark\)
2 0.009 0.010 \(\checkmark\)
3 0.127 0.015
4 0.205 0.020
5 0.396 0.025
6 0.450 0.030
7 0.641 0.035
8 0.781 0.040
9 0.900 0.045
10 0.993 0.050

\(\text{}\)

Rank \((j)\) P-val BH-thr: jq/m Reject \(H_0\)? \(FWER < 0.05\)?
1 0.0008 0.005 \(\checkmark\) \(\checkmark\)
2 0.009 0.010 \(\checkmark\)
3 0.127 0.015
4 0.205 0.020
5 0.396 0.025
6 0.450 0.030
7 0.641 0.035
8 0.781 0.040
9 0.900 0.045
10 0.993 0.050

Where \(\alpha_{Bon}=0.05/10=0.005\)

BH FDR values given in limma by default

topTable(ebFit, coef = "genotypeNrlKO")
                  logFC  AveExpr          t      P.Value    adj.P.Val        B
1450946_at   -4.7134891 8.733103 -15.583683 5.390931e-18 2.431364e-13 28.40938
1442070_at   -0.9669400 8.268122  -9.787407 6.896220e-12 1.555132e-07 16.46086
1443457_at   -1.1530057 6.049428  -9.091412 4.947029e-11 6.029428e-07 14.68843
1422679_s_at -2.1187394 9.495435  -9.002625 6.388954e-11 6.029428e-07 14.45703
1431708_a_at -2.1610625 8.591450  -8.967950 7.062007e-11 6.029428e-07 14.36634
1433050_at   -1.5083887 8.468891  -8.923932 8.021234e-11 6.029428e-07 14.25096
1426288_at   -4.2100657 9.323796  -8.772209 1.246491e-10 8.031144e-07 13.85106
1418108_at    0.8662091 8.260647   8.537984 2.475294e-10 1.395478e-06 13.22713
1450770_at    1.3103576 7.357033   8.128847 8.332369e-10 3.747976e-06 12.11870
1457802_at   -0.8362030 7.778189  -8.117112 8.629984e-10 3.747976e-06 12.08657

Or, obtain them yourself for any vector of p-values p with p.adjust(p, method="BH"):

pvals <- topTable(ebFit, coef = "genotypeNrlKO", number = Inf)$P.Value
p.adjust(pvals, method = "BH") %>% head()
[1] 2.431364e-13 1.555132e-07 6.029428e-07 6.029428e-07 6.029428e-07
[6] 6.029428e-07

Other ways to control FDR

  • BH is just one (the first) method to control FDR

  • Since the publication of the BH method, other methods have been proposed

  • One of the most popular is Storey’s q-value

  • qvalue package implementation: provides q-values (used analogously to BH-adjusted p-values)

Storey’s q-value vs BH (Conceptual)

  • Just like BH, it focuses on the proportion of discoveries that are false positives

  • Conceptual difference between BH and Storey’s q-value is:

    • BH controls the FDR

    • q-values give an unbiased estimate of the FDR (will control the FDR on average)

Storey’s q-value vs BH (Mathematical)

  • Mathematically, the difference between the two is in how \(m_0\) is estimated

    • Or equivalently, how \(\pi_0=\frac{m_0}{m}\) is estimated (since \(m\) is known)

    • \(\pi_0\) represents the overall proportion of tests that are truly null

    • BH conservatively assumes that \(\pi_0=1\); the q-value method estimates it from the distribution of p-values

  • q-value:

\[\hat{q}(p_i) = \min_{i\le j} \Big( \frac{\hat{\pi}_0m}{j}p_{(j)}\Big)\]

  • q-value and BH-adjusted p-values are equivalent when \(\pi_0=1\)

\[\hat{p}_{BH}(p_i) = \min_{i\le j} \Big(\frac{m}{j}p_{(j)}\Big)\]
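
A sketch with the qvalue Bioconductor package (assuming it is installed); with only a handful of p-values the default \(\hat{\pi}_0\) estimate is unstable, so here we fix \(\pi_0 = 1\), which reproduces BH exactly:

library(qvalue)
p <- c(0.0008, 0.009, 0.127, 0.205, 0.396,
       0.450, 0.641, 0.781, 0.900, 0.993)
qobj <- qvalue(p, pi0 = 1)                        # pi0 fixed rather than estimated
cbind(BH = p.adjust(p, "BH"), q = qobj$qvalues)   # identical columns in this case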

BH vs q-value in toy example

Rank \((j)\) P-value \(\hat{p}_{BH}(p_i)\) \(\hat{q}(p_i)\)
1 0.0008 0.008 0.008
2 0.009 0.045 0.045
3 0.127 0.423 0.423
4 0.205 0.513 0.513
5 0.396 0.750 0.750
6 0.450 0.750 0.750
7 0.641 0.916 0.916
8 0.781 0.976 0.976
9 0.900 0.993 0.993
10 0.993 0.993 0.993

FDR control is an active area of research

Assumptions about p-values

Note

Implicit assumption for all multiple testing correction methods: p-value distribution is “well-behaved”

What does this mean?

Primarily, that the distribution of p-values under the null is uniform (i.e. flat)
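
A quick simulation check: p-values from t-tests on pure-noise data are approximately uniform:

set.seed(1)
null_p <- replicate(10000, t.test(rnorm(5), rnorm(5))$p.value)
hist(null_p, breaks = 20, main = "p-values under the null")   # roughly flat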

p-value distributions

Spike of small p-values indicates non-null tests.

Code
# plot histogram of p-values for genotypeNrlKO
hist(pvals)

Great primer on how things can go wrong: http://varianceexplained.org/statistics/interpreting-pvalue-histogram/

What if p-values are poorly behaved?

  • FDR estimates can be invalid (assumptions are violated)

    • Some possible causes: test assumptions violated, model misspecification, uncontrolled variation/batch effects, selection bias/filtering
  • Possible solution: nonparametric test

    • Limitation: lack of flexibility to adjust for multiple covariates
  • Another possible solution: estimate the sampling distribution of the test statistic (and hence p-values) “empirically” using resampling techniques (see the sketch after this list)

    • Permutation: construct a simulated version of your dataset that satisfies the null hypothesis (e.g. shuffle group labels for a two-group comparison) and compute the statistic; repeat many times and use the permutation statistics as your sampling distribution rather than a t, Normal, F, \(\chi^2\), etc.

    • Limitation: computationally intensive for genomics; not optimal for small sample sizes
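
A minimal sketch of a label-shuffling permutation test for a single gene with two groups (toy data):

set.seed(1)
y   <- rnorm(10)                      # toy expression values for one gene
grp <- rep(c("A", "B"), each = 5)     # toy group labels
obs <- t.test(y ~ grp)$statistic      # observed test statistic

# shuffle labels to build an empirical null distribution of the statistic
perm <- replicate(1000, t.test(y ~ sample(grp))$statistic)
mean(abs(perm) >= abs(obs))           # empirical two-sided p-value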

Compounding issues of multiple comparisons

  • What if you’re not only testing 20K genes, but also multiple tests per gene (e.g. multiple contrasts, such as several two-group comparisons)?

  • Classical procedures for adjustment (low dimensional setting):

    • Tukey multiple comparison procedure
    • Scheffe multiple comparison procedure
    • Bonferroni or Holm FWER correction
  • In our setting, we can also apply BH to all p-values globally

    • limma::decideTests(pvals, method="global") for a matrix of p-values or eBayes output (e.g. rows = genes, columns = contrasts)

    • p-values are combined, adjusted globally, then separated back out and sorted
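
Applied to the fit from earlier in these slides, a global adjustment looks like this (thresholds shown are the defaults):

dt <- decideTests(ebFit, method = "global", adjust.method = "BH", p.value = 0.05)
summary(dt)   # counts of down (-1), not significant (0), and up (1) per coefficient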

Next up: regression with count measures (RNA-seq)!

  • So far, all of our modeling has assumed that our outcome \(Y_i\) (gene expression measurements for a particular gene) is continuous, with iid (constant-variance) errors.

  • From RNA-seq, we obtain discrete, skewed distributions (with non-constant variance) of counts that represent gene expression levels across genes. These violate modeling assumptions for standard linear regression / limma.

  • Next time, we’ll explore:

    • Data transformations and extensions to the limma framework to deal with these violations

    • Generalized linear models (negative binomial regression) to directly model counts