More limma & multiple testing

Keegan Korthauer

February 6, 2025

Learning objectives

  1. Use limma to perform genome-wide differential expression testing on microarray data

  2. Understand the key advantages of limma over lm (standard linear regression)

  3. Explain why multiple testing increases the number of errors we make by chance

  4. Be able to adjust for multiple comparisons by controlling the False Discovery Rate

    • e.g. using Benjamini-Hochberg or Storey’s q-value

Recall: The hybrid estimator in limma

\[\tilde{s}^2_g = \frac{d_0}{d_0 + d}s_0^2 + \frac{d}{d_0 + d}s^2_g\]

Recall that \((s_0, d_0)\) are the prior parameters for \(\sigma^2_g\) (random variable):

\[\frac{1}{\sigma^2_g} \sim \frac{1}{d_0s_0^2} \chi^2_{d_0}\]

The prior parameters incorporate information from all genes which allows us to shrink/nudge the gene-specific variances toward a common consensus

  • they are estimated from the data - the formulas for \((s_0, d_0)\) and their derivations are beyond the scope of the course, but limma takes care of the details for us

  • note that \((s_0, d_0)\) do not depend on \(g\)
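
To make the weighting concrete, here is a minimal R sketch of the shrinkage formula (the numbers below are hypothetical; limma estimates \((s_0^2, d_0)\) for you):

d0   <- 3          # hypothetical prior degrees of freedom
s0_2 <- 0.05       # hypothetical prior variance
d    <- 4          # residual degrees of freedom (n - p - 1)
s2_g <- c(0.002, 0.05, 0.8)   # hypothetical gene-wise residual variances

# hybrid estimator: weighted average of prior and gene-specific variances
s2_tilde <- (d0 * s0_2 + d * s2_g) / (d0 + d)
s2_tilde   # each estimate is pulled toward s0^2 = 0.05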

Illustration

Figure 5 from https://doi.org/10.1186/s12859-019-3248-9

Moderated t statistic

  • plug in this posterior mean estimate to obtain a ‘moderated’ t statistic:

\[\large \tilde{t}_{gj} = \frac{\hat{\alpha}_{gj}}{\tilde{s}_g \sqrt{v_{jj}}}\]

  • Under limma’s assumptions, we know the null distribution for \(\tilde{t}_{gj}\) under \(H_0\): \[\tilde{t}_{gj} \sim t_{d_0+d}\]

  • parameters from the prior \(d_0\) and \(s_0^2\) are estimated from the data

  • This is how limma is a hybrid of frequentist (t-statistic) and Bayesian (hierarchical model) approaches (i.e. empirical Bayes)
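
As a quick sketch (all numbers hypothetical), the moderated t statistic and its p-value can be computed directly once the moderated variance is in hand:

alpha_hat <- 1.2            # hypothetical estimated coefficient for gene g, term j
v_jj      <- 0.25           # hypothetical diagonal element of (X^T X)^{-1}
s_tilde   <- sqrt(0.06)     # moderated residual SD from the shrinkage formula
d0 <- 3; d <- 4

t_mod <- alpha_hat / (s_tilde * sqrt(v_jj))
p_mod <- 2 * pt(-abs(t_mod), df = d0 + d)   # note the extra d0 degrees of freedom
c(t = t_mod, p = p_mod)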

Side-by-side comparison of key quantities and results

Quantity OLS limma
Estimated gene-wise residual variance: \({s}_g^2 = \frac{1}{d} \hat{\boldsymbol\varepsilon}^T \hat{\boldsymbol\varepsilon}\) \(\tilde{s}^2_g = \frac{d_0s_0^2 + ds^2_g}{d_0 + d}\)
t-statistic for \(H_0: \alpha_{gj} = 0\): \({t}_{gj} = \frac{\hat{\alpha}_{gj}}{s_g \sqrt{v_{jj}}}\) \(\tilde{t}_{gj} = \frac{\hat{\alpha}_{gj}}{\tilde{s}_g \sqrt{v_{jj}}}\)
Distribution of t-statistic under \(H_0\): \({t}_{gj} \sim t_{d}\) \(\tilde{t}_{gj} \sim t_{d_0+d}\)

Moderated (limma) vs traditional (lm) tests

  • moderated variances are “shrunk” toward the typical gene-wise variance, relative to the raw sample residual variances
  • degrees of freedom for the null distribution increase relative to lm \((d_0 + d \text{ vs } d)\)
    • \(\rightarrow\) makes the null distribution closer to a standard normal
    • \(\rightarrow\) makes tail probabilities (p-values) smaller
    • \(\rightarrow\) easier to reject the null
  • overall, when all is well, limma delivers statistical results that are more stable and more powerful

Overview: limma workflow

Functions that make your life easier

Function Description
model.matrix Takes in your data frame and makes a design matrix
limma::lmFit Fits the linear model to each gene separately – replace gene with “feature” depending on your data (‘industrial scale’ lm)
limma::eBayes Use output of linear regression to compute moderated t statistics
limma::topTable Query your results; sort your p-values; sort genes; Adjust for multiple comparisons
limma::decideTests Identify which genes are significantly differentially expressed for each contrast
limma::makeContrasts Create the contrast matrix \(C\) that you desire
limma::contrasts.fit Apply a contrast to your estimates
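
For instance, a hedged sketch of the contrast workflow, using the dataset loaded later in these slides (the cell-means design and the P28-vs-E16 contrast are chosen just for illustration):

design <- model.matrix(~ 0 + dev_stage, data = pData(eset))  # cell-means coding
colnames(design) <- levels(pData(eset)$dev_stage)

cont <- makeContrasts(P28vsE16 = P28 - E16, levels = design)
fit  <- lmFit(eset, design)
fit2 <- eBayes(contrasts.fit(fit, cont))
topTable(fit2, coef = "P28vsE16")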

Getting help

limma step one: lmFit

\[\mathbf{Y}_g = \mathbf{X} \boldsymbol\alpha_g + \boldsymbol\varepsilon_g,\]

  • Within each gene, observations are assumed independent and identically distributed, with constant variance

  • lmFit() carries out multiple linear regression on each gene

  • Usage: lmFit(object, design)

    • object is a data.frame or matrix with features in rows and samples in columns (G genes by N samples), or other Bioconductor object such as ExpressionSet

    • design is a design matrix (output of model.matrix(y ~ x); N samples by \(p+1\) parameters)
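
A minimal toy call on simulated data (all names here are made up) shows the expected shapes:

set.seed(1)
mat <- matrix(rnorm(600), nrow = 100, ncol = 6)   # 100 "genes" x 6 samples
grp <- factor(rep(c("ctl", "trt"), each = 3))     # hypothetical grouping variable
fit <- limma::lmFit(mat, model.matrix(~ grp))
head(fit$coefficients)   # one (Intercept) and one grptrt estimate per gene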

Bioconductor’s ExpressionSet class

Let’s run limma for the interactive model with age (continuous) and genotype (factor)

\[\Large y_{ig} = \theta + \tau_{KO}x_{ig,KO} + \tau_{Age}x_{ig,Age} + \tau_{KO:Age}x_{ig,KO}x_{ig,Age} + \varepsilon_{ig}\]

  • \(i\) indexes mouse, \(g\) indexes genes

  • \(x_{ig,KO}\) is the indicator variable for the NrlKO group

  • \(x_{ig,Age}\) is the continuous age variable
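
To see the coding concretely, here is the design matrix model.matrix would build for a toy version of this metadata (two mice per genotype, ages chosen arbitrarily):

meta <- data.frame(
  genotype = factor(c("WT", "WT", "NrlKO", "NrlKO"), levels = c("WT", "NrlKO")),
  age      = c(-4, 28, -4, 28)
)
model.matrix(~ genotype * age, data = meta)
# columns: (Intercept), genotypeNrlKO, age, genotypeNrlKO:age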

Interactive model with age and genotype

Example gene (but we want to fit this model on all genes):

Code
# load required packages (GEOquery for getGEO; Biobase for pData/exprs;
# tidyverse for the wrangling and plotting below)
library(GEOquery)
library(Biobase)
library(tidyverse)

# read in our dataset
eset <- getGEO("GSE4051", getGPL = FALSE)[[1]]

# recode time points
pData(eset) <- pData(eset) %>%
  mutate(sample_id = geo_accession) %>%
  mutate(dev_stage =  case_when(
    grepl("E16", title) ~ "E16",
    grepl("P2", title) ~ "P2",
    grepl("P6", title) ~ "P6",
    grepl("P10", title) ~ "P10",
    grepl("4 weeks", title) ~ "P28"
  )) %>%
  mutate(genotype = case_when(
    grepl("Nrl-ko", title) ~ "NrlKO",
    grepl("wt", title) ~ "WT"
  ))

# reorder factor levels, add continuous age variable
pData(eset) <- pData(eset) %>%
  mutate(dev_stage = fct_relevel(dev_stage, "E16", "P2", "P6", "P10", "P28")) %>%
  mutate(genotype = as.factor(genotype)) %>%
  mutate(genotype = fct_relevel(genotype, "WT", "NrlKO")) %>%
  mutate(age = ifelse(dev_stage == "E16", -4,
                            ifelse(dev_stage == "P2", 2, 
                                   ifelse(dev_stage == "P6", 6, 
                                          ifelse(dev_stage == "P10", 10, 28)))))

# function to return tidy data that merges expression matrix and metadata
toLongerMeta <- function(expset) {
    stopifnot(class(expset) == "ExpressionSet")
    
    expressionMatrix <- exprs(expset) %>% 
      as.data.frame() %>%
      rownames_to_column("gene") %>%
      pivot_longer(cols = !gene, 
                   values_to = "expression",
                   names_to = "sample_id") %>%
      left_join(pData(expset) %>% select(sample_id, dev_stage, age, genotype),
                by = "sample_id")
    return(expressionMatrix)
}

# pull out gene of interest
Tmem_dat <- toLongerMeta(eset) %>% 
  filter(gene == "1441811_x_at") %>%
  mutate(gene = "Tmem176a") 

Tmem_plot <- ggplot(Tmem_dat, aes(x = age, y = expression, colour = genotype)) + 
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Tmem176a") +
  ylim(5, 11) +
  xlab("age (days)") 
Tmem_plot

Computation is fast

  • Bioconductor way:
system.time(gFit <- lmFit(eset, 
                          model.matrix(~ genotype*age, data = pData(eset))))
   user  system elapsed 
  0.040   0.003   0.176 
  • Equivalently, using the ‘separate’ way (where myDat is the expression matrix and myMeta is the sample metadata):
gFit <- lmFit(myDat, 
              model.matrix(~ genotype*age, data = myMeta))
  • Using lmFit to fit an interactive model on 45K probesets takes a fraction of a second
  • The time-intensive parts of an analysis: data wrangling, selecting a model, choosing how to parameterize it, and interpreting it

Output of Step 1 (lmFit)

summary(gFit)
                 Length Class  Mode     
coefficients     180404 -none- numeric  
rank                  1 -none- numeric  
assign                4 -none- numeric  
qr                    5 qr     list     
df.residual       45101 -none- numeric  
sigma             45101 -none- numeric  
cov.coefficients     16 -none- numeric  
stdev.unscaled   180404 -none- numeric  
pivot                 4 -none- numeric  
Amean             45101 -none- numeric  
method                1 -none- character
design              156 -none- numeric  
nrow(eset)
Features 
   45101 
nrow(eset)*4
Features 
  180404 
ncol(eset)*4
Samples 
    156 
  • OK… but where are the shrunken variance estimates?? How do I pull out p-values??

  • Actually, we haven’t carried out the empirical Bayesian computation yet – still need to run eBayes()!

Step 2: Moderated t-tests using eBayes

summary(gFit)
                 Length Class  Mode     
coefficients     180404 -none- numeric  
rank                  1 -none- numeric  
assign                4 -none- numeric  
qr                    5 qr     list     
df.residual       45101 -none- numeric  
sigma             45101 -none- numeric  
cov.coefficients     16 -none- numeric  
stdev.unscaled   180404 -none- numeric  
pivot                 4 -none- numeric  
Amean             45101 -none- numeric  
method                1 -none- character
design              156 -none- numeric  
summary(ebFit <- eBayes(gFit))
                 Length Class  Mode     
coefficients     180404 -none- numeric  
rank                  1 -none- numeric  
assign                4 -none- numeric  
qr                    5 qr     list     
df.residual       45101 -none- numeric  
sigma             45101 -none- numeric  
cov.coefficients     16 -none- numeric  
stdev.unscaled   180404 -none- numeric  
pivot                 4 -none- numeric  
Amean             45101 -none- numeric  
method                1 -none- character
design              156 -none- numeric  
df.prior              1 -none- numeric  
s2.prior              1 -none- numeric  
var.prior             4 -none- numeric  
proportion            1 -none- numeric  
s2.post           45101 -none- numeric  
t                180404 -none- numeric  
df.total          45101 -none- numeric  
p.value          180404 -none- numeric  
lods             180404 -none- numeric  
F                 45101 -none- numeric  
F.p.value         45101 -none- numeric  

Components of the empirical Bayes estimators

math description limma::eBayes dimension in lm?
\(s_g^2\) gene-specific residual variance sigma^2 45K \(\checkmark\)
\(d\) residual degrees of freedom \((n-p-1)\) df.residual Scalar \(\checkmark\)
\(s_0^2\) mean of inverse \(\chi^2\) prior for \(s_g^2\) s2.prior Scalar
\(d_0\) degrees of freedom for the prior df.prior Scalar
\(\tilde{s}_g^2\) posterior mean of \(s_g^2\) (moderated gene-specific residual variance) s2.post 45K
\(\tilde{t}_g\) moderated t statistics t 45K\(\times\)p

Step 3: topTable() to extract test output

topTable(fit, coef = NULL, number = 10, genelist = fit$genes, adjust.method = "BH",
         sort.by = "B", resort.by = NULL, p.value = 1, lfc = 0, confint = FALSE)
  • Refer to the help page (?topTable) for full details of each argument

  • Summary of key topTable arguments:

    • coef is the argument where you specify the coefficient(s) you want to test for equality with zero (default is NULL; must be specified)
    • number lets you control size of hit list (default is 10)
    • p.value lets you specify a minimum adjusted p-value cutoff (default is 1)
    • lfc lets you specify a minimum observed effect size - log2 fold change (default is 0)
    • sort.by and resort.by give control over the ordering (default is by “B”: log-odds that the gene is differentially expressed)
    • adjust.method specifies how/if to adjust p-values for multiple testing (default is BH)
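
For example, one plausible call combining these arguments (the thresholds here are arbitrary):

hits <- topTable(ebFit, coef = "genotypeNrlKO",
                 number = Inf,     # return all probes passing the filters
                 p.value = 0.05,   # cutoff applied to the BH-adjusted p-value
                 lfc = 1)          # require |log2 fold change| >= 1 (2-fold)
nrow(hits)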

topTable in action: genotypeNrlKO

topTable(ebFit, coef = "genotypeNrlKO")
                  logFC  AveExpr          t      P.Value    adj.P.Val        B
1450946_at   -4.7134891 8.733103 -15.583683 5.390931e-18 2.431364e-13 28.40938
1442070_at   -0.9669400 8.268122  -9.787407 6.896220e-12 1.555132e-07 16.46086
1443457_at   -1.1530057 6.049428  -9.091412 4.947029e-11 6.029428e-07 14.68843
1422679_s_at -2.1187394 9.495435  -9.002625 6.388954e-11 6.029428e-07 14.45703
1431708_a_at -2.1610625 8.591450  -8.967950 7.062007e-11 6.029428e-07 14.36634
1433050_at   -1.5083887 8.468891  -8.923932 8.021234e-11 6.029428e-07 14.25096
1426288_at   -4.2100657 9.323796  -8.772209 1.246491e-10 8.031144e-07 13.85106
1418108_at    0.8662091 8.260647   8.537984 2.475294e-10 1.395478e-06 13.22713
1450770_at    1.3103576 7.357033   8.128847 8.332369e-10 3.747976e-06 12.11870
1457802_at   -0.8362030 7.778189  -8.117112 8.629984e-10 3.747976e-06 12.08657
  • topTable(ebFit, coef = 2) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for genotypeNrlKO

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = "genotypeNrlKO", number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixGenotype <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixGenotype, 
       aes(x = age, y = expression, color = genotype)) + 
  geom_point() +
  xlab("age (days)") +
  facet_wrap( ~ gene) + 
  stat_smooth(method="lm", se=FALSE, cex=0.5)

topTable in action: age

topTable(ebFit, coef = "age")
                  logFC  AveExpr        t      P.Value    adj.P.Val        B
1451635_at   0.19294061 7.791237 20.78935 3.204717e-22 1.445359e-17 40.60768
1425222_x_at 0.17604111 6.514475 19.72633 1.962954e-21 4.426561e-17 38.79502
1433699_at   0.14875031 6.939973 17.80467 6.442530e-20 7.957605e-16 35.29412
1452243_at   0.15889179 6.840975 17.75658 7.057586e-20 7.957605e-16 35.20253
1435436_at   0.13067119 8.274676 17.61808 9.187120e-20 8.286966e-16 34.93763
1458418_at   0.21277027 6.270187 16.43686 9.326171e-19 7.006477e-15 32.60718
1424977_at   0.07814988 6.505464 16.36079 1.087456e-18 7.006477e-15 32.45260
1431174_at   0.16158579 7.418000 16.01901 2.183055e-18 1.230725e-14 31.75106
1421818_at   0.11250919 7.982278 15.66666 4.531133e-18 2.270651e-14 31.01562
1419069_at   0.09325234 8.222096 15.54775 5.813459e-18 2.621928e-14 30.76457
  • topTable(ebFit, coef = 3) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for age

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = "age", number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixAge <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixAge, 
       aes( x = age, y = expression, color = genotype)) + 
             geom_point() +
             xlab("age (days)") +
             facet_wrap( ~ gene) + 
             stat_smooth(method="lm", se=FALSE, cex=0.5)

topTable in action: genotypeNrlKO:age

topTable(ebFit, coef = "genotypeNrlKO:age")
                  logFC  AveExpr         t      P.Value    adj.P.Val        B
1451635_at   -0.1871800 7.791237 -14.00463 1.684302e-16 7.596369e-12 27.52010
1425222_x_at -0.1700908 6.514475 -13.23451 9.961393e-16 2.246344e-11 25.74877
1441809_at    0.1336356 6.649440  12.55702 5.037221e-15 7.572790e-11 24.13110
1452243_at   -0.1531326 6.840975 -11.88284 2.670693e-14 3.011273e-10 22.46414
1431174_at   -0.1655894 7.418000 -11.39882 9.161231e-14 8.263614e-10 21.23123
1458418_at   -0.2037314 6.270187 -10.92850 3.122990e-13 2.347500e-09 20.00384
1435436_at   -0.1154500 8.274676 -10.80855 4.289363e-13 2.763637e-09 19.68613
1446715_at   -0.1409774 5.613122 -10.57909 7.912260e-13 4.325169e-09 19.07307
1442190_at   -0.1323038 5.896077 -10.54672 8.630965e-13 4.325169e-09 18.98601
1416306_at    0.1702059 6.560050  10.41285 1.238250e-12 5.584630e-09 18.62456
  • topTable(ebFit, coef = 4) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for genotypeNrlKO:age

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = "genotypeNrlKO:age", number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixItx <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixItx, 
       aes( x = age, y = expression, color = genotype)) + 
             geom_point() +
             xlab("age (days)") +
             facet_wrap( ~ gene) + 
             stat_smooth(method="lm", se=FALSE, cex=0.5)

topTable in action: any effect of genotype

topTable(ebFit, coef = c("genotypeNrlKO", "genotypeNrlKO:age")) 
             genotypeNrlKO genotypeNrlKO.age  AveExpr        F      P.Value
1450946_at      -4.7134891       -0.05411298 8.733103 246.5762 2.403638e-22
1451635_at       0.4868705       -0.18718005 7.791237 129.8112 1.307185e-17
1451617_at      -0.9885684       -0.22329233 7.419244 128.9935 1.450111e-17
1425222_x_at     0.3698846       -0.17009077 6.514475 120.0041 4.719613e-17
1441809_at      -0.1576064        0.13363563 6.649440 117.5571 6.594308e-17
1416306_at       0.4278776        0.17020589 6.560050 113.2125 1.212332e-16
1439143_at       0.6570269        0.06170447 5.856507 111.2102 1.616023e-16
1449526_a_at     0.9935801        0.04232958 7.372946 110.5598 1.775851e-16
1452243_at       0.1639373       -0.15313265 6.840975 106.2784 3.344140e-16
1451763_at      -0.6819899       -0.14010359 8.160065 102.7010 5.771358e-16
                adj.P.Val
1450946_at   1.084065e-17
1451635_at   2.180049e-13
1451617_at   2.180049e-13
1425222_x_at 5.321482e-13
1441809_at   5.948198e-13
1416306_at   9.112895e-13
1439143_at   1.001158e-12
1449526_a_at 1.001158e-12
1452243_at   1.675823e-12
1451763_at   2.442992e-12
  • topTable(ebFit, coef = c(2,4)) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for any effect of genotype

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = c("genotypeNrlKO", "genotypeNrlKO:age"), number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixGenotypeMarginal <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixGenotypeMarginal, 
       aes( x = age, y = expression, color = genotype)) + 
             geom_point() +
             xlab("age (days)") +
             facet_wrap( ~ gene) + 
             stat_smooth(method="lm", se=FALSE, cex=0.5)

Comparison of \(s_g^2\) and \(\tilde{s}_g^2\) (shrinkage!)

Fill in the blank (increases or decreases):

  1. For large variances, limma ___________ the gene-specific variance estimates.

  2. For small variances, limma ___________ the gene-specific variance estimates.

Code
df <- data.frame(limma = ebFit$s2.post, 
                 lm = gFit$sigma^2)
shrinkplot <- ggplot(df, aes(y = limma, x = lm)) +
  geom_point(alpha=0.2) +
  geom_abline(intercept=0, slope=1, color="red") +
  ggtitle("Ests of gene-specific variance")
shrinkplot

Code
shrinkplot +
  scale_x_log10() +
  scale_y_log10() 

This plot is on the log-scale to ‘zoom in’ on the low range

P-value comparison for standard vs moderated t-tests

Example from PH525x online book:

Multiple testing

What do we do with our lists of thousands of p-values?

Recall the two types of errors

Error rates

\[\alpha = P(\text{Type I Error}), \text{ } \beta = P(\text{Type II Error}), \text{ Power} = 1- \beta\]

Type I Error rate for \(m\) tests

  • \(P(\text{incorrect decision} | H_0) = \alpha\)

    • let \(\alpha=0.05\)
  • \(P(\text{correct decision} | H_0) = 1-\alpha = 0.95\)

  • \(P(\text{correct decision on }m\text{ independent tests} | H_0) =\) \[(1-\alpha)^m = 0.95^m\]

  • \(P(\text{at least one error on }m\text{ tests} | H_0) =\) \[1 - (1-\alpha)^m =\] \[1-0.95^m = \alpha_{FWER}\]

  • FWER stands for “Family-wise error rate”

Code
df <- data.frame(m = seq(1,200)) %>%
  mutate(y = 1-0.95^m)
ggplot(df, aes(x = m, y = y)) +
  geom_line(size = 1.5, alpha = 0.7, colour = "dodgerblue") +
  xlab("m (number of tests)") +
  ylab(expression(paste(1-0.95^m, " (FWER)")))

Multiple comparisons in genomics

Controlling FWER with the Bonferroni correction

  • FWER is the probability of making at least one error when testing \(m\) hypotheses
  • Control the FWER: limit the probability of making at least one incorrect decision

  • One example: the Bonferroni correction for \(\alpha=0.05\):

    \[\text{We want: } P(\text{at least one error on }m \text{ tests}) \lt \alpha\]

    \[\text{We know: } P(\text{at least one error on }m \text{ tests}) \le \sum_{i=1}^m P(\text{error on test }i)\]

    \[\text{We let: } P(\text{error on test }i) = \alpha_{Bon} \Rightarrow \sum_{i=1}^m P(\text{error on test }i) = m \alpha_{Bon} \]

    \[\Rightarrow \alpha_{Bon} = \frac{\alpha}{m} = \frac{0.05}{m} \text{ (Reject if p-value } p_i \lt \frac{\alpha}{m} \text{)}\]

Can we do better?

  • Bonferroni correction is very conservative (i.e. controls the FWER even lower than \(\alpha\) in many settings)

  • Several other options are better

  • For example, the Holm procedure: the multiplier used for the p-value correction decreases with rank (it is not the same for all genes), making it more powerful than Bonferroni

\[p_{Holm(1)} = mp_{(1)}\]

\[p_{Holm(2)} = (m-1)p_{(2)}\] \[p_{Holm(3)} = (m-2)p_{(3)}\] \[\vdots\] \[\Rightarrow FWER \le \alpha\]
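
Both corrections are one-liners with p.adjust; a small sketch on made-up p-values:

p <- c(0.0008, 0.009, 0.127, 0.205, 0.396)   # hypothetical p-values
p.adjust(p, method = "bonferroni")   # multiply every p-value by m (capped at 1)
p.adjust(p, method = "holm")         # rank-dependent multipliers m, m-1, ...
# Holm-adjusted values are never larger than Bonferroni's,
# so Holm rejects at least as much while still controlling FWER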

How practical is the FWER in high-throughput biology?

  • Why do we care so much about making one single error??

  • One easy way to ensure no Type I errors: reject no hypotheses! 😃

    • However, then our power is zero… 😭

Important

Being overly strict about Type I error leads to greater Type II error (loss of power)

Radical idea

FWER is not that relevant in high-throughput studies

It’s OK to make multiple errors, as long as you also have a comparatively large number of true positives!

Enter: the False Discovery Rate (FDR)

False Discovery Rate

FDR is designed to control the expected proportion of false positives (V) among all hypotheses where the null has been rejected (R)

\[FDR = E \Big[ \frac{V}{R} \Big]\]

FDR vs FPR vs FWER

  • False Discovery Rate (FDR) is the rate that significant features \((R)\) are truly null

\[FDR = E \Big[ \frac{V}{R} \Big]\]

  • False Positive Rate (FPR) is the rate that truly null features \((m_0)\) are called significant

\[FPR = E\Big[\frac{V}{m_0}\Big]\]

  • Family-Wise Error Rate (FWER) is the probability that the number of truly null features rejected \((V)\) is at least 1

\[\text{FWER } = P(V\ge1) \]
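
A toy numeric contrast of the first two quantities (all counts hypothetical):

m  <- 1000   # total tests
m0 <- 900    # truly null tests
R  <- 100    # rejections ("discoveries")
V  <- 10     # false positives among the rejections

V / R    # realized false discovery proportion: 10% of discoveries are false
V / m0   # realized false positive rate: ~1.1% of true nulls were rejected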

Benjamini-Hochberg FDR (BH procedure)

  • Benjamini and Hochberg proposed the idea of controlling FDR instead of FWER

  • They also proposed a procedure for doing so

    • note that we know \(R\), but we don’t know \(V\)
  • Procedure: control FDR at level \(q\)

    1. order the raw p-values \(p_{(1)} \le p_{(2)} \le ...\le p_{(m)}\)

    2. find test with highest rank \(j\) such that \(p_{(j)} < \frac{jq}{m}\)

    3. declare all tests with ranks \(1, 2, \ldots, j\) significant
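
The procedure is easy to carry out by hand in R; a sketch using the toy p-values from the following slides, checked against p.adjust:

p <- c(0.0008, 0.009, 0.127, 0.205, 0.396,
       0.450, 0.641, 0.781, 0.900, 0.993)
m <- length(p); q <- 0.05
j <- seq_len(m)

pass <- sort(p) < j * q / m   # compare ordered p-values to the line jq/m
max(which(pass))              # highest passing rank: 2 -> reject ranks 1 and 2

p.adjust(p, method = "BH")    # equivalent BH-adjusted p-values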

Controlling FDR at level \(q\) = 0.05

Rank \((j)\) P-value
1 0.0008
2 0.009
3 0.127
4 0.205
5 0.396
6 0.450
7 0.641
8 0.781
9 0.900
10 0.993

\(\text{}\)

Rank \((j)\) P-value BH-threshold: jq/m
1 0.0008 0.005
2 0.009 0.010
3 0.127 0.015
4 0.205 0.020
5 0.396 0.025
6 0.450 0.030
7 0.641 0.035
8 0.781 0.040
9 0.900 0.045
10 0.993 0.050

\(\text{}\)

Rank \((j)\) P-value BH-threshold: jq/m BH-adjusted p-value: \(p_{(i)}^{BH}= \text{min}(\text{min}_{j\ge i}(mp_{(j)}/j), 1)\)
1 0.0008 0.005 0.0080
2 0.009 0.010 0.0450
3 0.127 0.015 0.4233
4 0.205 0.020 0.5125
5 0.396 0.025 0.7500
6 0.450 0.030 0.7500
7 0.641 0.035 0.9157
8 0.781 0.040 0.9763
9 0.900 0.045 0.9930
10 0.993 0.050 0.9930

\(\text{}\)

Rank \((j)\) P-value BH-threshold: jq/m Reject \(H_0\)?
1 0.0008 0.005 \(\checkmark\)
2 0.009 0.010 \(\checkmark\)
3 0.127 0.015
4 0.205 0.020
5 0.396 0.025
6 0.450 0.030
7 0.641 0.035
8 0.781 0.040
9 0.900 0.045
10 0.993 0.050

\(\text{}\)

Rank \((j)\) P-val BH-thr: jq/m Reject \(H_0\)? \(FWER < 0.05\)?
1 0.0008 0.005 \(\checkmark\) \(\checkmark\)
2 0.009 0.010 \(\checkmark\)
3 0.127 0.015
4 0.205 0.020
5 0.396 0.025
6 0.450 0.030
7 0.641 0.035
8 0.781 0.040
9 0.900 0.045
10 0.993 0.050

Where \(\alpha_{Bon}=0.05/10=0.005\)

BH FDR values given in limma by default

topTable(ebFit, coef = "genotypeNrlKO")
                  logFC  AveExpr          t      P.Value    adj.P.Val        B
1450946_at   -4.7134891 8.733103 -15.583683 5.390931e-18 2.431364e-13 28.40938
1442070_at   -0.9669400 8.268122  -9.787407 6.896220e-12 1.555132e-07 16.46086
1443457_at   -1.1530057 6.049428  -9.091412 4.947029e-11 6.029428e-07 14.68843
1422679_s_at -2.1187394 9.495435  -9.002625 6.388954e-11 6.029428e-07 14.45703
1431708_a_at -2.1610625 8.591450  -8.967950 7.062007e-11 6.029428e-07 14.36634
1433050_at   -1.5083887 8.468891  -8.923932 8.021234e-11 6.029428e-07 14.25096
1426288_at   -4.2100657 9.323796  -8.772209 1.246491e-10 8.031144e-07 13.85106
1418108_at    0.8662091 8.260647   8.537984 2.475294e-10 1.395478e-06 13.22713
1450770_at    1.3103576 7.357033   8.128847 8.332369e-10 3.747976e-06 12.11870
1457802_at   -0.8362030 7.778189  -8.117112 8.629984e-10 3.747976e-06 12.08657

Or, obtain them yourself for any vector of p-values p with p.adjust(p, method="BH"):

pvals <- topTable(ebFit, coef = "genotypeNrlKO", number = Inf)$P.Value
p.adjust(pvals, method = "BH") %>% head()
[1] 2.431364e-13 1.555132e-07 6.029428e-07 6.029428e-07 6.029428e-07
[6] 6.029428e-07

Other ways to control FDR

  • BH is just one (the first) method to control FDR

  • Since the publication of the BH method, other methods have been proposed

  • One of the most popular is Storey’s q-value

  • qvalue package implementation: provides q-values (used analogously to BH-adjusted p-values)

Storey’s q-value vs BH (Conceptual)

  • Just like BH, it focuses on the proportion of discoveries that are false positives

  • Conceptual difference between BH and Storey’s q-value is:

    • BH controls the FDR

    • q-values give an unbiased estimate of the FDR (will control the FDR on average)

Storey’s q-value vs BH (Mathematical)

  • Mathematically, the difference between the two is in how \(m_0\) is estimated

    • Or equivalently, how \(\pi_0=\frac{m_0}{m}\) is estimated (since \(m\) is known)

    • \(\pi_0\) represents the overall proportion of tests that are truly null

    • BH conservatively assumes that \(\pi_0=1\); the q-value method estimates it from the distribution of p-values

  • q-value:

\[\hat{q}(p_i) = \min_{i\le j} \Big( \frac{\hat{\pi}_0m}{j}p_{(j)}\Big)\]

  • q-value and BH-adjusted p-values are equivalent when \(\pi_0=1\)

\[\hat{p}_{BH}(p_i) = \min_{i\le j} \Big(\frac{m}{j}p_{(j)}\Big)\]
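
A sketch with the qvalue Bioconductor package (assuming it is installed); with only a handful of p-values the default \(\hat{\pi}_0\) estimate is unstable, so here we fix \(\pi_0 = 1\), which reproduces BH exactly:

library(qvalue)
p <- c(0.0008, 0.009, 0.127, 0.205, 0.396,
       0.450, 0.641, 0.781, 0.900, 0.993)
qobj <- qvalue(p, pi0 = 1)                        # pi0 fixed rather than estimated
cbind(BH = p.adjust(p, "BH"), q = qobj$qvalues)   # identical columns in this case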

BH vs q-value in toy example

Rank \((j)\) P-value \(\hat{p}_{BH}(p_i)\) \(\hat{q}(p_i)\)
1 0.0008 0.008 0.008
2 0.009 0.045 0.045
3 0.127 0.423 0.423
4 0.205 0.513 0.513
5 0.396 0.750 0.750
6 0.450 0.750 0.750
7 0.641 0.916 0.916
8 0.781 0.976 0.976
9 0.900 0.993 0.993
10 0.993 0.993 0.993

FDR control is an active area of research

Assumptions about p-values

Note

Implicit assumption for all multiple testing correction methods: p-value distribution is “well-behaved”

What does this mean?

Primarily, that the distribution of p-values under the null is uniform (i.e. flat)
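
A quick simulation check: p-values from t-tests on pure-noise data are approximately uniform:

set.seed(1)
null_p <- replicate(10000, t.test(rnorm(5), rnorm(5))$p.value)
hist(null_p, breaks = 20, main = "p-values under the null")   # roughly flat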

p-value distributions

Spike of small p-values indicates non-null tests.

Code
# plot histogram of p-values for genotypeNrlKO
hist(pvals)

Great primer on how things can go wrong: http://varianceexplained.org/statistics/interpreting-pvalue-histogram/

What if p-values are poorly behaved?

  • FDR estimates can be invalid (assumptions are violated)

    • Some possible causes: test assumptions violated, model misspecification, uncontrolled variation/batch effects, selection bias/filtering
  • Possible solution: nonparametric test

    • Limitation: lack of flexibility to adjust for multiple covariates
  • Another possible solution: estimate the sampling distribution of the test statistic (and hence p-values) “empirically” using resampling techniques (see the sketch after this list)

    • Permutation: construct a simulated version of your dataset that satisfies the null hypothesis (e.g. shuffle group labels for a two-group comparison) and compute the statistic; repeat many times and use the permutation statistics as your sampling distribution rather than a t, Normal, F, \(\chi^2\), etc.

    • Limitation: computationally intensive for genomics; not optimal for small sample sizes
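
A minimal sketch of a label-shuffling permutation test for a single gene with two groups (toy data):

set.seed(1)
y   <- rnorm(10)                      # toy expression values for one gene
grp <- rep(c("A", "B"), each = 5)     # toy group labels
obs <- t.test(y ~ grp)$statistic      # observed test statistic

# shuffle labels to build an empirical null distribution of the statistic
perm <- replicate(1000, t.test(y ~ sample(grp))$statistic)
mean(abs(perm) >= abs(obs))           # empirical two-sided p-value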

Compounding issues of multiple comparisons

  • What if you’re not only testing 20K genes, but also multiple tests per gene (e.g. multiple contrasts, such as several two-group comparisons)?

  • Classical procedures for adjustment (low dimensional setting):

    • Tukey multiple comparison procedure
    • Scheffe multiple comparison procedure
    • Bonferroni or Holm FWER correction
  • In our setting, we can also apply BH to all p-values globally

    • limma::decideTests(pvals, method="global") for a matrix of p-values or eBayes output (e.g. rows = genes, columns = contrasts)

    • p-values are combined, adjusted globally, then separated back out and sorted
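
Applied to the fit from earlier in these slides, a global adjustment looks like this (thresholds shown are the defaults):

dt <- decideTests(ebFit, method = "global", adjust.method = "BH", p.value = 0.05)
summary(dt)   # counts of down (-1), not significant (0), and up (1) per coefficient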

Next up: regression with count measures (RNA-seq)!

  • So far, all of our modeling has assumed that our outcome \(Y_i\) (gene expression measurements for a particular gene) is continuous, with iid (constant-variance) errors.

  • From RNA-seq, we obtain discrete, skewed distributions (with non-constant variance) of counts that represent gene expression levels across genes. These violate modeling assumptions for standard linear regression / limma.

  • Next time, we’ll explore:

    • Data transformations and extensions to the limma framework to deal with these violations

    • Generalized linear models (negative binomial regression) to directly model counts