More limma & multiple testing

Keegan Korthauer

February 8, 2024

Learning objectives

  1. Use limma to perform genome-wide differential expression testing on microarray data

  2. Understand the key differences between limma and lm

  3. Explain why multiple testing increases the number of errors we make by chance

  4. Be able to adjust for multiple comparisons by controlling the False Discovery Rate

    1. e.g. using Benjamini-Hochberg or Storey’s q-value

Recall: The hybrid estimator in limma

\[\tilde{s}^2_g = \frac{d_0s_0^2 + ds^2_g}{d_0 + d}\]

Recall that \((s_0, d_0)\) are the prior parameters for \(\sigma^2_g\) (random variable):

\[\frac{1}{\sigma^2_g} \sim \frac{1}{d_0s_0^2} \chi^2_{d_0}\]

The prior parameters incorporate information from all genes which allows us to shrink/nudge the gene-specific variances toward a common consensus

  • they are estimated from the data - the formulas for \((s_0, d_0)\) and their derivations are beyond the scope of the course, but limma takes care of the details for us

  • note that \((s_0, d_0)\) do not depend on \(g\)

Preview: limma workflow

Functions that make your life easier

Function Description
model.matrix Takes in your data frame and makes a design matrix
limma::lmFit Fits the linear model to each gene separately – replace gene with “feature” depending on your data (‘industrial scale’ lm)
limma::eBayes Use output of linear regression to compute moderated t statistics
limma::topTable Query your results; sort your p-values; sort genes; Adjust for multiple comparisons
limma::decideTests Identify which genes are significantly differentially expressed for each contrast
limma::makeContrasts Create the contrast matrix \(C\) that you desire
limma::contrast.fit Apply a contrast to your estimates

Getting help

limma step one: lmFit

\[\mathbf{Y}_g = \mathbf{X} \boldsymbol\alpha_g + \boldsymbol\varepsilon_g,\]

  • Within each gene observations are iid / constant variance

  • lmFit() carries out multiple linear regression on each gene

  • Usage: lmFit(object, design)

    • object is a data.frame or matrix with features in rows and samples in columns (G genes by N samples), or other Bioconductor object such as ExpressionSet

    • design is a design matrix (output of model.matrix(y ~ x); N samples by \(p\) parameters)

Bioconductor’s ExpressionSet class; image source

Let’s run limma for the interactive model with age (continuous) and genotype (factor)

\[\Large y_{ig} = \theta + \tau_{KO}x_{ig,KO} + \tau_{Age}x_{ig,Age} + \tau_{KO:Age}x_{ig,KO}x_{ig,Age}\]

  • \(i\) indexes mouse, \(g\) indexes genes

  • \(x_{ig,KO}\) is the indicator variable for the NrlKO group

  • \(x_{ig,Age}\) is the continuous age variable

Interactive model with age and genotype

Example gene (but we want to fit this model on all genes):

Code
library(GEOquery)

# read in our dataset
eset <- getGEO("GSE4051", getGPL = FALSE)[[1]]

# recode time points
pData(eset) <- pData(eset) %>%
  mutate(sample_id = geo_accession) %>%
  mutate(dev_stage =  case_when(
    grepl("E16", title) ~ "E16",
    grepl("P2", title) ~ "P2",
    grepl("P6", title) ~ "P6",
    grepl("P10", title) ~ "P10",
    grepl("4 weeks", title) ~ "P28"
  )) %>%
  mutate(genotype = case_when(
    grepl("Nrl-ko", title) ~ "NrlKO",
    grepl("wt", title) ~ "WT"
  ))

# reorder factor levels, add continous age variable
pData(eset) <- pData(eset) %>%
  mutate(dev_stage = fct_relevel(dev_stage, "E16", "P2", "P6", "P10", "P28")) %>%
  mutate(genotype = as.factor(genotype)) %>%
  mutate(genotype = fct_relevel(genotype, "WT", "NrlKO")) %>%
  mutate(age = ifelse(dev_stage == "E16", -4,
                            ifelse(dev_stage == "P2", 2, 
                                   ifelse(dev_stage == "P6", 6, 
                                          ifelse(dev_stage == "P10", 10, 28)))))

# function to return tidy data that merges expression matrix and metadata
toLongerMeta <- function(expset) {
    stopifnot(class(expset) == "ExpressionSet")
    
    expressionMatrix <- lonExpressionressionMatrix <- exprs(expset) %>% 
    as.data.frame() %>%
    rownames_to_column("gene") %>%
    pivot_longer(cols = !gene, 
                 values_to = "expression",
                 names_to = "sample_id") %>%
    left_join(pData(expset) %>% select(sample_id, dev_stage, age, genotype),
            by = "sample_id")
  return(expressionMatrix)
}

# pull out gene of interest
Tmem_dat <- toLongerMeta(eset) %>% 
  filter(gene == "1441811_x_at") %>%
  mutate(gene = "Tmem176a") 

Tmem_plot <- ggplot(Tmem_dat, aes(x = age, y = expression, colour = genotype)) + 
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Tmem176a") +
  ylim(5, 11) +
  xlab("age (days)") 
Tmem_plot

Arranging lmFit input: Bioconductor way

eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 45101 features, 39 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM92610 GSM92611 ... GSM92648 (39 total)
  varLabels: title geo_accession ... age (40 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 16505381 
Annotation: GPL1261 

Arranging lmFit input: Separate expression/metadata

myDat <- assayData(eset)$exprs
dim(myDat)
[1] 45101    39
head(myDat)
             GSM92610 GSM92611  GSM92612 GSM92613  GSM92614  GSM92615 GSM92616
1415670_at   7.108863 7.322392  7.420947 7.351444  7.240428  7.336860 7.382548
1415671_at   9.714002 9.797742  9.831072 9.658442  9.708906 10.030904 7.636955
1415672_at   9.429030 9.846977 10.003092 9.914112 10.174548 10.242718 8.422627
1415673_at   8.426974 8.404206  8.594600 8.404206  8.835334  8.371316 8.362592
1415674_a_at 8.498338 8.458287  8.426651 8.372776  8.541722  8.892941 8.509461
1415675_at   9.739536 9.508494  9.598063 9.454508  9.672208  9.613500 9.655965
             GSM92617  GSM92618 GSM92619 GSM92620 GSM92621 GSM92622 GSM92623
1415670_at   7.215360  7.215360 7.118826 7.335319 7.087594 7.154789 7.076775
1415671_at   9.213571  9.488295 9.149957 9.330127 9.564756 9.554756 9.321551
1415672_at   9.590419 10.034838 9.842881 9.752964 9.881279 9.842790 9.241509
1415673_at   8.309017  8.546114 8.322816 8.430680 8.574776 8.329279 8.297757
1415674_a_at 8.309017  8.495662 8.209640 8.481508 8.586018 8.495662 8.480084
1415675_at   9.754453  9.806620 9.723386 9.700739 9.865178 9.970262 9.924127
             GSM92624  GSM92625 GSM92626  GSM92627 GSM92628 GSM92629 GSM92630
1415670_at   7.112767  7.281802 7.152759  7.176287 7.149683 7.249942 7.034715
1415671_at   8.235927 10.289950 9.872989 10.159710 8.952893 9.664297 8.381062
1415672_at   9.134840  9.907832 9.675417  9.723334 9.301134 9.514345 9.206305
1415673_at   8.126502  8.422610 8.281813  8.317929 8.173417 8.491196 8.753887
1415674_a_at 8.327027  8.680725 8.496576  8.495662 8.413914 8.418862 8.256568
1415675_at   9.753508 10.209366 9.780233  9.712202 9.629302 9.669490 9.547021
             GSM92631 GSM92632  GSM92633  GSM92634  GSM92635 GSM92636 GSM92637
1415670_at   7.131226 7.374397  7.413965  7.236344  7.070270 7.168648 7.007006
1415671_at   8.730366 9.435807 10.016159  9.477942 10.125344 9.853912 8.972240
1415672_at   9.525853 9.484793 10.041910 10.013613  9.907014 9.913023 9.219221
1415673_at   8.647076 8.494675  8.374498  8.362059  8.486530 8.403623 8.419167
1415674_a_at 8.283164 8.339452  8.615350  8.585476  8.641078 8.520345 8.527555
1415675_at   9.454296 9.327131  9.718741  9.590773  9.699846 9.708509 9.652533
             GSM92638 GSM92639  GSM92640 GSM92641  GSM92642 GSM92643 GSM92644
1415670_at   6.812339 7.545176  7.153775 7.181640  7.190204 7.108709 7.215360
1415671_at   8.829574 9.524798  9.218747 9.487902  9.159176 9.747169 8.636329
1415672_at   9.394474 9.917235 10.058851 9.412131 10.112277 9.393012 9.427566
1415673_at   8.100682 8.782033  8.348986 8.734018  8.198618 8.365879 8.327091
1415674_a_at 8.319203 8.573714  8.446919 8.392846  8.501715 8.361249 8.427396
1415675_at   9.341688 9.672208  9.504651 9.868285  9.654112 9.638198 9.673790
              GSM92645  GSM92646  GSM92647 GSM92648
1415670_at    7.309237  7.259025  7.660502 7.339120
1415671_at    9.262298  9.271657 10.031862 8.265807
1415672_at   10.102144 10.538415  9.885323 9.471748
1415673_at    8.372551  8.148068  8.557602 8.381085
1415674_a_at  8.490949  8.548849  8.692265 8.405225
1415675_at    9.569368  9.483743  9.782834 9.678281
myMeta <- pData(eset)
head(myMeta)
                                         title geo_accession
GSM92610 Nrl-ko-Gfp 4 weeks retina replicate 1      GSM92610
GSM92611        Nrl-ko-Gfp 4 weeks replicate 2      GSM92611
GSM92612        Nrl-ko-Gfp 4 weeks replicate 3      GSM92612
GSM92613        Nrl-ko-Gfp 4 weeks replicate 4      GSM92613
GSM92614            Nrl-ko-Gfp E16 replicate 1      GSM92614
GSM92615            Nrl-ko-Gfp E16 replicate 2      GSM92615
                        status submission_date last_update_date type
GSM92610 Public on Feb 28 2006     Jan 16 2006      Aug 28 2018  RNA
GSM92611 Public on Feb 28 2006     Jan 16 2006      Aug 28 2018  RNA
GSM92612 Public on Feb 28 2006     Jan 16 2006      Aug 28 2018  RNA
GSM92613 Public on Feb 28 2006     Jan 16 2006      Aug 28 2018  RNA
GSM92614 Public on Feb 28 2006     Jan 16 2006      Aug 28 2018  RNA
GSM92615 Public on Feb 28 2006     Jan 16 2006      Aug 28 2018  RNA
         channel_count                   source_name_ch1 organism_ch1
GSM92610             1                 Gfp purified cell Mus musculus
GSM92611             1 purified Gfp+ photoreceptor cells Mus musculus
GSM92612             1          Gfp+ photoreceptor cells Mus musculus
GSM92613             1          Gfp+ photoreceptor cells Mus musculus
GSM92614             1          Gfp+ photoreceptor cells Mus musculus
GSM92615             1          Gfp+ photoreceptor cells Mus musculus
                              characteristics_ch1 molecule_ch1
GSM92610 Nrl-ko, purified Gfp+ photoreceptor cell    total RNA
GSM92611     Nrl-ko-Gfp, Gfp+ photoreceptor cells    total RNA
GSM92612     Nrl-ko-Gfp, Gfp+ photoreceptor cells    total RNA
GSM92613     Nrl-ko-Gfp, Gfp+ photoreceptor cells    total RNA
GSM92614     Nrl-ko-Gfp, Gfp+ photoreceptor cells    total RNA
GSM92615     Nrl-ko-Gfp, Gfp+ photoreceptor cells    total RNA
                        extract_protocol_ch1 label_ch1  label_protocol_ch1
GSM92610 mRNA was amplified using Nugene kit    Biotin Nugene kit protocol
GSM92611 mRNA was amplified using Nugene kit    Biotin Nugene kit protocol
GSM92612 mRNA was amplified using Nugene kit    Biotin Nugene kit protocol
GSM92613 mRNA was amplified using Nugene kit    Biotin Nugene kit protocol
GSM92614 mRNA was amplified using Nugene kit    Biotin Nugene kit protocol
GSM92615 mRNA was amplified using Nugene kit    Biotin Nugene kit protocol
         taxid_ch1        hyb_protocol       scan_protocol
GSM92610     10090 Affymetrix standard Affymetrix standard
GSM92611     10090 Affymetrix standard Affymetrix standard
GSM92612     10090 Affymetrix standard Affymetrix standard
GSM92613     10090 Affymetrix standard Affymetrix standard
GSM92614     10090 Affymetrix standard Affymetrix standard
GSM92615     10090 Affymetrix standard Affymetrix standard
                                                                                                                                                                                     description
GSM92610 Nrl-ko mice were mated with wt-Gfp mice in which Gfp expression is driven by Nrl promoter. Gfp+ photoreceptors were purified by FACS and RNA was extracted and amplified by Nugene kit.
GSM92611 Nrl-ko mice were mated with wt-Gfp mice in which Gfp expression is driven by Nrl promoter. Gfp+ photoreceptors were purified by FACS and RNA was extracted and amplified by Nugene kit.
GSM92612 Nrl-ko mice were mated with wt-Gfp mice in which Gfp expression is driven by Nrl promoter. Gfp+ photoreceptors were purified by FACS and RNA was extracted and amplified by Nugene kit.
GSM92613 Nrl-ko mice were mated with wt-Gfp mice in which Gfp expression is driven by Nrl promoter. Gfp+ photoreceptors were purified by FACS and RNA was extracted and amplified by Nugene kit.
GSM92614 Nrl-ko mice were mated with wt-Gfp mice in which Gfp expression is driven by Nrl promoter. Gfp+ photoreceptors were purified by FACS and RNA was extracted and amplified by Nugene kit.
GSM92615 Nrl-ko mice were mated with wt-Gfp mice in which Gfp expression is driven by Nrl promoter. Gfp+ photoreceptors were purified by FACS and RNA was extracted and amplified by Nugene kit.
                                   description.1 description.2 data_processing
GSM92610      This is Nrl-ko-Gfp 4 weeks sample.                           RMA
GSM92611  This is Nrl-ko-Gfp 4 weeks replicate 2                           RMA
GSM92612  This is Nrl-ko-Gfp 4 weeks replicate 3                           RMA
GSM92613 This is Nrl-ko-Gfp 4 weeks replicate 4.                           RMA
GSM92614     This is Nrl-ko-Gfp E16 replicate 1.                           RMA
GSM92615     This is Nrl-ko-Gfp E16 replicate 2.                           RMA
         platform_id   contact_name     contact_email contact_phone
GSM92610     GPL1261 Swaroop,,Anand swaroop@umich.edu  734-615 2246
GSM92611     GPL1261 Swaroop,,Anand swaroop@umich.edu  734-615 2246
GSM92612     GPL1261 Swaroop,,Anand swaroop@umich.edu  734-615 2246
GSM92613     GPL1261 Swaroop,,Anand swaroop@umich.edu  734-615 2246
GSM92614     GPL1261 Swaroop,,Anand swaroop@umich.edu  734-615 2246
GSM92615     GPL1261 Swaroop,,Anand swaroop@umich.edu  734-615 2246
                      contact_department      contact_institute contact_address
GSM92610 Ophthalmology & Visual Sciences University of Michigan   1000 Wall St.
GSM92611 Ophthalmology & Visual Sciences University of Michigan   1000 Wall St.
GSM92612 Ophthalmology & Visual Sciences University of Michigan   1000 Wall St.
GSM92613 Ophthalmology & Visual Sciences University of Michigan   1000 Wall St.
GSM92614 Ophthalmology & Visual Sciences University of Michigan   1000 Wall St.
GSM92615 Ophthalmology & Visual Sciences University of Michigan   1000 Wall St.
         contact_city contact_state contact_zip/postal_code contact_country
GSM92610    Ann Arbor            MI                   48105             USA
GSM92611    Ann Arbor            MI                   48105             USA
GSM92612    Ann Arbor            MI                   48105             USA
GSM92613    Ann Arbor            MI                   48105             USA
GSM92614    Ann Arbor            MI                   48105             USA
GSM92615    Ann Arbor            MI                   48105             USA
                     contact_web_link
GSM92610 http://www.umich.edu/~retina
GSM92611 http://www.umich.edu/~retina
GSM92612 http://www.umich.edu/~retina
GSM92613 http://www.umich.edu/~retina
GSM92614 http://www.umich.edu/~retina
GSM92615 http://www.umich.edu/~retina
                                                                     supplementary_file
GSM92610 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM92nnn/GSM92610/suppl/GSM92610.CEL.gz
GSM92611 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM92nnn/GSM92611/suppl/GSM92611.CEL.gz
GSM92612 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM92nnn/GSM92612/suppl/GSM92612.CEL.gz
GSM92613 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM92nnn/GSM92613/suppl/GSM92613.CEL.gz
GSM92614 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM92nnn/GSM92614/suppl/GSM92614.CEL.gz
GSM92615 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM92nnn/GSM92615/suppl/GSM92615.CEL.gz
         data_row_count                 relation sample_id dev_stage genotype
GSM92610          45101 Reanalyzed by: GSE119085  GSM92610       P28    NrlKO
GSM92611          45101 Reanalyzed by: GSE119085  GSM92611       P28    NrlKO
GSM92612          45101 Reanalyzed by: GSE119085  GSM92612       P28    NrlKO
GSM92613          45101 Reanalyzed by: GSE119085  GSM92613       P28    NrlKO
GSM92614          45101 Reanalyzed by: GSE119085  GSM92614       E16    NrlKO
GSM92615          45101 Reanalyzed by: GSE119085  GSM92615       E16    NrlKO
         age
GSM92610  28
GSM92611  28
GSM92612  28
GSM92613  28
GSM92614  -4
GSM92615  -4

Formulating lmFit input: Design Matrix

Bioconductor way:

desMat <- model.matrix(~ genotype*age, 
                       data = pData(eset))

Equivalently, if using the separate way:

desMat <- model.matrix(~ genotype*age, 
                       data = myMeta)
desMat 
         (Intercept) genotypeNrlKO age genotypeNrlKO:age
GSM92610           1             1  28                28
GSM92611           1             1  28                28
GSM92612           1             1  28                28
GSM92613           1             1  28                28
GSM92614           1             1  -4                -4
GSM92615           1             1  -4                -4
GSM92616           1             1  -4                -4
GSM92617           1             1  10                10
GSM92618           1             1  10                10
GSM92619           1             1  10                10
GSM92620           1             1  10                10
GSM92621           1             1   2                 2
GSM92622           1             1   2                 2
GSM92623           1             1   2                 2
GSM92624           1             1   2                 2
GSM92625           1             1   6                 6
GSM92626           1             1   6                 6
GSM92627           1             1   6                 6
GSM92628           1             1   6                 6
GSM92629           1             0  28                 0
GSM92630           1             0  28                 0
GSM92631           1             0  28                 0
GSM92632           1             0  28                 0
GSM92633           1             0  -4                 0
GSM92634           1             0  -4                 0
GSM92635           1             0  -4                 0
GSM92636           1             0  -4                 0
GSM92637           1             0  10                 0
GSM92638           1             0  10                 0
GSM92639           1             0  10                 0
GSM92640           1             0  10                 0
GSM92641           1             0   2                 0
GSM92642           1             0   2                 0
GSM92643           1             0   2                 0
GSM92644           1             0   2                 0
GSM92645           1             0   6                 0
GSM92646           1             0   6                 0
GSM92647           1             0   6                 0
GSM92648           1             0   6                 0
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$genotype
[1] "contr.treatment"

Important

Using Bioconductor objects that store the measurements and metadata in one object reduces the risk of errors due to subsetting, reordering, etc.

Computation is fast

  • Bioconductor way:
system.time(gFit <- lmFit(eset, 
                          model.matrix(~ genotype*age, data = pData(eset))))
   user  system elapsed 
  0.156   0.041   0.200 
  • Equivalently, using the ‘separate’ way:
gFit <- lmFit(myDat, 
              model.matrix(~ genotype*age, data = myMeta))
  • Using lmFit to fit an interactive model on 45K probesets takes a fraction of a second
  • The time-intensive parts of an analysis: data wrangling, selecting a model, choosing how to parameterize it, and interpreting it

Output of lmFit

summary(gFit)
                 Length Class  Mode     
coefficients     180404 -none- numeric  
rank                  1 -none- numeric  
assign                4 -none- numeric  
qr                    5 qr     list     
df.residual       45101 -none- numeric  
sigma             45101 -none- numeric  
cov.coefficients     16 -none- numeric  
stdev.unscaled   180404 -none- numeric  
pivot                 4 -none- numeric  
Amean             45101 -none- numeric  
method                1 -none- character
design              156 -none- numeric  
nrow(eset)
Features 
   45101 
nrow(eset)*4
Features 
  180404 
ncol(eset)*4
Samples 
    156 
  • OK… but where are the shrunken variance estimates?? How do I pull out p-values??

  • Actually, we haven’t carried out the empirical Bayesian computation yet – still need to run eBayes()!

Moderated t-tests using eBayes

summary(gFit)
                 Length Class  Mode     
coefficients     180404 -none- numeric  
rank                  1 -none- numeric  
assign                4 -none- numeric  
qr                    5 qr     list     
df.residual       45101 -none- numeric  
sigma             45101 -none- numeric  
cov.coefficients     16 -none- numeric  
stdev.unscaled   180404 -none- numeric  
pivot                 4 -none- numeric  
Amean             45101 -none- numeric  
method                1 -none- character
design              156 -none- numeric  
summary(ebFit <- eBayes(gFit))
                 Length Class  Mode     
coefficients     180404 -none- numeric  
rank                  1 -none- numeric  
assign                4 -none- numeric  
qr                    5 qr     list     
df.residual       45101 -none- numeric  
sigma             45101 -none- numeric  
cov.coefficients     16 -none- numeric  
stdev.unscaled   180404 -none- numeric  
pivot                 4 -none- numeric  
Amean             45101 -none- numeric  
method                1 -none- character
design              156 -none- numeric  
df.prior              1 -none- numeric  
s2.prior              1 -none- numeric  
var.prior             4 -none- numeric  
proportion            1 -none- numeric  
s2.post           45101 -none- numeric  
t                180404 -none- numeric  
df.total          45101 -none- numeric  
p.value          180404 -none- numeric  
lods             180404 -none- numeric  
F                 45101 -none- numeric  
F.p.value         45101 -none- numeric  

Components of the empirical Bayes estimators

math description limma::eBayes dimension in lm?
\(s_g^2\) gene-specific residual variance sigma^2 45K \(\checkmark\)
\(d\) residual degrees of freedom \((n-p)\) df.residual Scalar1 \(\checkmark\)
\(s_0^2\) mean of inverse \(\chi^2\) prior for \(s_g^2\) s2.prior Scalar
\(d_0\) degrees of freedom for the prior df.prior Scalar
\(\tilde{s}_g^2\) posterior mean of \(s_g^2\) (moderated gene-specific residual variance) s2.post 45K
\(\tilde{t}_g\) moderated t statistics t 45K\(\times\)p

topTable() extracts output in a convenient format!

topTable(fit, coef = NULL, number = 10, genelist = fit$genes, adjust.method = "BH",
         sort.by = "B", resort.by = NULL, p.value = 1, lfc = 0, confint = FALSE)
  • Refer to the help page (?topTable) for full details of each argument

  • Summary of key topTable arguments:

    • coef is the argument where you specify the coefficient(s) you want to test for equality with zero (default is NULL; must be specified)
    • number lets you control size of hit list (default is 10)
    • p.value lets you specify a minimum adjusted p-value cutoff (default is 1)
    • lfc lets you specify a minimum observed effect size - log2 fold change (default is 0)
    • sort.by and resort.by give control over the ordering (default is by “B”: log-odds that the gene is differentially expressed)
    • adjust.method specifies how/if to adjust p-values for multiple testing (default is BH)

topTable in action: genotypeNrlKO

topTable(ebFit, coef = "genotypeNrlKO")
                  logFC  AveExpr          t      P.Value    adj.P.Val        B
1450946_at   -4.7134891 8.733103 -15.583683 5.390931e-18 2.431364e-13 28.40938
1442070_at   -0.9669400 8.268122  -9.787407 6.896220e-12 1.555132e-07 16.46086
1443457_at   -1.1530057 6.049428  -9.091412 4.947029e-11 6.029428e-07 14.68843
1422679_s_at -2.1187394 9.495435  -9.002625 6.388954e-11 6.029428e-07 14.45703
1431708_a_at -2.1610625 8.591450  -8.967950 7.062007e-11 6.029428e-07 14.36634
1433050_at   -1.5083887 8.468891  -8.923932 8.021234e-11 6.029428e-07 14.25096
1426288_at   -4.2100657 9.323796  -8.772209 1.246491e-10 8.031144e-07 13.85106
1418108_at    0.8662091 8.260647   8.537984 2.475294e-10 1.395478e-06 13.22713
1450770_at    1.3103576 7.357033   8.128847 8.332369e-10 3.747976e-06 12.11870
1457802_at   -0.8362030 7.778189  -8.117112 8.629984e-10 3.747976e-06 12.08657
  • topTable(ebFit, coef = 2) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for genotypeNrlKO

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = "genotypeNrlKO", number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixGenotype <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixGenotype, 
       aes(x = age, y = expression, color = genotype)) + 
  geom_point() +
  xlab("age (days)") +
  facet_wrap( ~ gene) + 
  stat_smooth(method="lm", se=FALSE, cex=0.5)

topTable in action: age

topTable(ebFit, coef = "age")
                  logFC  AveExpr        t      P.Value    adj.P.Val        B
1451635_at   0.19294061 7.791237 20.78935 3.204717e-22 1.445359e-17 40.60768
1425222_x_at 0.17604111 6.514475 19.72633 1.962954e-21 4.426561e-17 38.79502
1433699_at   0.14875031 6.939973 17.80467 6.442530e-20 7.957605e-16 35.29412
1452243_at   0.15889179 6.840975 17.75658 7.057586e-20 7.957605e-16 35.20253
1435436_at   0.13067119 8.274676 17.61808 9.187120e-20 8.286966e-16 34.93763
1458418_at   0.21277027 6.270187 16.43686 9.326171e-19 7.006477e-15 32.60718
1424977_at   0.07814988 6.505464 16.36079 1.087456e-18 7.006477e-15 32.45260
1431174_at   0.16158579 7.418000 16.01901 2.183055e-18 1.230725e-14 31.75106
1421818_at   0.11250919 7.982278 15.66666 4.531133e-18 2.270651e-14 31.01562
1419069_at   0.09325234 8.222096 15.54775 5.813459e-18 2.621928e-14 30.76457
  • topTable(ebFit, coef = 3) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for age

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = "age", number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixAge <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixAge, 
       aes( x = age, y = expression, color = genotype)) + 
             geom_point() +
             xlab("age (days)") +
             facet_wrap( ~ gene) + 
             stat_smooth(method="lm", se=FALSE, cex=0.5)

topTable in action: genotypeNrlKO:age

topTable(ebFit, coef = "genotypeNrlKO:age")
                  logFC  AveExpr         t      P.Value    adj.P.Val        B
1451635_at   -0.1871800 7.791237 -14.00463 1.684302e-16 7.596369e-12 27.52010
1425222_x_at -0.1700908 6.514475 -13.23451 9.961393e-16 2.246344e-11 25.74877
1441809_at    0.1336356 6.649440  12.55702 5.037221e-15 7.572790e-11 24.13110
1452243_at   -0.1531326 6.840975 -11.88284 2.670693e-14 3.011273e-10 22.46414
1431174_at   -0.1655894 7.418000 -11.39882 9.161231e-14 8.263614e-10 21.23123
1458418_at   -0.2037314 6.270187 -10.92850 3.122990e-13 2.347500e-09 20.00384
1435436_at   -0.1154500 8.274676 -10.80855 4.289363e-13 2.763637e-09 19.68613
1446715_at   -0.1409774 5.613122 -10.57909 7.912260e-13 4.325169e-09 19.07307
1442190_at   -0.1323038 5.896077 -10.54672 8.630965e-13 4.325169e-09 18.98601
1416306_at    0.1702059 6.560050  10.41285 1.238250e-12 5.584630e-09 18.62456
  • topTable(ebFit, coef = 4) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for genotypeNrlKO:age

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = "genotypeNrlKO:age", number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixItx <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixItx, 
       aes( x = age, y = expression, color = genotype)) + 
             geom_point() +
             xlab("age (days)") +
             facet_wrap( ~ gene) + 
             stat_smooth(method="lm", se=FALSE, cex=0.5)

topTable in action: any effect of genotype

topTable(ebFit, coef = c("genotypeNrlKO", "genotypeNrlKO:age")) 
             genotypeNrlKO genotypeNrlKO.age  AveExpr        F      P.Value
1450946_at      -4.7134891       -0.05411298 8.733103 246.5762 2.403638e-22
1451635_at       0.4868705       -0.18718005 7.791237 129.8112 1.307185e-17
1451617_at      -0.9885684       -0.22329233 7.419244 128.9935 1.450111e-17
1425222_x_at     0.3698846       -0.17009077 6.514475 120.0041 4.719613e-17
1441809_at      -0.1576064        0.13363563 6.649440 117.5571 6.594308e-17
1416306_at       0.4278776        0.17020589 6.560050 113.2125 1.212332e-16
1439143_at       0.6570269        0.06170447 5.856507 111.2102 1.616023e-16
1449526_a_at     0.9935801        0.04232958 7.372946 110.5598 1.775851e-16
1452243_at       0.1639373       -0.15313265 6.840975 106.2784 3.344140e-16
1451763_at      -0.6819899       -0.14010359 8.160065 102.7010 5.771358e-16
                adj.P.Val
1450946_at   1.084065e-17
1451635_at   2.180049e-13
1451617_at   2.180049e-13
1425222_x_at 5.321482e-13
1441809_at   5.948198e-13
1416306_at   9.112895e-13
1439143_at   1.001158e-12
1449526_a_at 1.001158e-12
1452243_at   1.675823e-12
1451763_at   2.442992e-12
  • topTable(ebFit, coef = c(2,4)) is equivalent here, but much less informative!!

  • What is the null hypothesis here?

Plot the top 6 probes for any effect of genotype

Code
# grab the gene names of the top 6 genes from topTable
keep <- topTable(ebFit, coef = c("genotypeNrlKO", "genotypeNrlKO:age"), number = 6) %>%
  rownames()

# extract the data for all 6 genes in tidy format
topSixGenotypeMarginal <- toLongerMeta(eset) %>% 
  filter(gene %in% keep) %>%
  mutate(gene = factor(gene, levels = keep))

ggplot(topSixGenotypeMarginal, 
       aes( x = age, y = expression, color = genotype)) + 
             geom_point() +
             xlab("age (days)") +
             facet_wrap( ~ gene) + 
             stat_smooth(method="lm", se=FALSE, cex=0.5)

Comparison of \(s_g^2\) and \(\tilde{s}_g^2\) (shrinkage!)

Fill in the blank (increases or decreases):

  1. For large variances, limma ___________ the gene-specific variance estimates.

  2. For small variances, limma ___________ the gene-specific variance estimates.

Code
df <- data.frame(limma = ebFit$s2.post, 
                 lm = gFit$sigma^2)
shrinkplot <- ggplot(df, aes(y = limma, x = lm)) +
  geom_point(alpha=0.2) +
  geom_abline(intercept=0, slope=1, color="red") +
  ggtitle("Ests of gene-specific variance")
shrinkplot

Code
shrinkplot +
  scale_x_log10() +
  scale_y_log10() 

This plot is on the log-scale to ‘zoom in’ on the low range

Comparison of interaction coefficient p-values

  • 17479 genes where limma p-value is larger than lm
  • 27622 genes where limma p-value is smaller than lm

Multiple testing

What do we do with our lists of thousands of p-values?

Recall the two types of errors

Error rates

\[\alpha = P(\text{Type I Error}), \text{ } \beta = P(\text{Type II Error}), \text{ Power} = 1- \beta\]

Type I Error rate for \(m\) tests

  • \(P(\text{incorrect decision} | H_0) = \alpha\)

    • let \(\alpha=0.05\)
  • \(P(\text{correct decision} | H_0) = 1-\alpha = 0.95\)

  • \(P(\text{correct decision on }m\text{ tests} | H_0) =\) \[(1-\alpha)^m = 0.95^m\]

  • \(P(\text{at least one error on }m\text{ tests} | H_0) =\) \[1 - (1-\alpha)^m =\] \[1-0.95^m = \alpha_{FWER}\]

Code
df <- data.frame(m = seq(1,200)) %>%
  mutate(y = 1-0.95^m)
ggplot(df, aes(x = m, y = y)) +
  geom_line(size = 1.5, alpha = 0.7, colour = "dodgerblue") +
  xlab("m (number of tests)") +
  ylab(expression(paste(1-0.95^m, " (FWER)")))

Multiple comparisons in genomics

Family-Wise Error Rate (FWER)

  • FWER is the probability of making at least one error when testing \(m\) tests
  • Control the FWER: limit the probability of making at least one incorrect decision

  • One example: the Bonferroni correction for \(\alpha=0.05\):

    \[\text{If } P(\text{at least one error on }m \text{ tests}) \lt \alpha\]

    \[\Rightarrow P(\text{at least one error on }m \text{ tests}) \lt \sum_{i=1}^m P(\text{error on test }i)\]

    \[\sum_{i=1}^m P(\text{error on test }i) = m\alpha_{Bon}\]

    \[\alpha_{Bon} = \frac{\alpha}{m} = \frac{0.05}{m}\]

Bonferroni correction: controlling the FWER

Can think of controlling the probability of at least one false positive in two ways:

  1. Adjust the p-values; keep same \(\alpha\):

\[p_{Bon,i}=mp_i\] \[\text{ (to be more precise: } \, p_{Bon,i}=min(mp_i, 1))\]

\[\text{Then, threshold } p_{Bon,i} \text{ at } \alpha\]

  1. Adjust the \(\alpha\) threshold; keep same p-values:

\[\alpha_{Bon}=\frac{\alpha}{m}\] \[\text{Then, threshold } p_i \text{ at } \alpha_{Bon}\]

Can we do better?

  • Bonferroni correction is very conservative (i.e. controls the FWER even lower than \(\alpha\) in many settings)

  • Several other options are better

  • For example, the Holm procedure: multiplier for p-value correction is not the same for all genes; more powerful

\[p_{Holm, 1} = mp_1\]

\[p_{Holm,2} = (m-1)p_2\] \[p_{Holm,3} = (m-2)p_3\] \[\vdots\] \[\Rightarrow FWER \le \alpha\]

How practical is the FWER in high-throughput biology?

  • Why do we care so much about making one single error??

  • One easy way to ensure no Type I errors: reject no hypotheses! 😃

    • However, then our power is zero… 😭

Important

Being overly strict about Type I error leads to greater Type II error (loss of power)

Radical idea

FWER is not that relevant in high-throughput studies

It’s OK to make multiple errors, as long as you also have some true positives!

Enter: the False Discovery Rate (FDR)

False Discovery Rate

FDR is designed to control the expected proportion of false positives (V) among all hypotheses where the null has been rejected (R)

\[FDR = E \Big[ \frac{V}{R} \Big]\]

FDR vs FPR vs FWER

  • False Discovery Rate (FDR) is the rate that significant features \((R)\) are truly null

\[FDR = E \Big[ \frac{V}{R} \Big]\]

  • False Positive Rate (FPR) is the rate that truly null features \((m_0)\) are called significant

\[FPR = E\Big[\frac{V}{m_0}\Big]\]

  • Family-Wise Error Rate (FWER) is the probability that the number of truly null features rejected \((V)\) is at least 1

\[\text{FWER } = P(V\ge1) \]

Benjamini-Hochberg FDR (BH procedure)

  • Proposed the idea of controlling FDR instead of FWER

  • Proposed a procedure for doing so

    • note that we know \(R\), but we don’t know \(V\)
  • Procedure: control FDR at level \(q\)

    1. order the raw p-values \(p_1 \le p_2 \le ...\le p_m\)

    2. find test with highest rank \(j\) such that \(p_j < \frac{jq}{m}\)

    3. declare all smaller ranks up to \(j\) significant

Controlling FDR at level \(q\) = 0.05

Rank \((j)\) P-value
1 0.0008
2 0.009
3 0.127
4 0.205
5 0.396
6 0.450
7 0.641
8 0.781
9 0.900
10 0.993

\(\text{}\)

Rank \((j)\) P-value BH-threshold: jq/m
1 0.0008 0.005
2 0.009 0.010
3 0.127 0.015
4 0.205 0.020
5 0.396 0.025
6 0.450 0.030
7 0.641 0.035
8 0.781 0.040
9 0.900 0.045
10 0.993 0.050

\(\text{}\)

Rank \((j)\) P-value BH-threshold: jq/m BH-adjusted p-value: \(\text{min}_{i\le j}(mp_{(j)}/j)\)
1 0.0008 0.005 0.0080
2 0.009 0.010 0.0450
3 0.127 0.015 0.4233
4 0.205 0.020 0.5125
5 0.396 0.025 0.7500
6 0.450 0.030 0.7500
7 0.641 0.035 0.9157
8 0.781 0.040 0.9763
9 0.900 0.045 0.9930
10 0.993 0.050 0.9930

\(\text{}\)

Rank \((j)\) P-value BH-threshold: jq/m Reject \(H_0\)?
1 0.0008 0.005 \(\checkmark\)
2 0.009 0.010 \(\checkmark\)
3 0.127 0.015
4 0.205 0.020
5 0.396 0.025
6 0.450 0.030
7 0.641 0.035
8 0.781 0.040
9 0.900 0.045
10 0.993 0.050 \(\text{ }\)

\(\text{}\)

Rank \((j)\) P-val BH-thr: jq/m Reject \(H_0\)? \(FWER < 0.05\)?
1 0.0008 0.005 \(\checkmark\) \(\checkmark\)
2 0.009 0.010 \(\checkmark\)
3 0.127 0.015
4 0.205 0.020
5 0.396 0.025
6 0.450 0.030
7 0.641 0.035
8 0.781 0.040
9 0.900 0.045
10 0.993 0.050

Where \(\alpha_{Bon}=0.05/10=0.005\)

BH FDR values given in limma by default

topTable(ebFit, coef = "genotypeNrlKO")
                  logFC  AveExpr          t      P.Value    adj.P.Val        B
1450946_at   -4.7134891 8.733103 -15.583683 5.390931e-18 2.431364e-13 28.40938
1442070_at   -0.9669400 8.268122  -9.787407 6.896220e-12 1.555132e-07 16.46086
1443457_at   -1.1530057 6.049428  -9.091412 4.947029e-11 6.029428e-07 14.68843
1422679_s_at -2.1187394 9.495435  -9.002625 6.388954e-11 6.029428e-07 14.45703
1431708_a_at -2.1610625 8.591450  -8.967950 7.062007e-11 6.029428e-07 14.36634
1433050_at   -1.5083887 8.468891  -8.923932 8.021234e-11 6.029428e-07 14.25096
1426288_at   -4.2100657 9.323796  -8.772209 1.246491e-10 8.031144e-07 13.85106
1418108_at    0.8662091 8.260647   8.537984 2.475294e-10 1.395478e-06 13.22713
1450770_at    1.3103576 7.357033   8.128847 8.332369e-10 3.747976e-06 12.11870
1457802_at   -0.8362030 7.778189  -8.117112 8.629984e-10 3.747976e-06 12.08657

Or, obtain them yourself for any vector of p-values p with p.adjust(p, method="BH"):

pvals <- topTable(ebFit, coef = "genotypeNrlKO", number = Inf)$P.Value
p.adjust(pvals, method = "BH") %>% head()
[1] 2.431364e-13 1.555132e-07 6.029428e-07 6.029428e-07 6.029428e-07
[6] 6.029428e-07

Other ways to control FDR

  • BH is just one (the first) method to control FDR

  • Since the publication of the BH method, other methods have been proposed

  • One of the most popular is Storey’s q-value

  • qvalue package implementation: provides adjusted p-values

Storey’s q-value vs BH (Conceptual)

  • Just like BH, is focused on the proportion of discoveries that are false positives

  • Conceptual difference between BH and Storey’s q-value is:

    • BH controls the FDR

    • q-values give an unbiased estimate of the FDR (will control the FDR on average)

Storey’s q-value vs BH (Mathematical)

  • Mathematically, the difference between the two is in how \(m_0\) is estimated

    • Or equivalently, how \(\pi_0=\frac{m_0}{m}\) is estimated (since \(m\) is known)

    • \(\pi_0\) represents the overall proportion of tests that are truly null

  • q-value:

\[\hat{q}(p_i) = \min_{i\le j} \Big( \frac{\hat{\pi}_0m}{j}p_{(j)}\Big)\]

  • q-value and BH-adjusted p-values are equivalent when \(\pi_0=1\)

\[\hat{p}_{BH}(p_i) = \min_{i\le j} \Big(\frac{m}{j}p_{(j)}\Big)\]

(BH conservatively assumes that \(\pi_0=1\))

BH vs q-value in toy example

Rank \((j)\) P-value \(\hat{p}_{BH}(p_i)\) \(\hat{q}(p_i)\)1
1 0.0008 0.045 0.045
2 0.009 0.045 0.045
3 0.127 0.423 0.423
4 0.205 0.513 0.513
5 0.396 0.750 0.750
6 0.450 0.750 0.750
7 0.641 0.916 0.916
8 0.781 0.976 0.976
9 0.900 0.993 0.993
10 0.993 0.993 0.993

BH vs q-value in effect of genotypeNrlKO:age

Code
# requires qvalue package
pvals <- topTable(ebFit, coef = "genotypeNrlKO:age", number = Inf)$P.Value

df <- data.frame(fdr.bh = p.adjust(pvals, method = "BH"),
                 fdr.q = qvalue::qvalue(pvals)$qvalues)

ggplot(df, aes(y = fdr.q, x = fdr.bh)) +
  geom_point(alpha = 0.1) +
  geom_abline(intercept = 0, slope = 1, color="red") +
  ggtitle("BH vs q-value for genotypeNrlKO:age") +
  xlim(0,1) +
  ylim(0,1)

Code
df <- mutate(df, 
             l_fdr.bh = -log10(fdr.bh),
             l_fdr.q = -log10(fdr.q))

ggplot(df, aes(y = l_fdr.q, x = l_fdr.bh)) +
  geom_point(alpha = 0.1) +
  geom_abline(intercept = 0, slope = 1, color="red") +
  ggtitle("BH vs q-value for genotypeNrlKO:age") +
  xlab("-log10(BH adjusted p-value)") +
  ylab("-log10(q-value)")

FDR control is an active area of research

Compounding issues of multiple comparisons

  • What if you’re not only testing 20K genes, but also multiple tests per gene (e.g. multiple contrasts, such as several two-group comparisons)?

  • Classical procedures for adjustment:

    • Tukey multiple comparison procedure
    • Scheffe multiple comparison procedure
    • Bonferroni or Holm FWER correction
  • In our setting, we can also apply BH to all p-values globally

    • limma::decideTests(pvals, method="global") for a matrix of p-values or eBayes output (e.g. rows = genes, columns = contrasts)

    • p-values are combined, adjusted globally, then separated back out and sorted

Assumptions about p-values

Note

Implicit assumption for all multiple testing correction methods: p-value distribution is “well-behaved”

What does this mean?

Primarily, that the distribution of p-values under the null is uniform (i.e. flat)

p-value distributions

Spike of small p-values indicates non-null tests.

Code
# plot histogram of p-values for genotypeNrlKO
hist(pvals)

Great primer on how things can go wrong: http://varianceexplained.org/statistics/interpreting-pvalue-histogram/

What if p-values are poorly behaved?

  • FDR estimates can be invalid (assumptions are violated)

  • Possible solution: nonparametric test

    • Limitation: lack of flexibility to adjust for multiple covariates
  • Another possible solution: estimate sampling distribution or p-values “empirically” using resampling techniques

    • Permutation: construct a simulated version of your dataset that satisfies the null hypothesis and compute statistic (e.g. shuffle group labels for a two-group comparison); repeat many times and use permutation statistics as your sampling distribution rather than a t, Normal, F, \(\chi^2\), etc

    • Limitation: often computationally intensive for genomics; not optimal for small sample sizes