Two Sample t-test
data: expression by genotype
t = 0.52854, df = 37, p-value = 0.6003
alternative hypothesis: true difference in means between group WT and group NrlKO is not equal to 0
95 percent confidence interval:
-0.07384018 0.12595821
sample estimates:
mean in group WT mean in group NrlKO
7.765671 7.739612
Discussion recap
What test should I use?
What test(s) might be appropriate if your sample size is just barely large enough to invoke CLT, but you also have suspected outliers?
If more than one test is appropriate (e.g. t-test, Wilcoxon, and KS), which should we report?
What should you do if methods that are equally appropriate and defensible give very different answers?
What is generally more important for results interpretation: the effect size or the p-value?
Key question
Why should I bother with the assumptions of the t-test, which aren’t necessary for the Wilcoxon or KS?
Today’s Learning Objectives
Compare means of different groups (2 or more) using a linear regression model
Use ‘indicator’ variables to represent the levels of a qualitative explanatory variable
Write a linear model using matrix notation and understand which matrix is built by R
Distinguish between single and joint hypothesis tests (e.g. \(t\)-tests vs \(F\)-tests)
Two Sample t-test
data: expression by genotype
t = 0.52854, df = 37, p-value = 0.6003
alternative hypothesis: true difference in means between group WT and group NrlKO is not equal to 0
95 percent confidence interval:
-0.07384018 0.12595821
sample estimates:
mean in group WT mean in group NrlKO
7.765671 7.739612
(One-way) Analysis of Variance (ANOVA)
filter(twoGenes, gene == "Irs4") %>% aov(expression ~ genotype, data = .) %>% summary()
Df Sum Sq Mean Sq F value Pr(>F)
genotype 1 0.0066 0.006617 0.279 0.6
Residuals 37 0.8764 0.023685
Linear Regression
filter(twoGenes, gene == "Irs4") %>% lm(expression ~ genotype, data = .) %>% summary()
Call:
lm(formula = expression ~ genotype, data = .)
Residuals:
Min 1Q Median 3Q Max
-0.36553 -0.09067 0.01724 0.09396 0.36496
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.76567 0.03441 225.659 <2e-16 ***
genotypeNrlKO -0.02606 0.04930 -0.529 0.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1539 on 37 degrees of freedom
Multiple R-squared: 0.007494, Adjusted R-squared: -0.01933
F-statistic: 0.2794 on 1 and 37 DF, p-value: 0.6003
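The three outputs above (t-test, ANOVA, and lm) agree on the p-value, and the squared t statistic equals the F statistic. A minimal sketch of this equivalence, using simulated two-group data rather than the course's twoGenes dataset:

```r
# Simulated data (39 samples, two genotypes), standing in for the Irs4 example
set.seed(1)
d <- data.frame(
  genotype = factor(rep(c("WT", "NrlKO"), times = c(20, 19)),
                    levels = c("WT", "NrlKO")),
  expression = rnorm(39, mean = 7.75, sd = 0.15)
)

# Equal-variance two-sample t-test and the equivalent linear model
tt  <- t.test(expression ~ genotype, data = d, var.equal = TRUE)
fit <- lm(expression ~ genotype, data = d)

# The squared t statistic equals the overall F statistic from lm()
fstat <- summary(fit)$fstatistic["value"]
all.equal(unname(tt$statistic)^2, unname(fstat))  # TRUE
```

Note that the equivalence requires `var.equal = TRUE`; the default Welch t-test does not match the lm/ANOVA output exactly.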
where \(j \in \{\textrm{WT, NrlKO}\}\) (or \(j \in \{1, 2\}\)) identifies the groups; and \(i=1, \ldots, n_j\) identifies the observations within each group
For example: \(Y_{11}\) is the first observation in group 1 or WT
where \(j\) indexes groups (e.g. WT vs NrlKO) and \(i\) indexes samples within group, the goal is to test \(H_0 : \mu_1 = \mu_2\)
Note
In the cell-means model parameterization, we have a parameter \(E[Y_{ij}] = \mu_j\) that represents the population mean of each group (in our example: genotype)
Important
We assume a common distribution \(G\) for all groups (equal variance assumption)
Why the name? ‘Cell’ here refers to a cell of a table - e.g. make a table of means by group, and \(\mu_j\) represents the population value for each cell \(j\) in the table
Recall: sample mean estimator of population mean
For each group \(j\), the population mean is given by \(E[Y_{ij}] = \mu_j\)
A natural estimator of the population mean \(\mu_j\) is the sample mean \(\hat{\mu}_j = \bar{Y_j} = \frac{\sum_{i=1}^{n_j}Y_{ij}}{n_j}\)
Recall that the t.test function calculates these for us in R
But why does the lm function report different estimates?
Two Sample t-test
data: expression by genotype
t = 0.52854, df = 37, p-value = 0.6003
alternative hypothesis: true difference in means between group WT and group NrlKO is not equal to 0
95 percent confidence interval:
-0.07384018 0.12595821
sample estimates:
mean in group WT mean in group NrlKO
7.765671 7.739612
Note that for each group, the population mean is given by \(E[Y_{ij}] = \theta+\tau_j=\mu_j\), and \(\tau_2=\mu_2-\mu_1=E[Y_{i2}] -E[Y_{i1}]\) compares the means
\(\tau_1\) must be set to zero, since group 1 is the reference group
Note
In the reference-treatment effect model parameterization, we have the following parameters:
\(\theta\) represents the population mean of the reference group (in our example: WT)
\(\tau_j\) represents the difference in the population mean of group \(j\) compared to the reference (in our example: NrlKO - WT)
Relation between parameterizations
lm output
the sample mean of the WT group (reference): \(\hat\theta\)
the difference in sample mean of NrlKO and WT groups (treatment effect): \(\hat\tau_2\)
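A quick numerical check of this mapping, sketched with simulated data (not the course's twoGenes dataset): the lm() intercept recovers the reference-group sample mean, and intercept plus slope recovers the other group's mean.

```r
# Simulated two-group data; WT is the reference level
set.seed(1)
d <- data.frame(
  genotype = factor(rep(c("WT", "NrlKO"), times = c(20, 19)),
                    levels = c("WT", "NrlKO")),
  expression = rnorm(39, mean = 7.75, sd = 0.15)
)

fit   <- lm(expression ~ genotype, data = d)
means <- tapply(d$expression, d$genotype, mean)  # group sample means

# theta-hat = WT sample mean
all.equal(unname(coef(fit)["(Intercept)"]), unname(means["WT"]))      # TRUE
# theta-hat + tau-hat = NrlKO sample mean
all.equal(unname(sum(coef(fit))), unname(means["NrlKO"]))             # TRUE
```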
Note that \(Y_{i1} = \theta + \varepsilon_{i1}\), because \(x_{i1}=0\) and \(Y_{i2} = \theta + \tau+ \varepsilon_{i2}\), because \(x_{i2}=1\) (for all \(i\))
The second form is written as a linear ( \(y=a + bx +\varepsilon\) ) regression model, with an (indicator) explanatory variable \(x_{ij}\)
t-test with a linear model
Note
Using indicator variables to model our categorical variable genotype, we can perform a 2-sample t-test with a linear model
\[Y_{ij} = \theta+\tau x_{ij} + \varepsilon_{ij}\;\text{where}\; \; x_{ij}=\bigg\{\begin{array}{l}
1 \text{ if } j=2\\
0 \text{ if } j=1\\
\end{array}\]
The standalone t-test is carried out on \(H_0: \mu_1 = \mu_2\)
The t-test in the linear model is carried out on \(H_0: \tau = 0\), where \(\tau\) is the difference in population means (here NrlKO - WT)
Recall that \(\tau = \mu_2 - \mu_1\) - this is why these are equivalent tests!
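A sketch of this equivalence in code (simulated data standing in for the course's twoGenes dataset): the p-value for \(H_0: \tau = 0\) in lm() matches the equal-variance two-sample t-test of \(H_0: \mu_1 = \mu_2\).

```r
# Simulated two-group data, WT as reference level
set.seed(1)
d <- data.frame(
  genotype = factor(rep(c("WT", "NrlKO"), times = c(20, 19)),
                    levels = c("WT", "NrlKO")),
  expression = rnorm(39, mean = 7.75, sd = 0.15)
)

tt  <- t.test(expression ~ genotype, data = d, var.equal = TRUE)
fit <- lm(expression ~ genotype, data = d)

# p-value for the genotypeNrlKO slope equals the t-test p-value
p_lm <- summary(fit)$coefficients["genotypeNrlKO", "Pr(>|t|)"]
all.equal(tt$p.value, p_lm)  # TRUE
```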
Beyond 2-group comparisons
Note
Indicator variables can be used to model one or more categorical variables, each with 2 or more levels!
1-way ANOVA with many levels using a linear model - e.g. for 3 groups: \[Y_{ij} = \theta+\tau_2 x_{ij2} + \tau_3 x_{ij3} +\varepsilon_{ij}\;\; \text{where} \; x_{ij2}=\bigg\{\begin{array}{l}
1\text{ if } j=2\\
0 \text{ otherwise}\\
\end{array}\; \text{ and } \; x_{ij3}=\bigg\{\begin{array}{l}
1\text{ if } j=3\\
0 \text{ otherwise}\\
\end{array}\]
Important
This equivalence is why R can estimate all of them with lm()
Connections
Important
The t-test is a special case of ANOVA, but with ANOVA you can compare more than two groups and more than one factor.
ANOVA is a special case of linear regression, but with linear regression you can include quantitative variables in the model.
Linear regression provides a unifying framework to model the association between a response and many quantitative and qualitative variables.
In R all three can be computed using the lm() function.
Linear models using matrix notation
It will come in handy to write our model using matrix notation
Design matrix
Let’s form a design matrix \((X)\) for a 3-group comparison
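A minimal sketch, using generic group labels rather than the course data: model.matrix() builds the indicator columns of \(X\) for us, one intercept column plus one indicator per non-reference level.

```r
# Hypothetical 3-level factor with 2 observations per group
g <- factor(rep(c("grp1", "grp2", "grp3"), each = 2))

# Default (reference-treatment effect) design matrix:
# columns (Intercept), ggrp2, ggrp3
X <- model.matrix(~ g)
X
```

The reference level (here grp1) gets no column of its own; its mean is carried by the intercept, matching the constraint \(\tau_1 = 0\).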
F-test and overall significance of one or more coefficients
the t-test in linear regression allows us to test single hypotheses: \[H_0 : \tau_j = 0\]\[H_A : \tau_j \neq 0\]
but we often want to test multiple hypotheses simultaneously: \[H_0 : \tau_{P2} = \tau_{P6} = \tau_{P10} = \tau_{P28}=0\textrm{ [AND statement]}\]\[H_A : \tau_j \neq 0 \textrm{ for some } j \textrm{ [OR statement]}\]
the F-test allows us to test such compound hypotheses
more on this type of test next week
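One way to see the joint test explicitly (a sketch with simulated 3-group data, not the course dataset): compare the full model against the intercept-only model with anova(), which performs exactly this F-test.

```r
# Simulated data: 3 groups, 10 observations each
set.seed(1)
d <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), each = 10)),
  y     = rnorm(30)
)

full    <- lm(y ~ group, data = d)  # theta + tau_2 + tau_3
reduced <- lm(y ~ 1, data = d)      # intercept only (H0: tau_2 = tau_3 = 0)

# F-test of the joint null, on 2 and 27 degrees of freedom
anova(reduced, full)
```

This matches the "F-statistic" line printed at the bottom of summary(lm(...)).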
Single and joint tests in lm output
Can you locate the results of each type of test in the lm output?
\(H_0: \tau_j = 0\) vs \(H_1: \tau_j \neq 0\) for each \(j\) individually
\(H_0: \tau_j = 0\) vs \(H_1: \tau_j \neq 0\) for all \(j\) together
twoGenes %>% filter(gene == "BB114814") %>% lm(expression ~ dev_stage, data = .) %>% summary()
Call:
lm(formula = expression ~ dev_stage, data = .)
Residuals:
Min 1Q Median 3Q Max
-0.78553 -0.13324 -0.04796 0.17038 0.51846
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.5409 0.1022 54.240 < 2e-16 ***
dev_stageP2 0.3038 0.1399 2.172 0.0369 *
dev_stageP6 0.2433 0.1399 1.739 0.0911 .
dev_stageP10 0.8341 0.1399 5.963 9.62e-07 ***
dev_stageP28 3.6324 0.1399 25.967 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2703 on 34 degrees of freedom
Multiple R-squared: 0.9662, Adjusted R-squared: 0.9623
F-statistic: 243.3 on 4 and 34 DF, p-value: < 2.2e-16
To conclude - key ideas from today
We can compare group means (2 or more) using a linear model
We can use different parameterizations (cell means and reference-treatment effect) to write statistical models
We can write a linear model using matrix notation: \(Y = X \alpha + \varepsilon\)
Linear models can include quantitative & qualitative covariates
We use different tests to distinguish between single and joint hypotheses (e.g. \(t\)-tests vs \(F\)-tests)
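The matrix form \(Y = X\alpha + \varepsilon\) can be verified directly: the least-squares estimate \(\hat\alpha = (X^TX)^{-1}X^TY\) matches what lm() reports. A sketch with simulated 3-group data (not the course dataset):

```r
# Simulated response for a 3-level factor
set.seed(1)
g <- factor(rep(c("g1", "g2", "g3"), each = 5))
y <- rnorm(15)

# Build the design matrix and solve the normal equations (X'X) alpha = X'y
X <- model.matrix(~ g)
alpha_hat <- solve(t(X) %*% X, t(X) %*% y)

# Same estimates as lm(), up to numerical tolerance
all.equal(as.vector(alpha_hat), unname(coef(lm(y ~ g))))  # TRUE
```

(lm() uses a QR decomposition rather than the normal equations, which is more numerically stable, but the estimates agree.)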