Applied Statistics for High-throughput Biology: Session 3

Levi Waldron

Outline

Textbook sources

Example: friction of spider legs

Questions

Qualitative answers

table(spider$leg,spider$type)
#>     
#>      pull push
#>   L1   34   34
#>   L2   15   15
#>   L3   52   52
#>   L4   40   40
summary(spider)
#>      leg                type              friction     
#>  Length:282         Length:282         Min.   :0.1700  
#>  Class :character   Class :character   1st Qu.:0.3900  
#>  Mode  :character   Mode  :character   Median :0.7600  
#>                                        Mean   :0.8217  
#>                                        3rd Qu.:1.2400  
#>                                        Max.   :1.8400
boxplot(spider$friction ~ spider$type * spider$leg,
        col=c("grey90","grey40"), las=2,
        main="Friction coefficients of different leg pairs")

Notes:

What are linear models?

The following are examples of linear models:

  1. \(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\) (simple linear regression)
  2. \(Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i\) (quadratic regression)
  3. \(Y_i = \beta_0 + \beta_1 x_i + \beta_2 \times 2^{x_i} + \varepsilon_i\) (\(2^{x_i}\) is a new transformed variable)

Multiple linear regression model

\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]

Random part of model:

\(y_i = E[y_i|x_i] + \epsilon_i\)

Assumptions of linear models: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)

Continuous predictors

Binary predictors (2 levels)

Multilevel categorical predictors (ordinal or nominal)

Model formulae in R

Model formulae tutorial

> response variable ~ explanatory variables

Regression with a single predictor

Model formula for simple linear regression:

> y ~ x

Return to the spider legs

Friction coefficient for leg type of first leg pair:

spider.sub <- dplyr::filter(spider, leg=="L1")
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)
#> 
#> Call:
#> lm(formula = friction ~ type, data = spider.sub)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.33147 -0.10735 -0.04941 -0.00147  0.76853 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.92147    0.03827  24.078  < 2e-16 ***
#> typepush    -0.51412    0.05412  -9.499  5.7e-14 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.2232 on 66 degrees of freedom
#> Multiple R-squared:  0.5776, Adjusted R-squared:  0.5711 
#> F-statistic: 90.23 on 1 and 66 DF,  p-value: 5.698e-14

Regression on spider leg type

Regression coefficients for friction ~ type for first set of spider legs:

broom::tidy(fit)
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    0.921    0.0383     24.1  2.12e-34
#> 2 typepush      -0.514    0.0541     -9.50 5.70e-14

Interpretation of spider leg type coefficients

Diagram of the estimated coefficients in the linear model. The green arrow indicates the Intercept term, which goes from zero to the mean of the reference group (here the 'pull' samples). The orange arrow indicates the difference between the push group and the pull group, which is negative in this example. The circles show the individual samples, jittered horizontally to avoid overplotting.

Diagram of the estimated coefficients in the linear model. The green arrow indicates the Intercept term, which goes from zero to the mean of the reference group (here the ‘pull’ samples). The orange arrow indicates the difference between the push group and the pull group, which is negative in this example. The circles show the individual samples, jittered horizontally to avoid overplotting.

regression on spider leg position

Remember there are positions 1-4

fit <- lm(friction ~ leg, data=spider)
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6644 0.0538 12.34 0.0000
legL2 0.1719 0.0973 1.77 0.0784
legL3 0.1605 0.0693 2.32 0.0212
legL4 0.2813 0.0732 3.84 0.0002

Regression with multiple predictors

Additional explanatory variables can be added as follows:

> y ~ x + z

Note that “+” does not have its usual meaning, which would be achieved by:

> y ~ I(x + z)

Regression on spider leg type and position

Remember there are positions 1-4

fit <- lm(friction ~ type + leg, data=spider)
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0539 0.0282 37.43 0.0000
typepush -0.7790 0.0248 -31.38 0.0000
legL2 0.1719 0.0457 3.76 0.0002
legL3 0.1605 0.0325 4.94 0.0000
legL4 0.2813 0.0344 8.18 0.0000

Interaction (effect modification)

Interaction is modeled as the product of two covariates: \[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1*x_2 \]

Interaction between coffee and time of day on performance
Interaction between coffee and time of day on performance

Image credit: http://personal.stevens.edu/~ysakamot/

Model formulae (cont’d)

symbol example meaning
+ + x include this variable
- - x delete this variable
: x : z include the interaction
* x * z include these variables and their interactions
^ (u + v + w)^3 include these variables and all interactions up to three way
1 -1 intercept: delete the intercept

Note: order generally doesn’t matter (u+v OR v+u)

Summary: types of standard linear models

lm( y ~ u + v)

u and v factors: ANOVA
u and v numeric: multiple regression
one factor, one numeric: ANCOVA

Generalized Linear Models

Components of a GLM

\[ g\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]

Linear Regression as GLM

Logistic Regression as GLM

\[ g(P(x)) = logit(P(x)) = log\left( \frac{P(x)}{1-P(x)} \right) \]

\[ P(x) = g^{-1}\left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]

Log-linear GLM

The systematic part of the GLM is:

\[ log\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + log(t_i) \]

Poisson error model

\[ f(k, \lambda) = e^{-\lambda} \frac{\lambda^k}{k!} \]

Negative Binomial error model

\[ f(k, \lambda, \theta) = \frac{\Gamma(\frac{1 + \theta k}{\theta})}{k! \, \Gamma(\frac{1}{\theta})} \left(\frac{\theta m}{1+\theta m}\right)^k \left(1+\theta m\right)^\theta \quad\text{for }k = 0, 1, 2, \dotsc \]

Compare Poisson vs. Negative Binomial

Additive vs. multiplicative models

Inference in high dimensions (many variables)

Multiple testing

  1. Family-wise error rate (e.g., Bonferroni correction).
    • Controlling FWER at 0.05 ensures that the probably of any type I errors is < 0.05.
  2. False Discovery Rate (e.g., Benjamini-Hochberg correction)

Benjamini-Hochberg FDR algorithm

Source: MSMB Chapter 6

  1. order the p-values from \(m\) hypothesis tests in increasing order, \(p_1, \ldots, p_m\)
  2. for some choice of \(\phi\) (our target FDR), find the largest value of that \(k\) that satisfies: \(p_k \leq \phi k/m\)
  3. reject the hypotheses \(1, \ldots, k\)
Benjamini-Hochberg FDR, visuallyBenjamini-Hochberg FDR, visually

Benjamini-Hochberg FDR, visually

Important notes for intuition:

FDR alternatives to Benjamini-Hochberg

Beware of “double-dipping” in statistical inference

  1. define a separation between observations
  2. test for a difference across the separation
Example from: García-Mantrana et al., Gut Microbes 2020
Example from: García-Mantrana et al., Gut Microbes 2020

Simple example of double-dipping

Step 1: define an age classifier

Step 2: test for a difference in ages between elderly and youth

IMPORTANT: Even applying a fully-specified classifier to a validation dataset does not protect against inflated p-values from “double-dipping”

Summary

Exercises

Please discuss the following questions:

  1. What is a major problem with the hypothesis testing in 4.6 Testing for between-label differences?
    • (note, the inference problem is acknowledged in this section)
  2. What is a related problem with the hypothesis testing in 4.4 Performing the DE analysis?
  3. How might you avoid these same problems, with the same data or a multi’omic technology?