1 Day 2 outline

Textbook sources:

2 Introduction to Linear Models

2.1 Example: friction of spider legs

[Figure: (A) barplot showing total claw tuft area of the corresponding legs; (B) boxplot of friction coefficient data illustrating median, interquartile range, and extreme values]

  • Are the pulling and pushing friction coefficients different?
  • Are the friction coefficients different for the different leg pairs?
  • Does the difference between pulling and pushing friction coefficients vary by leg pair?
# assumes the spider data frame has been loaded,
# e.g. spider <- read.csv("spider_wolff_gorb_2013.csv")
table(spider$leg, spider$type)
summary(spider)
boxplot(spider$friction ~ spider$type * spider$leg,
        col=c("grey90", "grey40"), las=2,
        main="Friction coefficients of different leg pairs")

Notes:

  • Pulling friction is higher
  • Pulling (but not pushing) friction increases for the more posterior leg pairs (L1 -> L4)
  • Variance isn’t constant

2.2 What are linear models?

The following are examples of linear models:

  1. \(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\) (simple linear regression)
  2. \(Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i\) (quadratic regression)
  3. \(Y_i = \beta_0 + \beta_1 x_i + \beta_2 \times 2^{x_i} + \varepsilon_i\) (here \(2^{x_i}\) is a new transformed variable)
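
All three fit with lm() because each is linear in the coefficients; a minimal sketch on simulated data (x and y are invented for illustration):

set.seed(1)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50)
fit1 <- lm(y ~ x)           # 1. simple linear regression
fit2 <- lm(y ~ x + I(x^2))  # 2. quadratic regression
fit3 <- lm(y ~ x + I(2^x))  # 3. 2^x as a transformed predictor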

2.3 Multiple linear regression model

  • Linear models can have any number of predictors
  • Systematic part of model:

\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]

  • \(E[y|x]\) is the expected value of \(y\) given \(x\)
  • \(y\) is the outcome, response, or dependent variable
  • \(x\) is the vector of predictors / independent variables
  • \(x_1, \dotsc, x_p\) are the individual predictors or independent variables
  • \(\beta_0, \dotsc, \beta_p\) are the regression coefficients

Random part of model:

\(y_i = E[y_i|x_i] + \epsilon_i\)

Assumptions of linear models: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)

  • Errors are normally distributed
  • Errors have mean zero at every value of the predictors
  • Errors have constant variance at every value of the predictors
  • Errors are statistically independent of one another
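
As an illustrative sketch (not part of the original notes), simulated data satisfying these assumptions, checked with R's built-in residual diagnostics:

# Simulate data meeting the assumptions: iid normal errors, constant variance
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, mean = 0, sd = 0.5)
fit <- lm(y ~ x)
par(mfrow = c(2, 2))
plot(fit)  # residuals vs. fitted, QQ plot, scale-location, leverage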

2.4 Continuous predictors

  • Coding: as-is, or scaled to unit variance (which rescales the regression coefficients accordingly)
  • Interpretation for linear regression: a one-unit increase in the predictor corresponds to a \(\beta\)-unit difference in the mean of the continuous outcome variable

2.5 Binary predictors (2 levels)

  • Coding: indicator or dummy variable (0-1 coding)
  • Interpretation for linear regression: the increase or decrease in average outcome levels in the group coded “1”, compared to the reference category (“0”)
    • e.g. \(E(y|x) = \beta_0 + \beta_1 x\)
    • where x={ 1 if push friction, 0 if pull friction }

2.6 Multilevel categorical predictors (ordinal or nominal)

  • Coding: \(K-1\) dummy variables for \(K\)-level categorical variable
  • Comparisons with respect to a reference category, e.g. L1:
    • L2={1 if \(2^{nd}\) leg pair, 0 otherwise},
    • L3={1 if \(3^{rd}\) leg pair, 0 otherwise},
    • L4={1 if \(4^{th}\) leg pair, 0 otherwise}.
  • R re-codes factors to dummy variables automatically (a sketch follows below).
  • Dummy coding depends on whether the factor is ordered or not.
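
For instance, a minimal sketch of the dummy coding R generates via model.matrix() (the toy factor here is invented):

# An unordered K-level factor becomes K-1 indicator columns
leg <- factor(rep(c("L1", "L2", "L3", "L4"), each = 2))
model.matrix(~ leg)  # L1 is absorbed into the intercept as the reference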

2.7 Mini-exercise

Using the spider object:

  1. Create a new variable leg2 where L1 / L2 are merged, and L3 / L4 are merged
  2. Create a new variable leg3 that is an “ordered” factor (see ?factor)
  3. Make “push” the reference level for spider$type (see ?relevel)
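
One possible solution sketch, assuming spider is loaded; the merged-level labels are our own choice:

# 1. Merge L1/L2 and L3/L4 (labels "front"/"back" are arbitrary)
spider$leg2 <- factor(ifelse(spider$leg %in% c("L1", "L2"), "front", "back"))
# 2. An ordered factor over the leg pairs
spider$leg3 <- factor(spider$leg, levels = c("L1", "L2", "L3", "L4"), ordered = TRUE)
# 3. Make "push" the reference level (coerce to factor first if needed)
spider$type <- relevel(factor(spider$type), ref = "push")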

3 Model formulae in R

Model formulae tutorial

> response variable ~ explanatory variables

3.1 Regression with a single predictor

Model formula for simple linear regression:

> y ~ x

  • where “x” is the explanatory (independent) variable
  • “y” is the response (dependent) variable.

3.2 Return to the spider legs

Friction coefficient by leg type, for the first leg pair only:

spider.sub <- spider[spider$leg=="L1", ]
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)

3.3 Regression on spider leg type

Regression coefficients for friction ~ type for the first leg pair:

fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")

  • How to interpret this table?
    • Coefficients for (Intercept) and typepush
    • Coefficient estimates divided by their standard errors follow a t-distribution when the assumptions hold
    • Variance in the estimates of each coefficient can be calculated

3.4 Interpretation of spider leg type coefficients

  • (Intercept) is the estimated mean friction coefficient for the reference category (pull) in the first leg pair
  • typepush is the estimated difference in mean friction, push vs. pull

3.5 Regression on spider leg position

Remember there are positions 1-4

fit <- lm(friction ~ leg, data=spider)
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")

  • Interpretation of the dummy variables legL2, legL3, legL4?

3.6 Regression with multiple predictors

Additional explanatory variables can be added as follows:

> y ~ x + z

Note that “+” does not have its usual meaning, which would be achieved by:

> y ~ I(x + z)

3.7 Regression on spider leg type and position

Remember there are positions 1-4

fit <- lm(friction ~ type + leg, data=spider)
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")

  • This model still doesn’t capture how the friction differences between leg positions are modified by whether the leg is pulling or pushing

3.8 Interaction (effect modification)

Interaction is modeled as the product of two covariates: \[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 \]

[Figure: interaction between coffee and time of day on performance]

Image credit: http://personal.stevens.edu/~ysakamot/

3.9 Model formulae (cont’d)

symbol   example          meaning
+        + x              include this variable
-        - x              delete this variable
:        x : z            include the interaction
*        x * z            include these variables and their interactions
^        (u + v + w)^3    include these variables and all interactions up to three-way
1        - 1              delete the intercept

Note: order generally doesn’t matter (u+v OR v+u)

3.10 Summary: types of standard linear models

lm(y ~ u + v)

  • u and v both factors: ANOVA
  • u and v both numeric: multiple regression
  • one factor, one numeric: ANCOVA

  • R does a lot for you based on your variable classes
    • be sure you know the classes of your variables (a quick check is sketched below)
    • be sure all rows of your regression output make sense
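
A sketch of the kind of class check meant here, assuming the spider data frame:

sapply(spider, class)  # e.g. leg and type should be factors, friction numeric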

3.11 Mini-exercise

Perform regression on the spider dataset to model friction as a function of type, leg, and the interaction of type and leg. Interpret the output of this regression.
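
A minimal sketch of the fit itself (interpretation left to the exercise):

# type * leg expands to type + leg + type:leg
fit <- lm(friction ~ type * leg, data = spider)
summary(fit)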

4 Generalized Linear Models

4.1 Components of a GLM

\[ g\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]

  • Random component specifies the conditional distribution for the response variable
    • doesn’t have to be normal
    • can be any distribution in the “exponential” family of distributions
  • Systematic component specifies linear function of predictors (linear predictor)
  • Link [denoted by g(.)] specifies the relationship between the expected value of the random component and the systematic component
    • can be linear or nonlinear

4.2 Linear Regression as GLM

  • Useful for log-transformed microarray data

  • The model: \(y_i = E[y_i|x_i] + \epsilon_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)

  • Random component of \(y_i\) is normally distributed: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)

  • Systematic component (linear predictor): \(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi}\)

  • Link function here is the identity link: \(g(E(y | x)) = E(y | x)\). We are modeling the mean directly, no transformation.
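
As a sketch of this equivalence (assuming the spider data frame), an identity-link Gaussian GLM reproduces lm():

fit.lm  <- lm(friction ~ type + leg, data = spider)
fit.glm <- glm(friction ~ type + leg, data = spider, family = gaussian())
all.equal(coef(fit.lm), coef(fit.glm))  # same coefficients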

4.3 Logistic Regression as GLM

  • Useful for binary outcomes, e.g. Single Nucleotide Polymorphisms or somatic variants

  • The model: \[ Logit(P(x)) = log \left( \frac{P(x)}{1-P(x)} \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]

  • Random component: \(y_i\) follows a Binomial distribution (outcome is a binary variable)

  • Systematic component: linear predictor \[ \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]

  • Link function: logit (log of the odds that the event occurs)

\[ g(P(x)) = logit(P(x)) = log\left( \frac{P(x)}{1-P(x)} \right) \]

\[ P(x) = g^{-1}\left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]
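
A simulated sketch of fitting such a model with glm() (all data invented for illustration):

# Binary outcome whose log-odds are linear in x
set.seed(1)
x <- rnorm(100)
p <- 1 / (1 + exp(-(-0.5 + 1.2 * x)))  # inverse logit of the linear predictor
y <- rbinom(100, size = 1, prob = p)
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(summary(fit))  # estimates are on the log-odds scale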

4.4 Log-linear GLM

The systematic part of the GLM is:

\[ log\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + log(t_i) \]

  • Common for count data
    • can account for differences in sequencing depth by an offset \(log(t_i)\)
    • guarantees non-negative expected number of counts
    • often used in conjunction with Poisson or Negative Binomial error models
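
A simulated sketch of a Poisson log-linear fit with a depth offset (variable names and values invented):

# Counts scale with depth t; the group effect x acts on the log scale
set.seed(1)
t <- runif(50, min = 1, max = 10)
x <- rbinom(50, size = 1, prob = 0.5)
y <- rpois(50, lambda = t * exp(0.5 + 1.0 * x))
fit <- glm(y ~ x + offset(log(t)), family = poisson())
coef(summary(fit))  # intercept near 0.5, x effect near 1.0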

4.5 Poisson error model

\[ f(k, \lambda) = e^{-\lambda} \frac{\lambda^k}{k!} \]

  • where \(f\) is the probability of \(k\) events (e.g. # of reads counted), and
  • \(\lambda\) is the mean number of events, so \(E[y|x]\)
  • \(\lambda\) is also the variance of the number of events
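
A quick simulated check that the mean and variance are both \(\lambda\):

k <- rpois(1e5, lambda = 4)
c(mean = mean(k), variance = var(k))  # both approximately 4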

4.6 Negative Binomial error model

  • aka gamma–Poisson mixture distribution

\[ f(k; \lambda, \theta) = \frac{\Gamma\left(\frac{1 + \theta k}{\theta}\right)}{k! \, \Gamma\left(\frac{1}{\theta}\right)} \left(\frac{\theta \lambda}{1+\theta \lambda}\right)^k \left(1+\theta \lambda\right)^{-\frac{1}{\theta}} \quad\text{for } k = 0, 1, 2, \dotsc \]

  • where \(f\) is still the probability of \(k\) events (e.g. # of reads counted),
  • \(\lambda\) is still the mean number of events, so \(E[y|x]\)
  • An additional dispersion parameter \(\theta\) is estimated:
    • \(\theta \rightarrow 0\): Poisson distribution
    • \(\theta \rightarrow \infty\): Gamma distribution
  • The Poisson model can be considered as nested within the Negative Binomial model
    • A likelihood ratio test comparing the two models is possible (sketched below)
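
A sketch of that likelihood ratio test using MASS::glm.nb() on simulated overdispersed counts (the boundary issue for testing \(\theta\) is glossed over here):

library(MASS)
set.seed(1)
y <- rnbinom(200, mu = 10, size = 2)  # overdispersed relative to Poisson
fit.pois <- glm(y ~ 1, family = poisson())
fit.nb   <- glm.nb(y ~ 1)
lrt <- as.numeric(2 * (logLik(fit.nb) - logLik(fit.pois)))
pchisq(lrt, df = 1, lower.tail = FALSE)  # approximate LRT p-value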

4.7 Compare Poisson vs. Negative Binomial

dnbinom(): in R, the Negative Binomial distribution has two parameters, size (the target number of successful trials, equal to \(1/\theta\) above) and prob (the probability of success); an alternative parameterization uses the mean mu and size.
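
A sketch of such a comparison, matching the means (parameter values invented):

# Same mean, different variance: the Negative Binomial has a heavier right tail
k <- 0:25
plot(k, dpois(k, lambda = 8), type = "h",
     xlab = "counts", ylab = "probability")
points(k + 0.3, dnbinom(k, mu = 8, size = 2), type = "h", col = "red")
legend("topright", legend = c("Poisson", "Negative Binomial"),
       col = c("black", "red"), lty = 1)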

4.8 Additive vs. multiplicative models

  • Linear regression is an additive model
    • e.g. for two binary variables \(\beta_1 = 1.5\), \(\beta_2 = 1.5\).
    • If \(x_1=1\) and \(x_2=1\), this adds 3.0 to \(E(y|x)\)
  • Logistic and log-linear models are multiplicative:
    • If \(x_1=1\) and \(x_2=1\), this adds 3.0 to \(log(\frac{P}{1-P})\)
    • The odds \(\frac{P}{1-P}\) increase about 20-fold: \(exp(1.5+1.5)\), or equivalently \(exp(1.5) * exp(1.5)\) (checked below)
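
The arithmetic, checked in R:

exp(1.5 + 1.5)       # ~20.1: additive on the log-odds scale...
exp(1.5) * exp(1.5)  # ...is multiplicative on the odds scale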

5 Inference in high dimensions (many variables)

5.1 Multiple testing

  • When testing thousands of true null hypotheses with \(\alpha = 0.05\), you expect a 5% type I error rate
  • Which p-values are smaller than you would expect by chance under multiple testing?
  • Two mainstream approaches for controlling the type I error rate:
  1. Family-wise error rate (e.g., Bonferroni correction)
    • Controlling FWER at 0.05 ensures that the probability of any type I errors is < 0.05
  2. False Discovery Rate (e.g., Benjamini-Hochberg correction)
    • Controlling FDR at 0.05 ensures that the expected fraction of type I errors among rejected hypotheses is < 0.05
    • the correction for the smallest p-value is the same as the Bonferroni correction
    • the correction for larger p-values becomes less stringent

5.2 Mini-exercise

Simulate the following vectors of p-values:

set.seed(1)
p1 <- runif(1000)
p2 <- c(runif(900), runif(100, min=0, max=0.05))

  1. Make histograms of these p-values. Does it look like there are more small p-values than expected if all null hypotheses are true?
  2. Use the p.adjust function to calculate Bonferroni-corrected (“bonferroni”) p-values and Benjamini-Hochberg False Discovery Rate (“BH”) adjusted values. For each of these simulations, how many p-values are < 0.05 for the raw, Bonferroni-adjusted, and BH-adjusted values?
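
One possible solution sketch:

par(mfrow = c(1, 2))
hist(p1); hist(p2)  # p2 shows an excess of small p-values
# Count discoveries at 0.05 before and after adjustment
sapply(list(raw = p1, bonferroni = p.adjust(p1, "bonferroni"),
            BH = p.adjust(p1, "BH")), function(p) sum(p < 0.05))
sapply(list(raw = p2, bonferroni = p.adjust(p2, "bonferroni"),
            BH = p.adjust(p2, "BH")), function(p) sum(p < 0.05))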

6 Summary