Levi Waldron
Textbook sources:
##
##     pull push
##  L1   34   34
##  L2   15   15
##  L3   52   52
##  L4   40   40
##      leg                type              friction
##  Length:282         Length:282         Min.   :0.1700
##  Class :character   Class :character   1st Qu.:0.3900
##  Mode  :character   Mode  :character   Median :0.7600
##                                        Mean   :0.8217
##                                        3rd Qu.:1.2400
##                                        Max.   :1.8400
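The counts and summaries above are the kind of output produced by calls along these lines (a sketch, assuming the spider data frame with columns leg, type, and friction has already been loaded):
table(spider$leg, spider$type)   # number of pull/push observations per leg pair
summary(spider)                  # column summaries, including the friction coefficient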
boxplot(spider$friction ~ spider$type * spider$leg,
col=c("grey90","grey40"), las=2,
main="Friction coefficients of different leg pairs")
Notes:
Linear models have the general form:
\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]
Random part of model:
\(y_i = E[y_i|x_i] + \epsilon_i\)
Assumptions of linear models: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
Indicator (dummy) variables for the leg pairs, with L1 as the reference level:
L2 = {1 if \(2^{nd}\) leg pair, 0 otherwise},
L3 = {1 if \(3^{rd}\) leg pair, 0 otherwise},
L4 = {1 if \(4^{th}\) leg pair, 0 otherwise}.
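This dummy coding can be inspected directly with model.matrix() (a quick sketch, assuming the spider data frame is loaded):
head(model.matrix(~ leg, data = spider))   # columns legL2, legL3, legL4 are the indicators; L1 is captured by the intercept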
Spider mini-exercises. Using the spider object:
1. Create a variable leg2 where L1 / L2 are merged, and L3 / L4 are merged (hint: see ?dplyr::recode_factor).
2. Create a variable leg3 that is an "ordered" factor (hint: you can use recode_factor again, or ?factor).
3. Change the reference level of spider$type (hint: see ?relevel).
One possible approach is sketched below.
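A sketch of possible solutions (the merged level names "L1L2" and "L3L4", and the choice of "push" as the new reference, are arbitrary illustrations, not from the source):
library(dplyr)
spider$leg2 <- recode_factor(spider$leg, L1 = "L1L2", L2 = "L1L2",
                             L3 = "L3L4", L4 = "L3L4")     # merge L1/L2 and L3/L4
spider$leg3 <- factor(spider$leg, levels = c("L1", "L2", "L3", "L4"),
                      ordered = TRUE)                       # ordered factor
spider$type <- relevel(factor(spider$type), ref = "push")   # change the reference level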
aov(), lm(), glm(), and coxph() use a "model formula" interface:
> response variable ~ explanatory variables
Model formula for simple linear regression:
> y ~ x
Friction coefficient for leg type of first leg pair:
suppressPackageStartupMessages(library(tidyverse))
spider.sub <- filter(spider, leg=="L1")
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)
##
## Call:
## lm(formula = friction ~ type, data = spider.sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33147 -0.10735 -0.04941 -0.00147 0.76853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92147 0.03827 24.078 < 2e-16 ***
## typepush -0.51412 0.05412 -9.499 5.7e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2232 on 66 degrees of freedom
## Multiple R-squared: 0.5776, Adjusted R-squared: 0.5711
## F-statistic: 90.23 on 1 and 66 DF, p-value: 5.698e-14
Regression coefficients for friction ~ type
for first set of spider legs:
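A coefficient table in this form can be produced with the broom package (a sketch, assuming broom is installed and using the fit object from above):
broom::tidy(fit)   # columns: term, estimate, std.error, statistic, p.value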
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.921 0.0383 24.1 2.12e-34
## 2 typepush -0.514 0.0541 -9.50 5.70e-14
Diagram of the estimated coefficients in the linear model. The green arrow indicates the Intercept term, which goes from zero to the mean of the reference group (here the ‘pull’ samples). The orange arrow indicates the difference between the push group and the pull group, which is negative in this example. The circles show the individual samples, jittered horizontally to avoid overplotting.
Modeling friction as a function of leg pair alone (remember there are leg positions 1-4):
term | Estimate | Std. Error | t value | Pr(>|t|)
---|---|---|---|---
(Intercept) | 0.6644 | 0.0538 | 12.34 | 0.0000
legL2 | 0.1719 | 0.0973 | 1.77 | 0.0784
legL3 | 0.1605 | 0.0693 | 2.32 | 0.0212
legL4 | 0.2813 | 0.0732 | 3.84 | 0.0002
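These estimates match a model with leg pair as the only predictor; a sketch of the likely call (assuming the full spider data frame):
fit_leg <- lm(friction ~ leg, data = spider)   # L1 is the reference leg pair
summary(fit_leg)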
Additional explanatory variables can be added as follows:
> y ~ x + z
Note that “+” does not have its usual meaning, which would be achieved by:
> y ~ I(x + z)
Adding type as a second explanatory variable (friction ~ type + leg; remember there are leg positions 1-4):
term | Estimate | Std. Error | t value | Pr(>|t|)
---|---|---|---|---
(Intercept) | 1.0539 | 0.0282 | 37.43 | 0.0000
typepush | -0.7790 | 0.0248 | -31.38 | 0.0000
legL2 | 0.1719 | 0.0457 | 3.76 | 0.0002
legL3 | 0.1605 | 0.0325 | 4.94 | 0.0000
legL4 | 0.2813 | 0.0344 | 8.18 | 0.0000
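These estimates correspond to the additive model with both type and leg; a sketch of the call:
fit_both <- lm(friction ~ type + leg, data = spider)   # additive effects of type and leg pair
summary(fit_both)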
Interaction is modeled as the product of two covariates: \[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1*x_2 \]
Interaction between coffee and time of day on performance
Image credit: http://personal.stevens.edu/~ysakamot/
symbol | example | meaning |
---|---|---|
+ | + x | include this variable |
- | - x | delete this variable |
: | x : z | include the interaction |
* | x * z | include these variables and their interactions |
^ | (u + v + w)^3 | include these variables and all interactions up to three way |
1 | -1 | the intercept; the example "-1" deletes it |
Note: order generally doesn’t matter (u+v OR v+u)
lm(y ~ u + v):
* u and v factors: ANOVA
* u and v numeric: multiple regression
* one factor, one numeric: ANCOVA
Perform regression on the spider dataset to model friction as a function of type, leg, and the interaction of type and leg. Interpret the output of this regression.
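A sketch of one way to fit this model:
fit_int <- lm(friction ~ type * leg, data = spider)   # main effects of type and leg plus the type:leg interaction
summary(fit_int)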
\[ g\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Useful for log-transformed microarray data
The model: \(y_i = E[y|x] + \epsilon_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)
Random component of \(y_i\) is normally distributed: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
Systematic component (linear predictor): \(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi}\)
Link function here is the identity link: \(g(E(y | x)) = E(y | x)\). We are modeling the mean directly, no transformation.
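As an illustration (a sketch reusing the spider data from earlier), a Gaussian GLM with identity link reproduces the ordinary linear model fit by lm():
glm(friction ~ type + leg, family = gaussian(link = "identity"), data = spider)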
Useful for binary outcomes, e.g. Single Nucleotide Polymorphisms or somatic variants
The model: \[ Logit(P(x)) = log \left( \frac{P(x)}{1-P(x)} \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Random component: \(y_i\) follows a Binomial distribution (outcome is a binary variable)
Systematic component: linear predictor \[ \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Link function: logit (log of the odds that the event occurs)
\[ g(P(x)) = logit(P(x)) = log\left( \frac{P(x)}{1-P(x)} \right) \]
\[ P(x) = g^{-1}\left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]
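In R such a model is fit with glm(); a minimal sketch with a hypothetical data frame dat containing a binary outcome y and predictors x1 and x2:
fit_logit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)
summary(fit_logit)      # coefficients are on the log-odds (logit) scale
exp(coef(fit_logit))    # exponentiate to obtain odds ratios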
The systematic part of the GLM is:
\[ log\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + log(t_i) \]
\[ f(k, \lambda) = e^{-\lambda} \frac{\lambda^k}{k!} \]
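A log-linear Poisson model with an offset can be fit as follows (a sketch with a hypothetical data frame dat containing a count outcome y, predictors x1 and x2, and an exposure such as library size t):
fit_pois <- glm(y ~ x1 + x2 + offset(log(t)),
                family = poisson(link = "log"), data = dat)
summary(fit_pois)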
\[
f(k, \lambda, \theta) = \frac{\Gamma\left(\frac{1 + \theta k}{\theta}\right)}{k! \, \Gamma\left(\frac{1}{\theta}\right)}
\left(\frac{\theta \lambda}{1+\theta \lambda}\right)^k
\left(1+\theta \lambda\right)^{-\frac{1}{\theta}}
\quad\text{for } k = 0, 1, 2, \dotsc
\]
* where \(f\) is still the probability of \(k\) events (e.g. # of reads counted),
* \(\lambda\) is still the mean number of events, so \(E[y|x]\),
* an additional dispersion parameter \(\theta\) is estimated:
  + \(\theta \rightarrow 0\): Poisson distribution
  + \(\theta \rightarrow \infty\): Gamma distribution
* The Poisson model can be considered as nested within the Negative Binomial model
  + A likelihood ratio test comparing the two models is possible
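A sketch of the Negative Binomial fit and the likelihood ratio test against the Poisson model, using MASS::glm.nb() and the same hypothetical variables as in the Poisson sketch above:
library(MASS)
fit_pois <- glm(y ~ x1 + x2 + offset(log(t)), family = poisson, data = dat)
fit_nb <- glm.nb(y ~ x1 + x2 + offset(log(t)), data = dat)   # also estimates the dispersion theta
# likelihood ratio test: the Poisson model is nested within the Negative Binomial
lrt <- 2 * (as.numeric(logLik(fit_nb)) - as.numeric(logLik(fit_pois)))
pchisq(lrt, df = 1, lower.tail = FALSE)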
The Binomial distribution has two parameters, the number of trials n and the probability of success p; in R its probability mass function is dbinom().
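For example, the probability of exactly 3 successes in 10 trials with p = 0.5 (a quick sketch):
dbinom(3, size = 10, prob = 0.5)   # ≈ 0.117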
Simulate the following vectors of p-values: p1 and p2.
1. Does it look like there are more small p-values than expected if all null hypotheses are true?
2. Use the p.adjust function to calculate Bonferroni-corrected ("bonferroni") p-values and Benjamini-Hochberg False Discovery Rate ("BH") values. For each of these simulations, how many p-values are < 0.05 for the raw, Bonferroni-adjusted, and BH-adjusted values? (A sketch of this step follows below.)
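A sketch of the adjustment and counting step for one simulated vector (here p1; its construction is part of the exercise and not shown):
sum(p1 < 0.05)                                    # raw p-values below 0.05
sum(p.adjust(p1, method = "bonferroni") < 0.05)   # Bonferroni-adjusted
sum(p.adjust(p1, method = "BH") < 0.05)           # Benjamini-Hochberg FDR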
Linear models are the basis for identifying differential expression / differential abundance.
Assumptions: residuals are independent and normally distributed, \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
Extends to binary \(Y\) (logistic regression), count \(Y\) (log-linear regression with e.g. Poisson or Negative Binomial link functions) through Generalized Linear Models
Generalized Linear Models extend linear regression to binary outcomes (logistic regression) and count outcomes (log-linear regression, e.g. Poisson or Negative Binomial).
The basis for identifying differential expression / differential abundance