Levi Waldron
June 6, 2018
arcsin(sqrt(x))
phyloseq
and DESeq2
packages simplify the processtable(spider$leg,spider$type)
##
## pull push
## L1 34 34
## L2 15 15
## L3 52 52
## L4 40 40
summary(spider)
## leg type friction
## L1: 68 pull:141 Min. :0.1700
## L2: 30 push:141 1st Qu.:0.3900
## L3:104 Median :0.7600
## L4: 80 Mean :0.8217
## 3rd Qu.:1.2400
## Max. :1.8400
boxplot(spider$friction ~ spider$type * spider$leg,
col=c("grey90","grey40"), las=2,
main="Friction coefficients of different leg pairs")
Notes:
The following are examples of linear models:
\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]
Random part of model:
\(y_i = E[y_i|x_i] + \epsilon_i\)
Assumptions of linear models: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
L1
:
L2
={1 if \(2^{nd}\) leg pair, 0 otherwise},L3
={1 if \(3^{nd}\) leg pair, 0 otherwise},L4
={1 if \(4^{th}\) leg pair, 0 otherwise}.aov()
, lm()
, glm()
, and coxph()
use a “model formula” interface.> response variable ~ explanatory variables
Model formula for simple linear regression:
> y ~ x
Friction coefficient for leg type of first leg pair:
spider.sub <- spider[spider$leg=="L1", ]
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)
##
## Call:
## lm(formula = friction ~ type, data = spider.sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33147 -0.10735 -0.04941 -0.00147 0.76853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92147 0.03827 24.078 < 2e-16 ***
## typepush -0.51412 0.05412 -9.499 5.7e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2232 on 66 degrees of freedom
## Multiple R-squared: 0.5776, Adjusted R-squared: 0.5711
## F-statistic: 90.23 on 1 and 66 DF, p-value: 5.698e-14
Regression coefficients for friction ~ type
for first set of spider legs:
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 0.9215 | 0.0383 | 24.08 | 0.0000 |
typepush | -0.5141 | 0.0541 | -9.50 | 0.0000 |
Diagram of the estimated coefficients in the linear model. The green arrow indicates the Intercept term, which goes from zero to the mean of the reference group (here the ‘pull’ samples). The orange arrow indicates the difference between the push group and the pull group, which is negative in this example. The circles show the individual samples, jittered horizontally to avoid overplotting.
Remember there are positions 1-4
fit <- lm(friction ~ leg, data=spider)
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 0.6644 | 0.0538 | 12.34 | 0.0000 |
legL2 | 0.1719 | 0.0973 | 1.77 | 0.0784 |
legL3 | 0.1605 | 0.0693 | 2.32 | 0.0212 |
legL4 | 0.2813 | 0.0732 | 3.84 | 0.0002 |
Additional explanatory variables can be added as follows:
> y ~ x + z
Note that “+” does not have its usual meaning, which would be achieved by:
> y ~ I(x + z)
Remember there are positions 1-4
fit <- lm(friction ~ type + leg, data=spider)
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 1.0539 | 0.0282 | 37.43 | 0.0000 |
typepush | -0.7790 | 0.0248 | -31.38 | 0.0000 |
legL2 | 0.1719 | 0.0457 | 3.76 | 0.0002 |
legL3 | 0.1605 | 0.0325 | 4.94 | 0.0000 |
legL4 | 0.2813 | 0.0344 | 8.18 | 0.0000 |
Interaction is modeled as the product of two covariates: \[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1*x_2 \]
symbol | example | meaning |
---|---|---|
+ | + x | include this variable |
- | - x | delete this variable |
: | x : z | include the interaction |
* | x * z | include these variables and their interactions |
^ | (u + v + w)^3 | include these variables and all interactions up to three way |
1 | -1 | intercept: delete the intercept |
Note: order generally doesn’t matter (u+v OR v+u)
lm( y ~ u + v)
u
and v
factors: ANOVA
u
and v
numeric: multiple regression
one factor, one numeric: ANCOVA
Recall the multiple linear regression model:
\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)
Matrix notation for the multiple linear regression model:
\[ \, \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
or simply:
\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]
group <- factor( c(1, 1, 2, 2) )
model.matrix(~ group)
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 1
## 4 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
What if we forgot to code group as a factor?
group <- c(1, 1, 2, 2)
model.matrix(~ group)
## (Intercept) group
## 1 1 1
## 2 1 1
## 3 1 2
## 4 1 2
## attr(,"assign")
## [1] 0 1
group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group)
## (Intercept) group2 group3
## 1 1 0 0
## 2 1 0 0
## 3 1 1 0
## 4 1 1 0
## 5 1 0 1
## 6 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
group <- factor(c(1,1,2,2,3,3))
group <- relevel(x=group, ref=3)
model.matrix(~ group)
## (Intercept) group1 group2
## 1 1 1 0
## 2 1 1 0
## 3 1 0 1
## 4 1 0 1
## 5 1 0 0
## 6 1 0 0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
model.matrix(~ diet + sex)
## (Intercept) diet2 sexm
## 1 1 0 0
## 2 1 0 0
## 3 1 0 1
## 4 1 0 1
## 5 1 1 0
## 6 1 1 0
## 7 1 1 1
## 8 1 1 1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
##
## attr(,"contrasts")$sex
## [1] "contr.treatment"
model.matrix(~ diet + sex + diet:sex)
## (Intercept) diet2 sexm diet2:sexm
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 1 0
## 4 1 0 1 0
## 5 1 1 0 0
## 6 1 1 0 0
## 7 1 1 1 1
## 8 1 1 1 1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
##
## attr(,"contrasts")$sex
## [1] "contr.treatment"
Systematic component is:
\[ log(E[y|x_i]) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Or equivalently: \[ E[y|x_i] = exp \left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]
where \(E[y|x_i]\) is the expected number of counts for a microbe in subject i
\(\epsilon_i\) is typically Poisson or Negative Binomal distributed.
This is a very important distinction!
If there is evidence that the fixed parameters differ between two groups of interest, we say the results are statistically significant.
Negative Binomial Distribution has two parameters: # of trials n, and probability of success p
metagenomeSeq