Outline

Multiple linear regression
- Continuous and categorical predictors
- Interactions
Model formulae
Design matrix
Generalized Linear Models
- Linear, logistic, log-Linear links
- Poisson, Negative Binomial error distributions
- Zero inflation

Linear modeling for metagenomic data: Two main approaches (1)

normalizing transformation, ordinary linear modeling
- calculate relative abundance, dividing by the total number of counts for each sample (account for different sequencing depths)
- variance-stabilizing transformation of features, arcsin(sqrt(x))
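
The two steps above can be sketched in a few lines of base R. This is a minimal illustration on a made-up count table; the object names (`counts`, `relab`, `relab_asin`) and all values are hypothetical.

```r
# Toy count table: 3 taxa (rows) x 4 samples (columns); values are made up
counts <- matrix(c(10, 0, 90,
                   200, 50, 750,
                   5, 5, 40,
                   300, 100, 600),
                 nrow = 3,
                 dimnames = list(paste0("taxon", 1:3), paste0("sample", 1:4)))

# Relative abundance: divide each column by its total count (sequencing depth)
depth <- colSums(counts)
relab <- sweep(counts, 2, depth, "/")

# Variance-stabilizing transformation for proportions
relab_asin <- asin(sqrt(relab))

colSums(relab)   # every sample now sums to 1
```

After this transformation each feature can be passed to lm() or PCA as an ordinary continuous variable.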

Advantages
- simplicity: can directly use PCA, linear models, non-parametric tests
Disadvantages
- data may still not be very normally distributed
- regression coefficients for arcsin-sqrt transformed data not easily interpretable

Two main approaches (2)

treat as count data, log-linear generalized linear model (GLM)
- log-linear systematic component
- typically negative binomially-distributed random component
- model can include an “offset” term to account for different sequencing depths
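
A minimal sketch of the offset idea, on simulated data. To stay within base R it assumes a Poisson rather than negative binomial random component (MASS::glm.nb() accepts the same kind of formula); the depths, rates, and variable names are hypothetical.

```r
set.seed(1)
n <- 200
group <- factor(rep(c("control", "case"), each = n / 2),
                levels = c("control", "case"))
depth <- round(runif(n, 1e4, 1e5))             # hypothetical sequencing depths
rate  <- ifelse(group == "case", 2e-3, 1e-3)   # true relative abundance doubles in cases
y     <- rpois(n, lambda = depth * rate)       # observed counts scale with depth

# offset(log(depth)) fixes the coefficient of log(depth) at 1,
# so the remaining coefficients describe counts per unit of depth
fit_offset <- glm(y ~ group + offset(log(depth)), family = poisson(link = "log"))

exp(coef(fit_offset))   # intercept near 1e-3 (baseline rate), groupcase near 2 (fold change)
```

Without the offset, samples sequenced more deeply would look as if they carried more of the microbe.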

Advantages
- GLM framework provides great flexibility to deal with sequencing depth, over-dispersion
- coefficients are readily interpretable in “multiplicative” models
- phyloseq and DESeq2 packages simplify the process
Disadvantages
- models are more complicated

Multiple Linear Regression Model (approach 1)

Example: friction of spider legs

Wolff & Gorb, Radial arrangement of Janus-like setae permits friction control in spiders, Sci. Rep. 2013.

(A) Barplot showing total claw tuft area of the corresponding legs.
(B) Boxplot presenting friction coefficient data illustrating median, interquartile range and extreme values.

Example: friction of spider legs

Are the pulling and pushing friction coefficients different?
Are the friction coefficients different for the different leg pairs?
Does the difference between pulling and pushing friction coefficients vary by leg pair?

Example: friction of spider legs

table(spider$leg,spider$type)

##     
##      pull push
##   L1   34   34
##   L2   15   15
##   L3   52   52
##   L4   40   40

summary(spider)

##  leg        type        friction     
##  L1: 68   pull:141   Min.   :0.1700  
##  L2: 30   push:141   1st Qu.:0.3900  
##  L3:104              Median :0.7600  
##  L4: 80              Mean   :0.8217  
##                      3rd Qu.:1.2400  
##                      Max.   :1.8400

Example: friction of spider legs

boxplot(spider$friction ~ spider$type * spider$leg,
        col=c("grey90","grey40"), las=2,
        main="Friction coefficients of different leg pairs")

Example: friction of spider legs

Notes:

Pulling friction is higher
Pulling (but not pushing) friction increases for legs further back (L1 to L4)
Variance isn’t constant

What are linear models?

The following are examples of linear models:

\(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\) (simple linear regression)
\(Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i\) (quadratic regression)
\(Y_i = \beta_0 + \beta_1 x_i + \beta_2 \times 2^{x_i} + \varepsilon_i\) (\(2^{x_i}\) is a new transformed variable)
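
These models are "linear" because they are linear in the coefficients, even when the predictors are transformed. A small sketch on simulated data (the true coefficients 1, 0.5 and 2 are arbitrary choices for illustration):

```r
set.seed(2)
x <- seq(-2, 2, length.out = 50)
y <- 1 + 0.5 * x + 2 * x^2 + rnorm(50, sd = 0.1)

# Quadratic regression: I() protects x^2 from the formula meaning of "^"
fit_quad <- lm(y ~ x + I(x^2))

# Any transformed predictor works the same way, e.g. 2^x
fit_exp <- lm(y ~ x + I(2^x))

coef(fit_quad)   # estimates should be close to 1, 0.5 and 2
```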

Multiple linear regression model

Linear models can have any number of predictors
Systematic part of model:

\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]

\(E[y|x]\) is the expected value of \(y\) given \(x\)
\(y\) is the outcome, response, or dependent variable
\(x\) is the vector of predictors / independent variables
\(x_1, ..., x_p\) are the individual predictors or independent variables
\(\beta_0, \beta_1, ..., \beta_p\) are the regression coefficients

Multiple linear regression model

Random part of model:

\(y_i = E[y_i|x_i] + \epsilon_i\)

Assumptions of linear models: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)

Normal distribution
Mean zero at every value of predictors
Constant variance at every value of predictors
Values that are statistically independent

Continuous predictors

Coding: as-is, or may be scaled to unit variance (which rescales the regression coefficients accordingly)
Interpretation for linear regression: An increase of one unit of the predictor results in this much difference in the continuous outcome variable

Binary predictors (2 levels)

Coding: indicator or dummy variable (0-1 coding)
Interpretation for linear regression: the increase or decrease in average outcome levels in the group coded “1”, compared to the reference category (“0”)
e.g. \(E(y|x) = \beta_0 + \beta_1 x\)
where x={ 1 if push friction, 0 if pull friction }

Multilevel categorical predictors (ordinal or nominal)

Coding: \(K-1\) dummy variables for \(K\)-level categorical variable
Comparisons with respect to a reference category, e.g. L1:
- L2={1 if \(2^{nd}\) leg pair, 0 otherwise},
- L3={1 if \(3^{rd}\) leg pair, 0 otherwise},
- L4={1 if \(4^{th}\) leg pair, 0 otherwise}.
R re-codes factors to dummy variables automatically.
Dummy coding depends on whether factor is ordered or not.
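
A short sketch of the automatic recoding, using a toy factor with one observation per leg pair. By default R uses treatment (dummy) contrasts for unordered factors and polynomial contrasts for ordered factors, which is why the column names differ.

```r
leg <- factor(c("L1", "L2", "L3", "L4"))
colnames(model.matrix(~ leg))
# "(Intercept)" "legL2" "legL3" "legL4"

leg_ord <- factor(c("L1", "L2", "L3", "L4"), ordered = TRUE)
colnames(model.matrix(~ leg_ord))
# "(Intercept)" "leg_ord.L" "leg_ord.Q" "leg_ord.C"
```

For the ordered factor the columns are linear, quadratic and cubic trends rather than comparisons with L1, so the coefficients need a different interpretation.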

Model formulae in R

Model formulae tutorial

Regression functions in R such as aov(), lm(), glm(), and coxph() use a “model formula” interface.
The formula determines the model that will be built (and tested) by the R procedure. The basic format is:

> response variable ~ explanatory variables

The tilde means “is modeled by” or “is modeled as a function of.”

Regression with a single predictor

Model formula for simple linear regression:

> y ~ x

where “x” is the explanatory (independent) variable
“y” is the response (dependent) variable.

Return to the spider legs

Friction coefficient for leg type of first leg pair:

spider.sub <- spider[spider$leg=="L1", ]
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)

## 
## Call:
## lm(formula = friction ~ type, data = spider.sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33147 -0.10735 -0.04941 -0.00147  0.76853 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.92147    0.03827  24.078  < 2e-16 ***
## typepush    -0.51412    0.05412  -9.499  5.7e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2232 on 66 degrees of freedom
## Multiple R-squared:  0.5776, Adjusted R-squared:  0.5711 
## F-statistic: 90.23 on 1 and 66 DF,  p-value: 5.698e-14

Regression on spider leg type

Regression coefficients for friction ~ type for first set of spider legs:

fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	0.9215	0.0383	24.08	0.0000
typepush	-0.5141	0.0541	-9.50	0.0000

How to interpret this table?
- Coefficients for (Intercept) and typepush
- Each estimate divided by its standard error (the t value) follows a t distribution when the model assumptions hold
- Variance in the estimates of each coefficient can be calculated

Interpretation of spider leg type coefficients

Diagram of the estimated coefficients in the linear model. The green arrow indicates the Intercept term, which goes from zero to the mean of the reference group (here the ‘pull’ samples). The orange arrow indicates the difference between the push group and the pull group, which is negative in this example. The circles show the individual samples, jittered horizontally to avoid overplotting.
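
The same interpretation can be checked numerically. The sketch below uses simulated data rather than the spider dataset (the group means 0.9 and 0.4 are made-up values), and confirms that the intercept is the mean of the reference group and the second coefficient is the difference in group means.

```r
set.seed(3)
type <- factor(rep(c("pull", "push"), each = 30))
friction <- ifelse(type == "pull", 0.9, 0.4) + rnorm(60, sd = 0.2)

fit_sim <- lm(friction ~ type)
group_means <- tapply(friction, type, mean)

coef(fit_sim)[["(Intercept)"]]   # equals group_means[["pull"]]
coef(fit_sim)[["typepush"]]      # equals group_means[["push"]] - group_means[["pull"]]
```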

Regression on spider leg position

Remember there are positions 1-4

fit <- lm(friction ~ leg, data=spider)

fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	0.6644	0.0538	12.34	0.0000
legL2	0.1719	0.0973	1.77	0.0784
legL3	0.1605	0.0693	2.32	0.0212
legL4	0.2813	0.0732	3.84	0.0002

Interpretation of the dummy variables legL2, legL3, legL4 ?

Regression with multiple predictors

Additional explanatory variables can be added as follows:

> y ~ x + z

Note that “+” does not have its usual meaning, which would be achieved by:

> y ~ I(x + z)
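
A quick sketch of the difference, on simulated data with arbitrary coefficients: the first formula estimates a separate coefficient for each of x and z, while the second estimates a single coefficient for their arithmetic sum.

```r
set.seed(4)
x <- rnorm(20)
z <- rnorm(20)
y <- 1 + 2 * x - z + rnorm(20)

names(coef(lm(y ~ x + z)))      # "(Intercept)" "x" "z"
names(coef(lm(y ~ I(x + z))))   # "(Intercept)" "I(x + z)"
```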

Regression on spider leg type and position

Remember there are positions 1-4

fit <- lm(friction ~ type + leg, data=spider)

fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	1.0539	0.0282	37.43	0.0000
typepush	-0.7790	0.0248	-31.38	0.0000
legL2	0.1719	0.0457	3.76	0.0002
legL3	0.1605	0.0325	4.94	0.0000
legL4	0.2813	0.0344	8.18	0.0000

This model still doesn’t represent how the friction differences between leg positions are modified by whether the leg is pulling or pushing.

Interaction (effect modification)

Interaction is modeled as the product of two covariates: \[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 \]
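
As a worked case, assuming both \(x_1\) and \(x_2\) are binary (0/1) indicators, the interaction model gives four expected values:

\[ \begin{aligned} E[y|x_1=0, x_2=0] &= \beta_0 \\ E[y|x_1=1, x_2=0] &= \beta_0 + \beta_1 \\ E[y|x_1=0, x_2=1] &= \beta_0 + \beta_2 \\ E[y|x_1=1, x_2=1] &= \beta_0 + \beta_1 + \beta_2 + \beta_{12} \end{aligned} \]

So \(\beta_{12}\) is the amount by which the effect of \(x_1\) differs between \(x_2=1\) and \(x_2=0\).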

Model formulae (cont’d)

symbol	example	meaning
+	+ x	include this variable
-	- x	delete this variable
:	x : z	include the interaction
*	x * z	include these variables and their interactions
^	(u + v + w)^3	include these variables and all interactions up to three way
1	-1	intercept: delete the intercept

Note: order generally doesn’t matter (u+v OR v+u)
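
One way to see what a formula expands to, without fitting anything, is to inspect its terms. A small sketch (the variable names are placeholders and need not exist as data):

```r
attr(terms(y ~ x * z), "term.labels")            # "x" "z" "x:z"
attr(terms(y ~ x * z - x:z), "term.labels")      # "x" "z"
attr(terms(y ~ (u + v + w)^2), "term.labels")    # "u" "v" "w" "u:v" "u:w" "v:w"
attr(terms(y ~ x - 1), "intercept")              # 0, i.e. no intercept
```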

Summary: types of standard linear models

lm( y ~ u + v)

u and v factors: ANOVA
u and v numeric: multiple regression
one factor, one numeric: ANCOVA

R does a lot for you based on your variable classes
- be sure you know the classes of your variables
- be sure all rows of your regression output make sense

The Design Matrix

Recall the multiple linear regression model:

\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)

\(x_{ji}\) is the value of predictor \(x_j\) for observation \(i\)

The Design Matrix

Matrix notation for the linear regression model (shown here with a single predictor):

\[ \, \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots&\vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]

or simply:

\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]

The design matrix is \(\mathbf{X}\)
- which the computer will take as a given when solving for \(\boldsymbol{\beta}\) by minimizing the sum of squares of residuals \(\boldsymbol{\varepsilon}\).
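
A minimal sketch of that least-squares step on simulated data: solving the normal equations \(\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{Y}\) by hand gives the same coefficients as lm(). (lm() itself uses a QR decomposition, which is numerically more stable; the explicit solve here is only for illustration.)

```r
set.seed(5)
x <- rnorm(30)
y <- 2 + 3 * x + rnorm(30)

X <- model.matrix(~ x)                             # design matrix: column of 1s and x
beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) beta = X'y

fit_ls <- lm(y ~ x)
all.equal(as.vector(beta_hat), unname(coef(fit_ls)))   # TRUE
```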

Choice of design matrix

There are multiple possible and reasonable design matrices for a given study design.
The model formula encodes a default model matrix, e.g.:

group <- factor( c(1, 1, 2, 2) )
model.matrix(~ group)

##   (Intercept) group2
## 1           1      0
## 2           1      0
## 3           1      1
## 4           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Choice of design matrix

What if we forgot to code group as a factor?

group <- c(1, 1, 2, 2)
model.matrix(~ group)

##   (Intercept) group
## 1           1     1
## 2           1     1
## 3           1     2
## 4           1     2
## attr(,"assign")
## [1] 0 1

More groups, still one variable

group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group)

##   (Intercept) group2 group3
## 1           1      0      0
## 2           1      0      0
## 3           1      1      0
## 4           1      1      0
## 5           1      0      1
## 6           1      0      1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Changing the baseline group

group <- factor(c(1,1,2,2,3,3))
group <- relevel(x=group, ref=3)
model.matrix(~ group)

##   (Intercept) group1 group2
## 1           1      1      0
## 2           1      1      0
## 3           1      0      1
## 4           1      0      1
## 5           1      0      0
## 6           1      0      0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

More than one variable

diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
model.matrix(~ diet + sex)

##   (Intercept) diet2 sexm
## 1           1     0    0
## 2           1     0    0
## 3           1     0    1
## 4           1     0    1
## 5           1     1    0
## 6           1     1    0
## 7           1     1    1
## 8           1     1    1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
## 
## attr(,"contrasts")$sex
## [1] "contr.treatment"

With an interaction term

model.matrix(~ diet + sex + diet:sex)

##   (Intercept) diet2 sexm diet2:sexm
## 1           1     0    0          0
## 2           1     0    0          0
## 3           1     0    1          0
## 4           1     0    1          0
## 5           1     1    0          0
## 6           1     1    0          0
## 7           1     1    1          1
## 8           1     1    1          1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
## 
## attr(,"contrasts")$sex
## [1] "contr.treatment"

Summary: applications of model matrices

Major differential expression packages recognize them:
- limma (voom for RNA-seq)
- DESeq2 for all kinds of count data
- edgeR
Can fit coefficients directly to your contrast of interest
- e.g.: what is the difference between push/pull friction for each spider-leg pair?
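
One possible design for that question, sketched on simulated data rather than the spider dataset (the data frame `sim` and its effect sizes are hypothetical): the formula `~ leg + leg:type` yields one push-versus-pull coefficient per leg pair.

```r
set.seed(6)
sim <- expand.grid(rep = 1:10,
                   type = c("pull", "push"),
                   leg = c("L1", "L2", "L3", "L4"))
sim$friction <- 1 + 0.1 * as.integer(sim$leg) -
  0.5 * (sim$type == "push") + rnorm(nrow(sim), sd = 0.2)

# Without a main effect for type, leg:type expands to one push-vs-pull term per leg
fit_nested <- lm(friction ~ leg + leg:type, data = sim)
push_vs_pull <- coef(fit_nested)[paste0("leg", levels(sim$leg), ":typepush")]

# The same differences computed directly from the cell means
cell <- tapply(sim$friction, list(sim$leg, sim$type), mean)
cbind(coefficient = push_vs_pull, direct = cell[, "push"] - cell[, "pull"])
```

Each of these coefficients comes with its own standard error and p-value in summary(fit_nested), which is the practical advantage over computing the differences by hand.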

Generalized Linear Models (approach 2)

Generalized Linear Models

Linear regression is a special case of a broad family of models called “Generalized Linear Models” (GLM)
This unifying approach allows a large set of models to be fitted using maximum likelihood estimation (MLE) (Nelder & Wedderburn, 1972)
Can model many types of data directly using appropriate random distribution and “link” function
- Transformations of \(Y\) not needed

Components of GLM

Random component specifies the conditional distribution for the response variable
- e.g. normal, Poisson, Negative Binomial…
Systematic component specifies linear function of predictors (linear predictor)
Link [denoted by g(.)] specifies the relationship between the expected value of the random component and the systematic component
- can be linear or nonlinear
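
In R, the random component and link are bundled in a "family" object passed to glm(). The sketch below shows the default links of three families and, on simulated data, that a Gaussian GLM with identity link reproduces ordinary linear regression.

```r
poisson()$link    # "log"
gaussian()$link   # "identity"
binomial()$link   # "logit"

set.seed(7)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)

fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))
all.equal(coef(fit_lm), coef(fit_glm))   # TRUE
```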

Log-linear models

Systematic component is:

\[ \log(E[y|x_i]) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]

Or equivalently: \[ E[y|x_i] = \exp \left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]

where \(E[y|x_i]\) is the expected number of counts for a microbe in subject i

Systematic plus random components:

\(y_i = E[y|x_i] + \epsilon_i\)

\(\epsilon_i\) is typically Poisson or Negative Binomial distributed.

Note: Modeling \(\log(E[y|x_i])\) is not equivalent to modeling \(E[\log(y)|x_i]\)
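
A three-number example makes the difference concrete: the log of the mean and the mean of the logs are not the same quantity (the values are arbitrary).

```r
y <- c(1, 10, 100)

log(mean(y))   # log of the mean: log(37), about 3.61
mean(log(y))   # mean of the logs: log(10), about 2.30
```

A log-linear GLM targets the first quantity, whereas linear regression on log-transformed counts targets the second.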

Additive vs. Multiplicative models

Linear regression is an additive model
- e.g. for two binary variables \(\beta_1 = 1.5\), \(\beta_2 = 1.5\).
- If \(x_1=1\) and \(x_2=1\), this adds 3.0 to \(E(y|x)\)
Log-linear models are multiplicative:
- If \(x_1=1\) and \(x_2=1\), this adds 3.0 to \(log(E[y_i])\)
- Expected count increases 20-fold: \(exp(1.5+1.5)\) or \(exp(1.5) * exp(1.5)\)
- Coefficients are invariant to multiplicative scaling of the data
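
The arithmetic above, plus a small check of the scaling claim on made-up counts. In this sketch, multiplying all counts by 10 leaves the group coefficient unchanged and only shifts the intercept by log(10).

```r
beta1 <- 1.5
beta2 <- 1.5
exp(beta1 + beta2)        # about 20.09
exp(beta1) * exp(beta2)   # the same value, as a product of two fold changes

y <- c(2, 4, 3, 12, 9, 15)   # toy counts
g <- c(0, 0, 0, 1, 1, 1)     # binary group indicator
b_raw    <- coef(glm(y ~ g, family = poisson()))
b_scaled <- coef(glm(10 * y ~ g, family = poisson()))

b_raw[["g"]]                       # log(12 / 3) = log(4)
b_scaled[["g"]]                    # unchanged by the scaling
b_scaled[[1]] - b_raw[[1]]         # log(10): only the intercept moves
```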

This is a very important distinction!

Demystifying error models

In regression we model observations as coming from a random distribution with fixed parameters:
- linear regression: normal distribution, with mean and standard deviation
- log-linear models: Poisson distribution with mean \(\lambda\), or Negative Binomial distribution with parameters \(n\) and \(p\)

If there is evidence that the fixed parameters differ between two groups of interest, we say the results are statistically significant.

Poisson model

In the Poisson distribution, the variance is equal to the mean.
i.e. if the mean number of a microbe across all samples is 4, then variance is also 4 and the standard deviation is 2.
The Poisson model fits poorly when the variance exceeds the mean (over-dispersion)
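
A quick simulation of the example above, with a mean of 4 (the sample size and seed are arbitrary):

```r
set.seed(8)
counts_pois <- rpois(1e5, lambda = 4)

mean(counts_pois)   # close to 4
var(counts_pois)    # also close to 4
sd(counts_pois)     # close to 2
```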

Visualizing the Poisson Distribution

Poisson distribution has one parameter:
- mean \(\lambda\) is greater than 0
- variance is also \(\lambda\)

Negative binomial distribution

The binomial distribution is the number of successes in n trials:
- Roll a die ten times, how many times do you see a 6?
The negative binomial distribution is the number of successes observed before the r-th failure:
- How many non-6 rolls do you see before you have rolled a 6 ten times?
- Note that the number of rolls is no longer fixed.
- In this example, r=10, p=5/6, and a 6 is a “failure”
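
The dice example can be worked through with R's dnbinom(). Note that R labels the outcomes the other way round from this slide: dnbinom() counts "failures" before `size` "successes" with success probability `prob`, so in the sketch below rolling a 6 is R's "success" with prob = 1/6.

```r
# Probability of seeing k non-6 rolls before the tenth 6, for k = 0, 1, ..., 2000
non_six   <- 0:2000
p_non_six <- dnbinom(non_six, size = 10, prob = 1/6)

expected_non_six <- sum(non_six * p_non_six)   # about 50
expected_rolls   <- expected_non_six + 10      # about 60 rolls in total
```

The truncation at 2000 is an arbitrary cut-off; the probability mass beyond it is negligible.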

Visualizing the Negative Binomial Distribution

Compare Poisson vs. Negative Binomial

Negative Binomial Distribution has two parameters: the number of failures to wait for (r above, often written n), and the probability of success p
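
A simulation sketch of the comparison, assuming the mean/dispersion parameterization that rnbinom() offers through its `mu` and `size` arguments (variance = mu + mu^2 / size). The values mu = 4 and size = 2 are arbitrary.

```r
set.seed(9)
pois <- rpois(1e5, lambda = 4)
nb   <- rnbinom(1e5, size = 2, mu = 4)

c(mean = mean(pois), var = var(pois))   # mean and variance both near 4
c(mean = mean(nb),   var = var(nb))     # same mean, variance near 4 + 4^2 / 2 = 12
```

The extra parameter lets the negative binomial accommodate over-dispersed counts that the Poisson cannot.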

Zero-inflated models

Two-step model:
1. logistic model to determine whether count is zero or Poisson/NB
2. Poisson or NB regression model for the counts \(y_i\) not set to zero by step 1.
Not currently supported by DESeq2, edgeR, limma (as far as I know)
- but supported by metagenomeSeq
Warning: be aware what your logistic model is
- best to keep it intercept-only
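
A simulation sketch of the two-step idea with an intercept-only zero component, i.e. a single probability of a structural zero shared by all samples. The values `pi_zero = 0.3` and `lambda = 4` are hypothetical; this only generates data and does not fit a zero-inflated model.

```r
set.seed(10)
n       <- 1e5
pi_zero <- 0.3   # hypothetical probability of a structural zero (step 1)
lambda  <- 4     # Poisson mean for the remaining observations (step 2)

is_structural_zero <- rbinom(n, size = 1, prob = pi_zero) == 1
y_zip <- ifelse(is_structural_zero, 0L, rpois(n, lambda))

mean(y_zip == 0)                             # observed fraction of zeros
pi_zero + (1 - pi_zero) * dpois(0, lambda)   # expected under the mixture, about 0.313
dpois(0, lambda)                             # expected under plain Poisson, about 0.018
```

Zeros arise from both steps, which is why a plain Poisson with the same mean badly underestimates how many zeros will be observed.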

Poisson Distribution with Zero Inflation

Summary

Log-linear GLMs are preferred for inference from 16S rRNA and shotgun metagenomic data
Be aware of both models being fitted in a zero-inflated mixture / hurdle model
- a log-linear model and a logistic model
- keep the latter intercept-only unless you have good reason to do otherwise

Lecture: linear modeling for microbiome data in R/Bioconductor
