Components of a linear model
Deterministic component (linear predictors)
Stochastic component (distributions)
Simple linear models
Continuous predictor (“simple linear regression”)
Binary categorical predictor (“t-test”)
Quantifying uncertainty
Confidence intervals
Hypothesis testing
\[ y_i \sim Normal(\beta_0 + \beta_{1}x_i, \sigma^2)\]
\[ i = 1, ..., N\]
Response variable (y) = deterministic part + stochastic part
Stochastic = random (or unexplained information/error)
Deterministic = systematic
For all “general linear models”, the error around the deterministic part is normally distributed
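As a concrete illustration, here is a minimal R sketch that simulates a response as a deterministic part plus normally distributed error (the parameter values below are arbitrary, chosen only for the example):

# Simulate y_i ~ Normal(beta0 + beta1 * x_i, sigma^2)
set.seed(123)
N <- 100
beta0 <- 2; beta1 <- 0.5; sigma <- 3   # arbitrary "true" values
x <- runif(N, 0, 10)
mu <- beta0 + beta1 * x                # deterministic (systematic) part
y <- rnorm(N, mean = mu, sd = sigma)   # stochastic part: Normal error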
Why are they called “linear” models?
Response is a function of explanatory variables whose effects are additive
You can still model nonlinear relationships in \(x\); the model just has to be linear in the parameters, so no parameter can appear in an exponent or multiply another parameter (the models in red below are not linear)
\[ y_i = \beta_0 + \beta_{1}x_i + \epsilon_i \]
\[ y_i = \beta_0 + \beta_{1}x^2_i + \epsilon_i \]
\[ \color{red}{y_i = \beta_{0}x^{\beta_1}_i + \epsilon_i} \]
\[ \color{red}{y_i = \beta_{0}\exp(\beta_{1}x_i) + \epsilon_i} \]
\[ y_i = \beta_0 + \beta_{1}x_i + \beta_{2}z_i + \beta_{3}x_{i}z_{i} + \epsilon_i \]
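To see that a linear model can capture a curved relationship, here is a sketch on simulated data (the x and y below are made up, not from this lecture); the quadratic model above is still linear because it is linear in the \(\beta\)s:

# A curved relationship fit with a *linear* model:
# the formula is linear in the betas, even though it is quadratic in x
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(50, sd = 2)
quad.fit <- lm(y ~ x + I(x^2))   # I() protects x^2 inside the formula
summary(quad.fit)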
Remember, don’t worry about the specific design names!
Parametric statistical models assume a probability distribution for the stochastic component
The type of response (e.g., continuous, count, binary) dictates which distribution is appropriate
Normal Distribution
The one continuous distribution we need to understand for linear models!
Unimodal and symmetric
Arises throughout nature via the Central Limit Theorem: sums and means of many small, independent effects tend toward a normal distribution
\[ f(x | \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
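To demystify the formula, one can evaluate it directly in R and compare against the built-in dnorm() (the values of mu, sigma, and x below are arbitrary):

mu <- 0; sigma <- 1; x <- 1.5
(1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))  # density by hand
dnorm(x, mean = mu, sd = sigma)                                  # built-in: same value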
\[ y_i = \beta_0 + \beta_{1}x_i + \epsilon_i \]
\[ \epsilon_i \sim Normal(0,\sigma^2) \]
We don’t know \(\beta_0\), \(\beta_1\) or \(\sigma^2\). How do we generate estimates for the parameters?
Frequentist: Maximum Likelihood
Bayesian: MCMC
We also need to quantify uncertainty in our estimates
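As a sketch of the frequentist route, the normal likelihood can be maximized directly with optim() on simulated data; lm() reaches the same \(\beta\) estimates via least squares, which coincides with maximum likelihood under normal errors:

# Maximum likelihood "by hand" on simulated data
set.seed(42)
x <- runif(50, 0, 10)
y <- 1 + 2 * x + rnorm(50, sd = 1.5)   # true beta0 = 1, beta1 = 2, sigma = 1.5
negloglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))  # exp() keeps sigma > 0
}
optim(c(0, 0, 0), negloglik)$par   # approximate beta0, beta1, log(sigma)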
Linear model with a continuous predictor (aka “linear regression”)
Continuous predictor (x = forest cover) and a continuous response variable (y = aboveground biomass)
\[ AGB_i = \beta_0 + \beta_{1}FOREST_i + \epsilon_i \] \[ \epsilon_i \sim Normal(0, \sigma^2) \]
## agb forest
## 1 540 95.7
## 2 429 78.4
## 3 259 53.2
## 4 487 82.1
## 5 155 35.2
## 6 387 65.4
540 = \(\beta_0\) * 1 + \(\beta_1\) * 95.7 + \(\epsilon_1\)
429 = \(\beta_0\) * 1 + \(\beta_1\) * 78.4 + \(\epsilon_2\)
259 = \(\beta_0\) * 1 + \(\beta_1\) * 53.2 + \(\epsilon_3\)
487 = \(\beta_0\) * 1 + \(\beta_1\) * 82.1 + \(\epsilon_4\)
155 = \(\beta_0\) * 1 + \(\beta_1\) * 35.2 + \(\epsilon_5\)
387 = \(\beta_0\) * 1 + \(\beta_1\) * 65.4 + \(\epsilon_6\)
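The code that originally built dat is not shown; it can be re-created from the printout above:

dat <- data.frame(
  agb    = c(540, 429, 259, 487, 155, 387),
  forest = c(95.7, 78.4, 53.2, 82.1, 35.2, 65.4)
)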
# Fit the regression of aboveground biomass on forest cover
out <- lm(agb ~ forest, data = dat)
summary(out)
##
## Call:
## lm(formula = agb ~ forest, data = dat)
##
## Residuals:
## 1 2 3 4 5 6
## -16.027 -13.327 -17.707 20.356 -3.407 30.112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -72.935 33.306 -2.19 0.093708 .
## forest 6.572 0.468 14.04 0.000149 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.81 on 4 degrees of freedom
## Multiple R-squared: 0.9801, Adjusted R-squared: 0.9752
## F-statistic: 197.2 on 1 and 4 DF, p-value: 0.0001492
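Once fit, the model can be used for prediction. As a quick sketch (70% forest cover is an arbitrary new value, not from the data):

# Predicted AGB at 70% forest cover, with a 95% confidence interval
predict(out, newdata = data.frame(forest = 70), interval = "confidence")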
Linear model with one categorical predictor with two levels (aka “t-test”)
Categorical predictor (x = land cover classification) and a continuous response (y = aboveground biomass)
\[ \mu_i = \beta_0 + \beta_{1}CLASS_i \]
\[ AGB_i \sim Normal(\mu_i, \sigma^2) \]
## agb forest forest.class
## 1 540 95.7 Forest
## 2 429 78.4 Forest
## 3 259 53.2 Non-forest
## 4 487 82.1 Forest
## 5 155 35.2 Non-forest
## 6 387 65.4 Forest
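The construction of the forest.class column is not shown either; from the printout it can be re-created as:

dat$forest.class <- factor(c("Forest", "Forest", "Non-forest",
                             "Forest", "Non-forest", "Forest"))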
How do we quantify a categorical predictor?
Create “dummy variables” consisting of 0’s and 1’s
The way we create dummy variables determines our interpretations
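model.matrix() reveals exactly which dummy variables R builds for each parameterization used below:

model.matrix(~ forest.class, data = dat)      # effects parameterization: intercept + dummy
model.matrix(~ forest.class - 1, data = dat)  # means parameterization: one column per level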
Effects parameterization: the intercept estimates the mean of the reference level (here “Forest”), and the other coefficient estimates the difference between levels
out.effects <- lm(agb ~ forest.class, data = dat)
summary(out.effects)
##
## Call:
## lm(formula = agb ~ forest.class, data = dat)
##
## Residuals:
## 1 2 3 4 5 6
## 79.25 -31.75 52.00 26.25 -52.00 -73.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 460.75 34.30 13.433 0.000178 ***
## forest.classNon-forest -253.75 59.41 -4.271 0.012939 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.6 on 4 degrees of freedom
## Multiple R-squared: 0.8202, Adjusted R-squared: 0.7752
## F-statistic: 18.24 on 1 and 4 DF, p-value: 0.01294
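This is why the model is nicknamed the “t-test”: a classical two-sample t-test assuming equal variances reproduces the t value and p-value of the forest.classNon-forest row above.

# Equivalent classical two-sample t-test (equal variances assumed)
t.test(agb ~ forest.class, data = dat, var.equal = TRUE)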
Means parameterization: removing the intercept (the - 1 in the formula) makes each coefficient the mean of its own level
out.means <- lm(agb ~ forest.class - 1, data = dat)
summary(out.means)
##
## Call:
## lm(formula = agb ~ forest.class - 1, data = dat)
##
## Residuals:
## 1 2 3 4 5 6
## 79.25 -31.75 52.00 26.25 -52.00 -73.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## forest.classForest 460.75 34.30 13.433 0.000178 ***
## forest.classNon-forest 207.00 48.51 4.267 0.012978 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.6 on 4 degrees of freedom
## Multiple R-squared: 0.9803, Adjusted R-squared: 0.9704
## F-statistic: 99.32 on 2 and 4 DF, p-value: 0.0003896
Confidence intervals
Three ingredients: the estimate, the standard error, and a t quantile \[ \text{Estimate} \pm t_{n-p,\ 1-\alpha/2} \times \text{Standard Error} \]
Interpretation is based on hypothetical replicate experiments: across many replicates, 95% of the intervals built this way would contain the true parameter value
confint(out.means)
## 2.5 % 97.5 %
## forest.classForest 365.51563 555.9844
## forest.classNon-forest 72.31826 341.6817
By hand
(out.info <- summary(out.means)$coefficients)
## Estimate Std. Error t value Pr(>|t|)
## forest.classForest 460.75 34.30083 13.432620 0.0001776766
## forest.classNon-forest 207.00 48.50870 4.267276 0.0129782221
N <- nrow(dat) # Number of observations.
p <- 2 # Number of parameters.
t.val <- qt(1 - 0.05 / 2, df = N - p)
(lower.ci <- out.info[, 1] - out.info[, 2] * t.val)
## forest.classForest forest.classNon-forest
## 365.51563 72.31826
(upper.ci <- out.info[, 1] + out.info[, 2] * t.val)
## forest.classForest forest.classNon-forest
## 555.9844 341.6817
Probability of obtaining a result equal to or “more extreme” than what was observed, assuming that the null hypothesis is true
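As a quick check, the p-value reported for the Non-forest coefficient in the means fit can be reproduced from its t value and the residual degrees of freedom:

# Two-sided p-value for t = 4.267276 with 4 residual degrees of freedom
2 * pt(-abs(4.267276), df = 4)   # matches 0.01298 in the summary table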
Modern statistics is leaning away from p-values. Why?
“Statistically significant” is an arbitrary threshold (read this article)
Parameter estimates and confidence intervals are more meaningful
If you do use p-values, report effect sizes alongside them
“Scientific conclusions and business or policy decisions should not be based on whether a p-value passes a specific threshold” - American Statistical Association, 2016
This week in class we will try out lab exercises covering these topics.