Outline

  1. Components of a linear model

    • Deterministic component (linear predictors)

    • Stochastic component (distributions)

  2. Simple linear models

    • Continuous predictors (“simple linear regression”)

    • Binary categorical predictor (“t-test”)

    • Quantifying uncertainty

      • confidence intervals

      • hypothesis testing

What is this model??

\[ y_i = \beta_0 + \beta_{1}x_i + \epsilon_i\] \[ \epsilon_i \sim Normal(0,\sigma^2) \] \[ i = 1, ..., N \]
vs


\[ y_i \sim Normal(\beta_0 + \beta_{1}x_i, \sigma^2)\]

\[ i = 1, ..., N\]
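These two notations describe the same model: the first adds normally distributed errors to the deterministic part, the second says the response is drawn from a normal distribution centered on the deterministic part. A minimal simulation sketch (the parameter values, sample size, and variable names below are made up for illustration):

set.seed(1)
n <- 1000
x <- runif(n, 0, 100)
beta0 <- 10; beta1 <- 2; sigma <- 5 # sigma is the standard deviation, so the variance is sigma^2

# Notation 1: deterministic part plus normal errors
y1 <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)

# Notation 2: draw the response from a normal centered on the deterministic part
y2 <- rnorm(n, mean = beta0 + beta1 * x, sd = sigma)

# y1 and y2 are different random draws, but from the same data-generating process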

General Linear Models

Response variable (y) = deterministic part + stochastic part

For all “general linear models”, the error around the deterministic part is normally distributed

Why are they called “linear” models?

Which are linear?

\[ y_i = \beta_0 + \beta_{1}x_i + \epsilon_i \]

\[ y_i = \beta_0 + \beta_{1}x^2_i + \epsilon_i \]

\[ y_i = \beta_{0}x^{\beta_1}_i + \epsilon_i \]

\[ y_i = \beta_{0}\exp(\beta_{1}x_i) + \epsilon_i \]

\[ y_i = \beta_0 + \beta_{1}x_i + \beta_{2}z_i + \beta_{3}x_{i}z_{i} + \epsilon_i \]

Which are linear?

\[ y_i = \beta_0 + \beta_{1}x_i + \epsilon_i \]

\[ y_i = \beta_0 + \beta_{1}x^2_i + \epsilon_i \]

\[ \color{red}{y_i = \beta_{0}x^{\beta_1}_i + \epsilon_i} \]

\[ \color{red}{y_i = \beta_{0}\exp(\beta_{1}x_i) + \epsilon_i} \]

\[ y_i = \beta_0 + \beta_{1}x_i + \beta_{2}z_i + \beta_{3}x_{i}z_{i} + \epsilon_i \]
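"Linear" refers to linearity in the parameters \(\beta\), not in the predictors. That is why the squared-predictor model and the interaction model above are still linear models, while the two marked in red are not. A hedged sketch of fitting the squared-predictor model with lm() (the simulated x and y here are hypothetical):

set.seed(2)
x <- runif(50, 0, 10)
y <- 2 + 3 * x^2 + rnorm(50, mean = 0, sd = 4)

# Still a linear model: linear in beta0 and beta1, the predictor is just x squared
fit.quad <- lm(y ~ I(x^2))
coef(fit.quad)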

GLMs

Remember, don’t worry about the specific design names!

Linear Models: stochastic part

Parametric statistical models are built from probability distributions

Linear Models: stochastic part

Normal Distribution

One continuous distribution that we need to understand for linear models!

Linear Models: stochastic part

Normal Distribution

  • Unimodal and symmetric

  • Central Limit Theorem

    • The mean (or sum) of many independent random variables is approximately normally distributed
  • Examples:

    • Height
    • Biomass of trees in a forest
    • Wingspans of dragonflies




Linear Models: stochastic part

Normal Distribution

\[ f(x | \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
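As a quick check (a minimal sketch), this density can be written as an R function and compared with the built-in dnorm():

norm.pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Both lines should return the same value
norm.pdf(1.5, mu = 0, sigma = 2)
dnorm(1.5, mean = 0, sd = 2)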

Estimating the parameters

\[ y_i = \beta_0 + \beta_{1}x_i + \epsilon_i\] \[ \epsilon_i \sim Normal(0,\sigma^2) \]

We don’t know \(\beta_0\), \(\beta_1\) or \(\sigma^2\). How do we generate estimates for the parameters?
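One way to build intuition is to simulate data with known parameter values and confirm that lm(), which computes least-squares estimates, recovers values close to the truth. The parameter values and sample size below are made up for illustration:

set.seed(42)
n <- 100
x <- runif(n, 0, 50)
y <- 20 + 1.5 * x + rnorm(n, mean = 0, sd = 10) # beta0 = 20, beta1 = 1.5, sigma = 10

fit <- lm(y ~ x)
coef(fit)   # estimates of beta0 and beta1
sigma(fit)  # estimate of sigma (the residual standard error)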

Example: Continuous predictor

  • Linear model with a continuous predictor (aka “linear regression”)

  • Continuous predictor (x = forest cover) and a continuous response variable (y = aboveground biomass)

Example: Continuous predictor

\[ \mu_i = \beta_0 + \beta_{1}FOREST_i \] \[ AGB_i \sim Normal(\mu_i, \sigma^2) \]
or


\[ AGB_i = \beta_0 + \beta_{1}FOREST_i + \epsilon_i \] \[ \epsilon_i \sim Normal(0, \sigma^2) \]

Data

##   agb forest
## 1 540   95.7
## 2 429   78.4
## 3 259   53.2
## 4 487   82.1
## 5 155   35.2
## 6 387   65.4

Data

\[ AGB_i = \beta_0 + \beta_{1}FOREST_i + \epsilon_i \]


##   agb forest
## 1 540   95.7
## 2 429   78.4
## 3 259   53.2
## 4 487   82.1
## 5 155   35.2
## 6 387   65.4


540 = \(\beta_0\) * 1 + \(\beta_1\) * 95.7 + \(\epsilon_1\)

429 = \(\beta_0\) * 1 + \(\beta_1\) * 78.4 + \(\epsilon_2\)

259 = \(\beta_0\) * 1 + \(\beta_1\) * 53.2 + \(\epsilon_3\)

487 = \(\beta_0\) * 1 + \(\beta_1\) * 82.1 + \(\epsilon_4\)

155 = \(\beta_0\) * 1 + \(\beta_1\) * 35.2 + \(\epsilon_5\)

387 = \(\beta_0\) * 1 + \(\beta_1\) * 65.4 + \(\epsilon_6\)
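In matrix form these six equations are \(y = X\beta + \epsilon\), where \(X\) has a column of 1s (for the intercept) and a column of forest-cover values, and the least-squares estimates are \((X^TX)^{-1}X^Ty\). A minimal sketch, assuming dat is the data frame printed above:

X <- cbind(1, dat$forest) # design matrix: intercept column and forest cover
y <- dat$agb

# Least-squares estimates of beta0 and beta1 (should match lm() on the next slide)
solve(t(X) %*% X) %*% t(X) %*% y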

Analysis

out <- lm(agb ~ forest, data = dat)
summary(out)
## 
## Call:
## lm(formula = agb ~ forest, data = dat)
## 
## Residuals:
##       1       2       3       4       5       6 
## -16.027 -13.327 -17.707  20.356  -3.407  30.112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -72.935     33.306   -2.19 0.093708 .  
## forest         6.572      0.468   14.04 0.000149 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.81 on 4 degrees of freedom
## Multiple R-squared:  0.9801, Adjusted R-squared:  0.9752 
## F-statistic: 197.2 on 1 and 4 DF,  p-value: 0.0001492

Plot the data
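A minimal sketch of plotting the data with the fitted regression line (assuming the dat data frame and the fitted out object from the previous slides):

# Scatterplot of biomass against forest cover
plot(agb ~ forest, data = dat, pch = 19,
     xlab = "Forest cover", ylab = "Aboveground biomass")

# Add the fitted regression line from the lm() fit
abline(out)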

Assumptions of the simple linear model

  1. The predictor \(x_i\) is linearly related to the response variable \(y_i\)
  2. The errors (residuals), \(\epsilon_i\), are independent and identically distributed
  3. The errors, \(\epsilon_i\), have a constant variance \(\sigma^2\)
  4. If performing inference, the errors, \(\epsilon_i\), are normally distributed

Residual diagnostics
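A minimal sketch of common residual diagnostics in base R, using the fitted out object (which specific panels the example figures show is not specified here):

# Residuals vs fitted values: look for non-linearity and non-constant variance
plot(fitted(out), resid(out), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals: check the normality assumption
qqnorm(resid(out))
qqline(resid(out))

# Or use lm's built-in diagnostic plots
par(mfrow = c(2, 2))
plot(out)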

Residual diagnostics examples

Residual diagnostics examples

Residual diagnostics

Example: Categorical predictor

\[ \mu_i = \beta_0 + \beta_{1}CLASS_i \]

\[ AGB_i \sim Normal(\mu_i, \sigma^2) \]

Data

##   agb forest forest.class
## 1 540   95.7       Forest
## 2 429   78.4       Forest
## 3 259   53.2   Non-forest
## 4 487   82.1       Forest
## 5 155   35.2   Non-forest
## 6 387   65.4       Forest

Figure

Categorical Predictors: Parameterizations

Effects Parameterization


  • \(\beta_1\) is the difference between group 2 and group 1 (the “baseline” group)
  • Interpretation of \(\beta_1\) is dependent on the baseline group
  • Ex: compare fledgling survival in non-protected vs protected areas

Analysis

Effects parameterization

out.effects <- lm(agb ~ forest.class, data = dat)
summary(out.effects)
## 
## Call:
## lm(formula = agb ~ forest.class, data = dat)
## 
## Residuals:
##      1      2      3      4      5      6 
##  79.25 -31.75  52.00  26.25 -52.00 -73.75 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              460.75      34.30  13.433 0.000178 ***
## forest.classNon-forest  -253.75      59.41  -4.271 0.012939 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.6 on 4 degrees of freedom
## Multiple R-squared:  0.8202, Adjusted R-squared:  0.7752 
## F-statistic: 18.24 on 1 and 4 DF,  p-value: 0.01294
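To see how R encodes the effects parameterization, inspect the design matrix: the second column is an indicator that is 1 for Non-forest observations and 0 for the baseline (Forest) group. A sketch using the out.effects fit from above:

model.matrix(out.effects)

# Intercept = baseline (Forest) group mean; adding beta1 gives the Non-forest group mean
coef(out.effects)[1] + coef(out.effects)[2]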

Figure

Means Parameterization



  • All coefficients represent a mean of a group
  • \(\beta_0\) and \(\beta_1\) are group means for the first and second group, respectively
  • Don’t need to specify a baseline group
  • Ex: compare fledgling survival in protected vs non-protected areas
  • Often used for estimating group means rather than for hypothesis testing, since no coefficient directly represents a difference between groups

Analysis

Means parameterization

out.means <- lm(agb ~ forest.class - 1, data = dat)
summary(out.means)
## 
## Call:
## lm(formula = agb ~ forest.class - 1, data = dat)
## 
## Residuals:
##      1      2      3      4      5      6 
##  79.25 -31.75  52.00  26.25 -52.00 -73.75 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## forest.classForest       460.75      34.30  13.433 0.000178 ***
## forest.classNon-forest   207.00      48.51   4.267 0.012978 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.6 on 4 degrees of freedom
## Multiple R-squared:  0.9803, Adjusted R-squared:  0.9704 
## F-statistic: 99.32 on 2 and 4 DF,  p-value: 0.0003896
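Because each coefficient in the means parameterization is just a group mean, you can verify them directly from the data (a sketch using dat and out.means):

# Group means of agb computed directly from the data
tapply(dat$agb, dat$forest.class, mean)

# These should match the coefficients from the means parameterization
coef(out.means)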

Figure

Confidence Intervals

  • Three ingredients:

    • parameter estimate
    • standard error
    • t value
  • Estimate \(\pm\ t_{n-p,\ 1-\alpha/2} \times\) standard error

  • Based on hypothetical replicate experiments



Example: Confidence Intervals

confint(out.means)
##                            2.5 %   97.5 %
## forest.classForest     365.51563 555.9844
## forest.classNon-forest  72.31826 341.6817

Example: Confidence Intervals

By hand

(out.info <- summary(out.means)$coefficients)
##                        Estimate Std. Error   t value     Pr(>|t|)
## forest.classForest       460.75   34.30083 13.432620 0.0001776766
## forest.classNon-forest   207.00   48.50870  4.267276 0.0129782221
N <- nrow(dat) # Number of observations.
p <- 2 # Number of parameters.
t.val <- qt(1 - 0.05 / 2, df = N - p)
(lower.ci <- out.info[, 1] - out.info[, 2] * t.val)
##     forest.classForest forest.classNon-forest 
##              365.51563               72.31826
(upper.ci <- out.info[, 1] + out.info[, 2] * t.val)
##     forest.classForest forest.classNon-forest 
##               555.9844               341.6817

P values
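The p-values in the summary() output come from comparing each t value (estimate divided by its standard error) to a t distribution with \(n - p\) degrees of freedom. A minimal sketch of computing them by hand, reusing out.info, N, and p from the confidence-interval example:

# Two-sided p-value: probability of a t statistic at least this extreme if the true coefficient were 0
t.vals <- out.info[, "t value"]
2 * pt(-abs(t.vals), df = N - p)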

In Class Lab

This week in class we will try out the lab exercises to do the following:

  1. Simple linear model with a continuous predictor
  2. Simple linear model with a two-level categorical predictor