Components of a linear model
Deterministic component (linear predictors)
Stochastic component (distributions)
Simple linear models
Continuous predictor (“simple linear regression”)
Binary categorical predictor (“t-test”)
Quantifying uncertainty
Confidence intervals
Hypothesis testing
\[ y_i \sim Normal(\beta_0 + \beta_{1}x_i, \sigma^2)\]
\[ i = 1, ..., N\]
Response variable (y) = deterministic part + stochastic part
Stochastic = random (or unexplained information/error)
Deterministic = systematic
For all “general linear models”, the error around the deterministic part is normally distributed
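As a concrete illustration, here is a minimal R sketch that simulates a response as a deterministic part plus normally distributed error (the parameter values below are arbitrary, chosen only for the example):

# Simulate y_i ~ Normal(beta0 + beta1 * x_i, sigma^2)
set.seed(123)
N <- 100
beta0 <- 2; beta1 <- 0.5; sigma <- 3   # arbitrary "true" values
x <- runif(N, 0, 10)
mu <- beta0 + beta1 * x                # deterministic (systematic) part
y <- rnorm(N, mean = mu, sd = sigma)   # stochastic part: Normal error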
Why are they called “linear” models?
Response is a function of explanatory variables whose effects are additive
You can still model nonlinear relationships in \(x\); the model just has to be linear in the parameters, so no parameter can appear in an exponent or multiply another parameter (the models in red below are not linear)
\[ y_i = \beta_0 + \beta_{1}x_i + \epsilon_i \]
\[ y_i = \beta_0 + \beta_{1}x^2_i + \epsilon_i \]
\[ \color{red}{y_i = \beta_{0}x^{\beta_1}_i + \epsilon_i} \]
\[ \color{red}{y_i = \beta_{0}\exp(\beta_{1}x_i) + \epsilon_i} \]
\[ y_i = \beta_0 + \beta_{1}x_i + \beta_{2}z_i + \beta_{3}x_{i}z_{i} + \epsilon_i \]
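To see that a linear model can capture a curved relationship, here is a sketch on simulated data (the x and y below are made up, not from this lecture); the quadratic model above is still linear because it is linear in the \(\beta\)s:

# A curved relationship fit with a *linear* model:
# the formula is linear in the betas, even though it is quadratic in x
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(50, sd = 2)
quad.fit <- lm(y ~ x + I(x^2))   # I() protects x^2 inside the formula
summary(quad.fit)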
Remember, don’t worry about the specific design names!
Parametric statistical models assume a probability distribution for the stochastic component
The type of response (e.g., continuous, count, binary) dictates which distribution is appropriate
Normal Distribution
The one continuous distribution we need to understand for linear models!
Unimodal and symmetric
Arises throughout nature via the Central Limit Theorem: sums and means of many small, independent effects tend toward a normal distribution
\[ f(x | \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
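To demystify the formula, one can evaluate it directly in R and compare against the built-in dnorm() (the values of mu, sigma, and x below are arbitrary):

mu <- 0; sigma <- 1; x <- 1.5
(1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))  # density by hand
dnorm(x, mean = mu, sd = sigma)                                  # built-in: same value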
\[ y_i = \beta_0 + \beta_{1}x_i + \epsilon_i \]
\[ \epsilon_i \sim Normal(0,\sigma^2) \]
We don’t know \(\beta_0\), \(\beta_1\) or \(\sigma^2\). How do we generate estimates for the parameters?
Frequentist: Maximum Likelihood
Bayesian: MCMC
We also need to quantify uncertainty in our estimates
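As a sketch of the frequentist route, the normal likelihood can be maximized directly with optim() on simulated data; lm() reaches the same \(\beta\) estimates via least squares, which coincides with maximum likelihood under normal errors:

# Maximum likelihood "by hand" on simulated data
set.seed(42)
x <- runif(50, 0, 10)
y <- 1 + 2 * x + rnorm(50, sd = 1.5)   # true beta0 = 1, beta1 = 2, sigma = 1.5
negloglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))  # exp() keeps sigma > 0
}
optim(c(0, 0, 0), negloglik)$par   # approximate beta0, beta1, log(sigma)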
Linear model with a continuous predictor (aka “linear regression”)
Continuous predictor (x = forest cover) and a continuous response variable (y = aboveground biomass)
\[ AGB_i = \beta_0 + \beta_{1}FOREST_i + \epsilon_i \] \[ \epsilon_i \sim Normal(0, \sigma^2) \]
## agb forest
## 1 540 95.7
## 2 429 78.4
## 3 259 53.2
## 4 487 82.1
## 5 155 35.2
## 6 387 65.4
540 = \(\beta_0\) * 1 + \(\beta_1\) * 95.7 + \(\epsilon_1\)
429 = \(\beta_0\) * 1 + \(\beta_1\) * 78.4 + \(\epsilon_2\)
259 = \(\beta_0\) * 1 + \(\beta_1\) * 53.2 + \(\epsilon_3\)
487 = \(\beta_0\) * 1 + \(\beta_1\) * 82.1 + \(\epsilon_4\)
155 = \(\beta_0\) * 1 + \(\beta_1\) * 35.2 + \(\epsilon_5\)
387 = \(\beta_0\) * 1 + \(\beta_1\) * 65.4 + \(\epsilon_6\)
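The code that originally built dat is not shown; it can be re-created from the printout above:

dat <- data.frame(
  agb    = c(540, 429, 259, 487, 155, 387),
  forest = c(95.7, 78.4, 53.2, 82.1, 35.2, 65.4)
)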
# Fit the regression of aboveground biomass on forest cover
out <- lm(agb ~ forest, data = dat)
summary(out)
##
## Call:
## lm(formula = agb ~ forest, data = dat)
##
## Residuals:
## 1 2 3 4 5 6
## -16.027 -13.327 -17.707 20.356 -3.407 30.112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -72.935 33.306 -2.19 0.093708 .
## forest 6.572 0.468 14.04 0.000149 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.81 on 4 degrees of freedom
## Multiple R-squared: 0.9801, Adjusted R-squared: 0.9752
## F-statistic: 197.2 on 1 and 4 DF, p-value: 0.0001492
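Once fit, the model can be used for prediction. As a quick sketch (70% forest cover is an arbitrary new value, not from the data):

# Predicted AGB at 70% forest cover, with a 95% confidence interval
predict(out, newdata = data.frame(forest = 70), interval = "confidence")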
Linear model with one categorical predictor with two levels (aka “t-test”)
Categorical predictor (x = land cover classification) and a continuous response (y = aboveground biomass)
\[ \mu_i = \beta_0 + \beta_{1}CLASS_i \]
\[ AGB_i \sim Normal(\mu_i, \sigma^2) \]
## agb forest forest.class
## 1 540 95.7 Forest
## 2 429 78.4 Forest
## 3 259 53.2 Non-forest
## 4 487 82.1 Forest
## 5 155 35.2 Non-forest
## 6 387 65.4 Forest
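The construction of the forest.class column is not shown either; from the printout it can be re-created as:

dat$forest.class <- factor(c("Forest", "Forest", "Non-forest",
                             "Forest", "Non-forest", "Forest"))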
How do we quantify a categorical predictor?
Create “dummy variables” consisting of 0’s and 1’s
The way we create dummy variables determines our interpretations
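model.matrix() reveals exactly which dummy variables R builds for each parameterization used below:

model.matrix(~ forest.class, data = dat)      # effects parameterization: intercept + dummy
model.matrix(~ forest.class - 1, data = dat)  # means parameterization: one column per level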
Effects parameterization: the intercept estimates the mean of the reference level (here “Forest”), and the other coefficient estimates the difference between levels
out.effects <- lm(agb ~ forest.class, data = dat)
summary(out.effects)
##
## Call:
## lm(formula = agb ~ forest.class, data = dat)
##
## Residuals:
## 1 2 3 4 5 6
## 79.25 -31.75 52.00 26.25 -52.00 -73.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 460.75 34.30 13.433 0.000178 ***
## forest.classNon-forest -253.75 59.41 -4.271 0.012939 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.6 on 4 degrees of freedom
## Multiple R-squared: 0.8202, Adjusted R-squared: 0.7752
## F-statistic: 18.24 on 1 and 4 DF, p-value: 0.01294
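This is why the model is nicknamed the “t-test”: a classical two-sample t-test assuming equal variances reproduces the t value and p-value of the forest.classNon-forest row above.

# Equivalent classical two-sample t-test (equal variances assumed)
t.test(agb ~ forest.class, data = dat, var.equal = TRUE)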
Means parameterization: removing the intercept (the - 1 in the formula) makes each coefficient the mean of its own level
out.means <- lm(agb ~ forest.class - 1, data = dat)
summary(out.means)
##
## Call:
## lm(formula = agb ~ forest.class - 1, data = dat)
##
## Residuals:
## 1 2 3 4 5 6
## 79.25 -31.75 52.00 26.25 -52.00 -73.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## forest.classForest 460.75 34.30 13.433 0.000178 ***
## forest.classNon-forest 207.00 48.51 4.267 0.012978 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.6 on 4 degrees of freedom
## Multiple R-squared: 0.9803, Adjusted R-squared: 0.9704
## F-statistic: 99.32 on 2 and 4 DF, p-value: 0.0003896
Confidence intervals
Three ingredients: the estimate, the standard error, and a t quantile \[ \text{Estimate} \pm t_{n-p,\ 1-\alpha/2} \times \text{Standard Error} \]
Interpretation is based on hypothetical replicate experiments: across many replicates, 95% of the intervals built this way would contain the true parameter value
confint(out.means)
## 2.5 % 97.5 %
## forest.classForest 365.51563 555.9844
## forest.classNon-forest 72.31826 341.6817
By hand
(out.info <- summary(out.means)$coefficients)
## Estimate Std. Error t value Pr(>|t|)
## forest.classForest 460.75 34.30083 13.432620 0.0001776766
## forest.classNon-forest 207.00 48.50870 4.267276 0.0129782221
N <- nrow(dat) # Number of observations.
p <- 2 # Number of parameters.
t.val <- qt(1 - 0.05 / 2, df = N - p)
(lower.ci <- out.info[, 1] - out.info[, 2] * t.val)
## forest.classForest forest.classNon-forest
## 365.51563 72.31826
(upper.ci <- out.info[, 1] + out.info[, 2] * t.val)
## forest.classForest forest.classNon-forest
## 555.9844 341.6817
Probability of obtaining a result equal to or “more extreme” than what was observed, assuming that the null hypothesis is true
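As a quick check, the p-value reported for the Non-forest coefficient in the means fit can be reproduced from its t value and the residual degrees of freedom:

# Two-sided p-value for t = 4.267276 with 4 residual degrees of freedom
2 * pt(-abs(4.267276), df = 4)   # matches 0.01298 in the summary table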
Modern statistics is leaning away from p-values. Why?
“Statistically significant” is an arbitrary threshold (read this article)
Parameter estimates and confidence intervals are more meaningful
If you do use p-values, report effect sizes alongside them
“Scientific conclusions and business or policy decisions should not be based on whether a p-value passes a specific threshold” - American Statistical Association, 2016
This week in class we will try out lab exercises covering these topics.