What does regression do?

Steps in regression

Aide: what do I mean by “multilevel” or “hierarchical”?

What is it?

  • Many names for the same / similar issue
    • repeated measures
    • blocking
    • random effects / mixed effects models (see the sketch below)
    • multilevel models
    • hierarchical models
  • Measurements taken on the same thing multiple times, or on similar experimental units, are likely to be more similar to each other than expected by chance
  • Violates the assumption of independence in regression/ANOVA

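These models are fit in R with packages such as lme4 or nlme rather than lm(). A minimal sketch with lme4 (using lme4’s built-in sleepstudy data, which is not part of these notes):

library(lme4)

# repeated measures: many Reaction times per Subject,
# modeled with a random intercept for each Subject
lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)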

Multilevel Examples:

  • Education studies
    • students in class rooms in schools in districts in states
    • (multiple students per class, multiple classes per school, multiple schools per district)
  • Health care studies
    • patients on hospital floors in hospitals in cities
  • Animal studies
    • repeated measures on mice in cages
    • (multiple measurements per mouse, multiple mice per cage)

Regression models in R using lm()

Basic R regression

  • “lm()” = “linear model”
  • regression, t-tests, ANOVA, ANCOVA are all linear models
  • all are fit with the lm() function (see the t.test() comparison below)
  • (Welch’s correction is only available via t.test(); a similar effect can be achieved with gls())
  • “gls()” = generalized least squares
  • “glm()” = generalized linear model
lm.mass <- lm(fat.percent ~ log(mass.female),
             data = milk3)

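A quick sketch of the “t-tests are linear models” point, using R’s built-in sleep data (not the milk data):

# two-sample t-test without the Welch correction
t.test(extra ~ group, data = sleep, var.equal = TRUE)

# the same test as a linear model: the group coefficient has the
# same t statistic (up to sign) and the same p-value
summary(lm(extra ~ group, data = sleep))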

Model summary

  • Look at model w/ summary()
  • (we can get just the coefficients/slopes with coef() )
  • (another useful function is arm::display(); it has shorter output)
  • (recall that you can save these summary outputs and access them as lists with a $; see the sketch below)
summary(lm.mass)
## 
## Call:
## lm(formula = fat.percent ~ log(mass.female), data = milk3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8.34  -4.18  -1.05   1.22  15.74 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        20.515      3.249    6.31  1.7e-07 ***
## log(mass.female)   -1.752      0.457   -3.83  0.00044 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.96 on 40 degrees of freedom
## Multiple R-squared:  0.268,  Adjusted R-squared:  0.25 
## F-statistic: 14.7 on 1 and 40 DF,  p-value: 0.000443

We’ll unpack what each of these is

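A minimal sketch of saving the summary and accessing its pieces with $:

lm.sum <- summary(lm.mass)
lm.sum$coefficients   # just the coefficient table
lm.sum$r.squared      # the R-squared reported above (0.268)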

Plot model w/ summary output

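A minimal sketch of this plot (raw data plus the fitted line from the summary above):

plot(fat.percent ~ log(mass.female), data = milk3)
abline(lm.mass, col = 2)  # intercept = 20.5, slope = -1.75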

Plot model w/ equations

Add location of intercept

Coefficient plot

What is this?

  • arm::coefplot()
  • NOTE: by convention, axes are flipped! (a minimal call is sketched below)

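A minimal call might look like this (the intercept is omitted by default and can be added with intercept = TRUE):

# plot each coefficient estimate with its uncertainty bars
arm::coefplot(lm.mass)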

Compare coefficient plot to raw data

par(mfrow = c(1, 2), mar = c(5, 2, 7, 0.25))

# left panel: coefficient plot, intercept included
arm::coefplot(lm.mass, intercept = TRUE,
              mar = c(5, 6, 8, 1),
              cex.pts = 2,
              col.pts = 1:2,
              pch.pts = 16:17,
              lwd = 3)
mtext(text = "y = m*x + b",
      side = 1, adj = 1, line = -2)
mtext(text = "fat = slope*log(mass) + intercept",
      side = 1, adj = 1, line = -1)

# right panel: make.plot() is a custom helper (defined elsewhere)
# that plots the raw data
make.plot()

# mark the intercept: where the fitted line crosses x = 0
y. <- coef(lm.mass)[1]
abline(v = 0, col = 2, lwd = 2)
arrows(x0 = 1, x1 = 0,
       col = 2, lwd = 2,
       y0 = y.)
lab. <- round(coef(lm.mass)[1], 2)
text(x = 1.71, y = y.,
     label = lab.)

Linear regression as model fitting

Fit 2 models

Null model (Ho): flat line

  • Null hypothesis: ????
  • Null model: “fat.percent ~ 1”
  • “~1” means fit a model
    • w/ slope = 0
    • intercept = mean of response variable
lm.null <- lm(fat.percent ~ 1,
              data = milk3)

Alternative model: fat.percent ~ log(mass.female)


lm.mass <- lm(fat.percent ~ log(mass.female),
              data = milk3)

Plot both models

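A minimal sketch of this plot, overlaying both models on the raw data:

plot(fat.percent ~ log(mass.female), data = milk3)
abline(lm.mass, col = 2)            # alternative model: fitted slope
abline(h = coef(lm.null), col = 4)  # null model: flat line at the mean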

What is the intercept of the null model?

The mean fat value is 8.6

summary(milk3[,"fat.percent"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.70    3.75    6.60    8.58   11.90   27.00

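We can confirm this by extracting the null model’s single coefficient:

coef(lm.null)  # (Intercept) = mean of fat.percent (~8.58)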

Test model: significance test

Nested Models

Ha: y = -1.75*log(mass) + 20.5
Ho: y = 0*log(mass) + 8.6

  • Ha has TWO parameters (aka coefficients, betas)
    • Intercept = 20.515
    • Slope = -1.75
  • Ho has ONE parameter
    • Intercept = 8.6
    • (Slope = 0, so it doesn’t count)
  • The hypothesis test can be formulated as a comparison of 2 “nested models”
    • Applies to t-tests, ANOVA, ANCOVA
    • A major difference between hypothesis testing w/ p-values and IT-AIC is that AIC doesn’t require models to be nested!

Hypothesis test of nested models in R

  • anova()
  • carries out an F test for linear models
  • “likelihood ratio test” for GLMs/GLMMs
    • GLMM = generalized linear mixed models
    • Models w/ random effects, blocking, repeated measures, etc.
# NOTE: the order of models w/in anova() doesn't matter
anova(lm.null, 
      lm.mass)
## Analysis of Variance Table
## 
## Model 1: fat.percent ~ 1
## Model 2: fat.percent ~ log(mass.female)
##   Res.Df  RSS Df Sum of Sq    F  Pr(>F)    
## 1     41 1943                              
## 2     40 1422  1       521 14.7 0.00044 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Output of anova() is similar to the content of summary()
  • I like to emphasize the testing of nested models (see the by-hand check of the F statistic below)
  • GLMs and GLMMs often can only be tested with anova()

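The F statistic in this table can be reproduced by hand from the two residual sums of squares (RSS): F = ((1943 - 1422)/1)/(1422/40), which is about 14.7:

rss.null <- sum(resid(lm.null)^2)  # 1943
rss.mass <- sum(resid(lm.mass)^2)  # 1422
((rss.null - rss.mass) / 1) / (rss.mass / 40)  # F = ~14.7, as above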

Examine model: IT-AIC

IT-AIC

  • “Information Theory - Akaike’s Information Criterion”
  • “Information Theory” is a general approach to investigating models
  • AIC is one particular information criterion
    • Others: BIC, WAIC, FIC, DIC (Bayesian), …
    • seems like every statistician has one now
  • AICc: correction for small sample sizes
    • Should always use AICc
  • qAIC: correction for “overdispersion”
    • For GLMs (e.g. logistic regression)

AIC in base R

  • Note: no AICc command in base R!
  • Other packages can do AICc and other ICs
  • df = number of parameters (see the by-hand check below)
AIC(lm.null, 
    lm.mass)
##         df AIC
## lm.null  2 284
## lm.mass  3 273

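Here df is the number of parameters, and AIC = -2*log-likelihood + 2*df. A quick check for lm.mass (df = 3: intercept, slope, and residual standard deviation):

logLik(lm.mass)                           # log-likelihood (df = 3)
-2 * as.numeric(logLik(lm.mass)) + 2 * 3  # reproduces AIC = 273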

AIC table

  • Lower is “better”
  • Absolute values have NO MEANING
  • AICs are only interpretable relative to other AICs
  • focus: “delta AIC” (dAIC)
  • AIC goes down if a model fits the data better
  • BUT - every additional parameter adds a “penalty” of +2
library(bbmle)

AICtab(lm.null,
       lm.mass,
       base = T)
##         AIC   dAIC  df
## lm.mass 273.1   0.0 3 
## lm.null 284.2  11.1 2

AICc

  • AICc is a correction for small sample size
  • Doesn’t make much difference here
library(bbmle)

ICtab(lm.null,
      lm.mass,
      type = c("AICc"),
      base = T)
##         AICc  dAICc df
## lm.mass 273.7   0.0 3 
## lm.null 284.5  10.8 2

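AICc can also be computed by hand from AIC: AICc = AIC + 2k(k + 1)/(n - k - 1), where k = number of parameters and n = sample size. For lm.mass (k = 3, n = 42):

k <- 3
n <- nrow(milk3)  # 42 observations
AIC(lm.mass) + (2 * k * (k + 1)) / (n - k - 1)  # ~273.7, matching ICtab()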

What about R^2?

Hypothesis testing:

  • p-values: tell you if a predictor is “significantly” associated w/ a response
  • Do not tell you if the EFFECT SIZE is large
    • p can be very very small
    • but the slope can also be very small

IT-AIC

  • AIC tells you if one model fits the data better than another
  • AIC doesn’t tell you if either model fits the data very well

R^2

  • Tells you how much variation in the data is explained by the model
  • whether you use p-values or AIC, you should report R^2 (see below)

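In base R, R^2 comes from the model summary:

summary(lm.mass)$r.squared      # 0.268
summary(lm.mass)$adj.r.squared  # 0.25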

Null vs Alternative model

  • Slope of the Alt model is highly significant
  • But R^2 isn’t huge
  • Lots of scatter around the line
  • Other predictors might improve the fit