The goal of this lab is to

* Review assumptions of ANOVA/t-tests/regression
* Review diagnostic plots for normality and constant variance
* Introduce diagnostic plots for outliers
* Investigate the role of transformation on diagnostic plots
* Introduce how to model curved (non-linear) data with x^2 terms
* BONUS: material on R^2 for tomorrow's lecture appears at the end
Source Data
Data on the relationship between the amount of pigment on lion snouts and their age, from Whitman et al. (2004). These data are featured in Chapter 17 of Whitlock & Schluter's Analysis of Biological Data, 2nd ed.
The original data were presented in Figure 4, pg 2, of Whitman et al. (2004).
References: Whitman, K., A.M. Starfield, H.S. Quadling, and C. Packer. 2004. Sustainable trophy hunting of African lions. Nature.
#The following sets up the data for the analysis
#Set working directory
setwd("C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/Lecture/Unit3_regression/last_week")
Load data
dat <- read.csv("lion_age_by_pop.csv")
#basic scatterplot of the data
plot(age.years ~ portion.black,
data = dat)

#col = 2 colors the points red
plot(age.years ~ portion.black,
data = dat,
col = 2)

#pch = 2 changes the plotting symbol to open triangles
plot(age.years ~ portion.black,
data = dat,
col = 2,
pch = 2)
m1 <- lm(age.years ~ portion.black, data = dat)
summary(m1)
##
## Call:
## lm(formula = age.years ~ portion.black, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5457 -1.1457 -0.3384 0.9245 4.3426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9262 0.5591 1.657 0.108
## portion.black 10.5827 1.4884 7.110 6.59e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.66 on 30 degrees of freedom
## Multiple R-squared: 0.6276, Adjusted R-squared: 0.6152
## F-statistic: 50.55 on 1 and 30 DF, p-value: 6.59e-08
Look at the model output and write down

* ___ the model's "word" equation
* ___ the model's numeric equation
* ___ the p-value
* ___ whether it is significant relative to alpha = 0.05
* ___ R2 (also written R^2)
R2 can be thought of as

* How well the model fits the data
* How much of the variation in the data the model explains
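To make this concrete, here is a minimal sketch (an addition, not part of the original lab) of R2 computed by hand from m1, using the standard definition R^2 = 1 - (residual sum of squares)/(total sum of squares):

#compute R^2 by hand: 1 - (residual SS / total SS)
ss.resid <- sum(resid(m1)^2)
ss.total <- sum((dat$age.years - mean(dat$age.years))^2)
1 - ss.resid/ss.total #should match "Multiple R-squared" in summary(m1) above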
#plot data
plot(age.years ~ portion.black,
data = dat,
col = 2,
pch = 2)
#add regression line
abline(m1,
lwd = 3)
We look at model diagnostics and conduct "residual analysis" to assess whether the assumptions of the regression model are being met.
Model assumptions that can be investigated with model diagnostics are:

* Normality (residuals should be normal)
* Constant variance (the variability of the response variable y should not change for different values of the predictor variable x)
Another key assumption is that the data were sampled randomly; this, however, cannot be determined just by looking at model diagnostics.
Model diagnostics also allow us to determine

* if there are outliers
* if outliers are having undue influence on the results of the model
* data points with high influence are sometimes said to have high leverage
This can be investigated with

* A histogram of the raw residuals
* A "qqplot" (quantile-quantile plot)
Get residuals with the resid() function
m1.resid <- resid(m1)
Histogram of residuals with hist()
hist(m1.resid)
Does this indicate normality of the residuals?
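As an optional complement to the visual check (an addition, not part of the original lab), base R's shapiro.test() runs a formal normality test on the residuals; small p-values suggest non-normality:

#formal normality test of the residuals (optional)
shapiro.test(m1.resid)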
R's plot() function can automatically make a qqplot if you give it the model from the lm() function and tell it which = 2
plot(m1, which = 2,
main = "plot(model, which =2)")
There is some indication of non-normality but it isn’t horrible. Could be better, could be a lot worse.
We can look at these side by side w/the par() command
#Set par()
par(mfrow = c(1,2))
#make histogram of residuals
hist(m1.resid, main = "hist(residuals)")
#make qqplot
plot(m1, which = 2,
main = "plot(model, which =2)")
#re-set par()
par(mfrow = c(1,1))
Plot residuals of the model against fitted values. We can get fitted values with the fitted() function. The plot() function can also do this automatically if we tell it which = 1.
plot(m1,
which=1)
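For reference, essentially the same plot (minus the red smoother that plot.lm adds) can be built by hand from fitted() and resid() (an optional sketch, not part of the original lab):

#manual residuals vs. fitted plot
plot(fitted(m1), resid(m1),
xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2) #reference line at zero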
This can be seen more clearly by plotting the raw data with the regression line.
#plot data
plot(age.years ~ portion.black,
data = dat,
col = 2,
pch = 2)
#add regression line
abline(m1,
lwd = 3)
One of the data points falls rather far from the line.
Key idea

* Potential outliers will have a large distance between them and the regression line
* That is, they will have a large residual
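One rough numeric check (an optional rule of thumb, not part of the original lab) is to flag points whose standardized residuals are larger than about |2|:

#which points have standardized residuals beyond +/- 2?
which(abs(rstandard(m1)) > 2)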
We can investigate this issue using plot()
plot(m1, which = 5)
#compute influence measures and extract values for observation 32
inf <- influence.measures(m1)
hat32 <- inf$infmat[32,"hat"] #leverage (hat value) of point 32
r32 <- rstandard(m1)[32] #standardized residual of point 32

#draw an arrow pointing at observation 32
arrows(x0 = 0.125, x1 = c(hat32-0.015),
y0 = r32, y1 = r32)
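influence.measures() also has a summary method that flags potentially influential observations numerically (an optional addition to the plot above):

#table of observations flagged as potentially influential
summary(inf)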
If we set par() we can put all of the diagnostics together.
par(mfrow = c(2,2))
#plot histogram of residual
hist(m1.resid)
#plot qqplot
plot(m1, which = 2)
#Plot residual vs. fitted
plot(m1, which = 1)
#plot outlier diagnostics
plot(m1, which = 5)
#reset par
par(mfrow = c(1,1))
When data do not meet the assumptions of linear regression

* p-values will be inaccurate (usually too small)
* Confidence intervals will be inaccurate (usually too narrow)
Log transformations can help the data better match the assumptions of the model.
The log transformation has several useful properties

* It can improve normality
* It can reduce the spreading of points in residuals vs. fitted plots ("stabilize the variance")
* It can reduce the impact of outliers
* It helps assure that our p-values and confidence intervals are accurate
* A major drawback is that it's hard to interpret the parameters of a transformed model (b/c we don't measure things on the log scale in nature, and we don't easily conceptualize the log scale in our heads)
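As a quick optional illustration (not part of the original lab), we can compare the shape of the raw and logged response:

#side-by-side histograms of raw vs. logged response
par(mfrow = c(1,2))
hist(dat$age.years, main = "raw age.years")
hist(log(dat$age.years), main = "log(age.years)")
par(mfrow = c(1,1))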
Fit model with log(age.years) inside of lm(…)
m1.log <- lm( log(age.years) ~ portion.black, data = dat)
plot( log(age.years) ~ portion.black, data = dat)
abline(m1.log, col = 2, lwd = 2)
summary(m1.log)
##
## Call:
## lm(formula = log(age.years) ~ portion.black, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94095 -0.21100 -0.08545 0.31254 0.71893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5534 0.1297 4.267 0.000182 ***
## portion.black 2.2995 0.3452 6.661 2.24e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3851 on 30 degrees of freedom
## Multiple R-squared: 0.5966, Adjusted R-squared: 0.5831
## F-statistic: 44.36 on 1 and 30 DF, p-value: 2.237e-07
Look at the transformed model output and write down

* ___ the transformed model's "word" equation
* ___ the transformed model's numeric equation
* ___ the p-value
* ___ whether it is significant relative to alpha = 0.05
* ___ R2 (also written R^2)
How do the p-value and R2 compare between m1 and m1.log?
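One optional way to pull these numbers out directly (a sketch, not part of the original lab):

#extract R^2 and the slope's p-value from each model
summary(m1)$r.squared
summary(m1.log)$r.squared
summary(m1)$coefficients["portion.black","Pr(>|t|)"]
summary(m1.log)$coefficients["portion.black","Pr(>|t|)"]

Note that R^2 values are not strictly comparable between these two models because the responses (age.years vs. log(age.years)) are on different scales.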
Let's see how the transformation impacts the model.
The original model (m1) had some signs of non-normality.
Get residuals with the resid() function
m1.log.resid <- resid(m1.log)
Histogram of residuals with hist()
hist(m1.log.resid)
Does this indicate normality of the residuals?
Quantile-quantile plot
plot(m1.log, which = 2)
We can look at these side by side w/the par() command
#Set par()
par(mfrow = c(1,2))
#make histogram of residuals
hist(m1.log.resid)
#make qqplot
plot(m1.log, which = 2)
#re-set par()
par(mfrow = c(1,1))
plot(m1.log,
which=1)
There is no longer any obvious increase in spread in this plot, but the residuals do seem to go down, up, then back down…
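One optional way to see this down-up-down pattern more clearly (a sketch, not part of the original lab) is to add a smoother to the residuals vs. fitted values:

#residuals vs. fitted with a loess smoother
scatter.smooth(fitted(m1.log), resid(m1.log),
lpars = list(col = 2, lwd = 2))
abline(h = 0, lty = 2)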
#plot data
plot(log(age.years) ~ portion.black,
data = dat,
col = 2,
pch = 2)
#add regression line
abline(m1.log,
lwd = 3)
What has happened to the previous "outlier"?
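One optional numeric check (not part of the original lab) is to compare the standardized residual of point 32 before and after the transformation:

#standardized residual of observation 32 in each model
rstandard(m1)[32] #raw-scale model
rstandard(m1.log)[32] #log-scale model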
Recall

* Points with high leverage can have a large impact on regression results
* If points with high leverage also have large residuals, this can be problematic
We can investigate this issue using plot()
plot(m1.log, which = 5)
Recall:

* Points between the red lines labeled 0.5 are generally considered ok
* Points between the 0.5 and 1 lines might be problematic
* Points outside the red lines deserve careful consideration
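Those red lines are contours of Cook's distance; the underlying values can be extracted with cooks.distance() (an optional sketch, not part of the original lab):

#Cook's distance for each observation
cooks.distance(m1.log)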
If we set par() we can put all of the diagnostics together.
par(mfrow = c(2,2))
#1)plot histogram of residual
m1.log.resid <- resid(m1.log)
hist(m1.log.resid)
#2)plot qqplot
plot(m1.log, which = 2)
#3)Plot residual vs. fitted
plot(m1.log, which = 1)
#4)plot outlier diagnostics
plot(m1.log, which = 5)
#reset par
par(mfrow = c(1,1))
Raw data
plot(age.years ~ portion.black,
data = dat)
#remove observation 32 (the potential outlier)
dat.nooutlier <- dat[-32,]
plot(age.years ~ portion.black,
data = dat.nooutlier)
#scatterplot of the raw data with a loess smoother
scatter.smooth(dat.nooutlier$portion.black,
dat.nooutlier$age.years,
lpars = list(col = 2, lwd = 2))
Does the relationship look linear? Not really.
#same plot with the y variable logged
scatter.smooth(dat.nooutlier$portion.black,
log(dat.nooutlier$age.years), #note the log() here
lpars = list(col = 2,
lwd = 2))
We can try to account for this curviness by including another parameter in our model. This term is sometimes called

* a "non-linear" term
* or an "x^2" term
It's called an "x^2" term b/c we try to account for the curviness by modeling the x variable (portion.black) AND the x variable squared (portion.black^2).
IMPORTANT: It's annoying, but we have to put an I(…) around the x^2 term, like this: I(portion.black^2). The I() is needed because ^ has a special meaning inside R model formulas.
m2.x2 <- lm(age.years ~ portion.black +
I(portion.black^2), #Note the I!!!!
data = dat.nooutlier)
summary(m2.x2)
##
## Call:
## lm(formula = age.years ~ portion.black + I(portion.black^2),
## data = dat.nooutlier)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1276 -1.0977 -0.1634 0.9544 3.2395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6431 0.9351 0.688 0.4973
## portion.black 13.5767 5.6626 2.398 0.0234 *
## I(portion.black^2) -6.0439 6.9550 -0.869 0.3922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.443 on 28 degrees of freedom
## Multiple R-squared: 0.5901, Adjusted R-squared: 0.5609
## F-statistic: 20.16 on 2 and 28 DF, p-value: 3.774e-06
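Aside (an optional sketch, not part of the original lab): poly() fits the same quadratic without needing I(); its coefficients look different because poly() uses orthogonal polynomials, but the fitted values match those of m2.x2.

#equivalent quadratic fit using poly()
m2.poly <- lm(age.years ~ poly(portion.black, 2),
data = dat.nooutlier)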
The x^2 term (listed as "I(portion.black^2)") is not significant. We'll explore what happens when we include this term in the model.
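One optional way to check this (a sketch, not part of the original lab; m1.nooutlier is a hypothetical refit of the straight-line model, needed so both models use the same data) is an F-test comparing the models with and without the x^2 term:

#refit the straight-line model on the no-outlier data,
#then compare nested models with an F-test
m1.nooutlier <- lm(age.years ~ portion.black, data = dat.nooutlier)
anova(m1.nooutlier, m2.x2)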
Don’t worry too much about how this code works
#predictions from the x^2 model
m2.x2preds <- fitted(m2.x2)

#predicted vs. observed values
plot(m2.x2preds ~ dat.nooutlier$age.years)
#add predictions to dataframe
dat.nooutlier$m2.x2preds <- m2.x2preds
#plot RAW data with smoother
scatter.smooth(dat.nooutlier$portion.black,
dat.nooutlier$age.years, #raw (unlogged) y here
lpars = list(col = 2,
lwd = 2))
# ADD line of the non-linear model with x^2
#order points by x so the prediction line draws left to right
i <- order(dat.nooutlier$portion.black)
points(m2.x2preds ~ portion.black, dat.nooutlier[i,], type = "l", col = 3, lwd = 2, lty = 2)
legend("topleft", legend = c("smoother","x^2 model"),col = c(2:3), lty = c(1:2), cex = 0.8)
plot(age.years ~ portion.black,
data = dat)
abline(m1)
R2 <- round(summary(m1)$adj.r.squared,2) #adjusted R^2
R2 <- paste("R^2 = ",R2)
legend("topleft", legend = R2)
n <- dim(dat)[1]
#Plot raw data
plot(age.years ~ portion.black,
data = dat, ylim = c(0,14))
abline(m1)
x.obs <- dat$portion.black
y.preds <- predict(m1)
y.obs <- dat$age.years
#push each point 0.5 further from the regression line ("worse" fit)
y.noise.worse <- ifelse(y.obs > y.preds, c(y.obs+0.5),
c(y.obs-0.5))
#model noise
m1.noise.worse <- lm(y.noise.worse ~ x.obs)
#plot noise
points(y.noise.worse ~ x.obs,
data = dat, col = 2)
R2.obs <- round(summary(m1)$adj.r.squared,2)
R2.obs <- paste("R^2 orig = ",R2.obs)
R2.worse <- round(summary(m1.noise.worse)$adj.r.squared,2)
R2.worse <- paste("R^2 worse = ",R2.worse)
legend("topleft", legend = c(R2.obs,R2.worse), col = c(1,2),
pch = 1)
n <- dim(dat)[1]
#Plot raw data
plot(age.years ~ portion.black,
data = dat, ylim = c(0,14))
abline(m1)
x.obs <- dat$portion.black
y.preds <- predict(m1)
y.obs <- dat$age.years
#pull each point 0.5 closer to the regression line ("better" fit)
y.noise.better <- ifelse(y.obs > y.preds, c(y.obs-0.5),
c(y.obs+0.5))
#model noise
m1.noise.better <- lm(y.noise.better ~ x.obs)
#plot noise
points(y.noise.better ~ x.obs,
data = dat, col = 2)
R2.obs <- round(summary(m1)$adj.r.squared,2)
R2.obs <- paste("R^2 orig = ",R2.obs)
R2.better <- round(summary(m1.noise.better)$adj.r.squared,2)
R2.better <- paste("R^2 better = ",R2.better)
legend("topleft", legend = c(R2.obs,R2.better), col = c(1,2),
pch = 1)
n <- dim(dat)[1]
#Plot raw data
plot(age.years ~ portion.black,
data = dat, ylim = c(0,14))
abline(m1)
x.obs <- dat$portion.black
y.preds <- predict(m1)
y.obs <- dat$age.years
y.perfect <- y.preds #responses fall exactly on the regression line
#model noise
m1.perfect <- lm(y.perfect ~ x.obs)
#plot noise
points(y.perfect ~ x.obs,
data = dat, col = 2)
R2.obs <- round(summary(m1)$adj.r.squared,2)
R2.obs <- paste("R^2 orig = ",R2.obs)
R2.perfect <- round(summary(m1.perfect)$adj.r.squared,2)
## Warning in summary.lm(m1.perfect): essentially perfect fit: summary may be
## unreliable
R2.perfect <- paste("R^2 perfect = ",R2.perfect)
legend("topleft", legend = c(R2.obs,R2.perfect), col = c(1,2),
pch = 1)