1 Introduction

The goal of this lab is to

  • Review assumptions of ANOVA/t-tests/regression
  • Review diagnostic plots for normality and constant variance
  • Introduce diagnostic plots for outliers
  • Investigate the role of transformation on diagnostic plots
  • Introduce how to model curved (non-linear) data with x^2 terms
  • BONUS: material on R^2 for tomorrow's lecture appears at the end

Source Data

Data on the relationship between the amount of pigment on lion snouts and their age, from Whitman et al. (2004). These data are featured in Chapter 17 of Whitlock & Schluter's Analysis of Biological Data, 2nd ed.

The original data were presented in Figure 4, pg. 2, of Whitman et al. (2004).

References: Whitman, K., A.M. Starfield, H.S. Quadling, and C. Packer. 2004. Sustainable trophy hunting of African lions. Nature.

1.1 Preliminaries

#The following sets up the data for the analysis
#Set working directory

setwd("C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/Lecture/Unit3_regression/last_week")

Load data

dat <- read.csv("lion_age_by_pop.csv")

2 Plot Raw Data

2.1 Basic R plot

plot(age.years ~ portion.black, 
     data = dat)

2.2 Change color w/ col =

plot(age.years ~ portion.black, 
     data = dat,
     col = 2)





2.3 Change symbol w/ pch =

plot(age.years ~ portion.black, 
     data = dat,
     col = 2,
     pch = 2)









3 Fit basic regression model

3.1 Use raw data (no transformation)

m1 <- lm(age.years ~ portion.black, data = dat)





3.2 Look at model w/summary

summary(m1)
## 
## Call:
## lm(formula = age.years ~ portion.black, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5457 -1.1457 -0.3384  0.9245  4.3426 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.9262     0.5591   1.657    0.108    
## portion.black  10.5827     1.4884   7.110 6.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.66 on 30 degrees of freedom
## Multiple R-squared:  0.6276, Adjusted R-squared:  0.6152 
## F-statistic: 50.55 on 1 and 30 DF,  p-value: 6.59e-08

3.2.1 Tasks: write equation, p value, and R2 for regression model

Look at the model output and write down

  • ___ the model's "word" equation
  • ___ the model's numeric equation
  • ___ p-value
  • ___ whether it is significant relative to alpha = 0.05
  • ___ R2 (also written R^2)

R2 can be thought of as

  • How well the model fits the data
  • How much of the variation in the data the model explains
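For reference, R^2 can also be computed by hand from the residuals; a minimal sketch using m1 (the result should match the "Multiple R-squared" line in the summary above):

#R^2 = 1 - (residual sum of squares)/(total sum of squares)
ss.resid <- sum(resid(m1)^2)
ss.total <- sum((dat$age.years - mean(dat$age.years))^2)
1 - ss.resid/ss.total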





3.3 Plot regression line on scatterplot with abline()

#plot data
plot(age.years ~ portion.black, 
     data = dat,
     col = 2,
     pch = 2)

#add regression line
abline(m1,
       lwd = 3)







4 Look at model diagnostics

We look at model diagnostics and conduct "residual analysis" in order to understand whether the assumptions of the regression model are being met.

Model assumptions that can be investigated with model diagnostics are:

  • Normality (residuals should be normal)
  • Constant variance (the variability of the response variable y should not change for different values of the predictor variable x)

Another key assumption is that the data were sampled randomly; this, however, cannot be determined just by looking at model diagnostics.

Model diagnostics also allow us to determine

  • if there are outliers
  • if outliers are having undue influence on the results of the model
  • data points with high influence are sometimes said to have high leverage (see the sketch after this list)
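For reference, base R can report leverage and influence for each point directly; a minimal sketch (the 0.5 cutoff is just the common rule of thumb used by R's diagnostic plots):

#leverage of each observation
hatvalues(m1)

#Cook's distance combines leverage and residual size into a
#single measure of influence
cooks.distance(m1)

#flag points above a common rule-of-thumb cutoff
which(cooks.distance(m1) > 0.5)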







4.1 Assess assumption of normality

This can be investigated with

  • A histogram of the raw residuals
  • A "qqplot" (quantile-quantile plot)



4.1.1 Plot histogram of residuals

Get residuals with resid() function

m1.resid <- resid(m1)



Histogram of residuals with hist()

hist(m1.resid)

Does this indicate normality of the residuals?
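An optional formal check to go along with the histogram: base R's shapiro.test() tests the residuals for normality (a small p-value suggests non-normal residuals).

#Shapiro-Wilk test of normality on the residuals
shapiro.test(m1.resid)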



4.1.2 Plot qqplot

R's plot() function can automatically make a qqplot if you give it a model from the lm() function and specify the plot with the which = argument

plot(m1, which = 2)

There is some indication of non-normality, but it isn't horrible. Could be better, could be a lot worse.





4.1.3 Optional: Plot both diagnostics side by side

We can look at these side by side w/the par() command

#Set par()
par(mfrow = c(1,2))

#make histogram of residuals
hist(m1.resid)

#make qqplot
plot(m1, which = 2)

#re-set par()
par(mfrow = c(1,1))





4.2 Assess assumption of Constant Variance

Plot the residuals of the model against the fitted values. We can get fitted values with the fitted() function. The plot() function can also do this automatically if we tell it which = 1.

4.2.1 Plot Residuals ~ Fitted

plot(m1, 
     which=1)

  • The spread of the residuals (on the y axis) changes as the fitted values (x axis) increase.
  • The variance is therefore “not constant” as the x-axis changes
  • This is a major problem and should cause more concern than non-normality





The increasing spread can be seen more clearly by plotting the absolute residuals against the fitted values; a minimal sketch:
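#plot absolute residuals against fitted values; an upward trend
#in the smoother suggests non-constant variance
scatter.smooth(fitted(m1), abs(resid(m1)),
               xlab = "Fitted values",
               ylab = "abs(residuals)",
               lpars = list(col = 2, lwd = 2))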







4.3 Investigate outliers of regression with raw data

4.3.0.1 Plot raw data with the regression line

#plot data
plot(age.years ~ portion.black, 
     data = dat,
     col = 2,
     pch = 2)

#add regression line
abline(m1,
       lwd = 3)





One of the data points falls rather far from the line.

  • This point could qualify as an outlier.
  • Outliers are “extreme observations”
  • Outliers can be due to mistakes while collecting or entering data
  • They can also occur just due to chance, by observing an individual at the far end of the population distribution
  • Outliers are therefore not necessarily "bad data"
  • They can make statistical analyses problematic

Key idea:

  • Potential outliers will have a large distance between them and the regression line
  • That is, they will have a large residual (sketched in the next section)





4.3.1 Plot residual of potential outlier
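A minimal sketch of one way to do this: find the point with the largest absolute residual and draw its residual as a vertical segment from the point to the regression line.

#find the observation with the largest absolute residual
i.out <- which.max(abs(resid(m1)))

#plot data and regression line
plot(age.years ~ portion.black, 
     data = dat,
     col = 2,
     pch = 2)
abline(m1, lwd = 3)

#draw the residual as a vertical segment from the point to the line
segments(x0 = dat$portion.black[i.out],
         y0 = fitted(m1)[i.out],
         x1 = dat$portion.black[i.out],
         y1 = dat$age.years[i.out],
         col = 4, lwd = 2)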





4.3.2 Plot outlier diagnostics

  • Points with high leverage can have a large impact on regression results
  • If points with high leverage also have large residuals this can be problematic

We can investigate this issue using plot()

plot(m1, which = 5)

  • Points between the red lines labeled 0.5 are generally considered ok
  • Points between the 0.5 and 1 lines might be problematic
  • Points outside the red lines deserve careful consideration
  • The point marked “32”, which is our point w/the largest residual, is near the red 1 line.







4.3.3 Plot all diagnostics together

If we set par() we can put all of the diagnostics together.

par(mfrow = c(2,2))

#plot histogram of residual
hist(m1.resid)

#plot qqplot
plot(m1, which = 2)

#Plot residual vs. fitted
plot(m1, which = 1)

#plot outlier diagnostics
plot(m1, which = 5)

#reset par
par(mfrow = c(1,1))







5 Log transformation to meet model assumptions

When data do not meet the assumptions of linear regression

  • p-values will be inaccurate (usually too small)
  • Confidence intervals will be inaccurate (usually too narrow)

Log transformations can help the data better match the assumptions of the model.

The log transformation has several useful properties:

  • It can improve normality
  • It can reduce the spreading of points in the residual vs. fitted plot ("stabilize the variance")
  • It can reduce the impact of outliers
  • It helps assure that our p-values and confidence intervals are accurate
  • A major drawback of the log transformation is that it's hard to interpret the parameters of a transformed model, b/c we don't measure things on the log scale in nature and we don't conceptualize the log scale easily in our heads (one workaround, back-transforming with exp(), is sketched after the model is fit below)





5.1 Refit regression model w/ log transformation

Fit model with log(age.years) inside of lm(…)

m1.log <- lm( log(age.years) ~ portion.black, data = dat)

5.2 Plot transformed data and model

  • If we want to plot the model against the data, we need to also transform what we plot.
  • The y-axis is now log(age.years)
plot( log(age.years) ~ portion.black, data = dat)
abline(m1.log, col = 2, lwd = 2)
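Because the model is now on the log scale, its predictions are in log(years). One way to get back to years is to back-transform with exp(); a minimal sketch (the predictor value 0.5 is just an arbitrary example):

#predict log(age) at an arbitrary example value of portion.black
new.dat  <- data.frame(portion.black = 0.5)
log.pred <- predict(m1.log, newdata = new.dat)

#back-transform to years with exp()
#note: exp() of a mean on the log scale estimates the median of y, not the mean
exp(log.pred)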





5.3 Look at summary of transformed model



summary(m1.log)
## 
## Call:
## lm(formula = log(age.years) ~ portion.black, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94095 -0.21100 -0.08545  0.31254  0.71893 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.5534     0.1297   4.267 0.000182 ***
## portion.black   2.2995     0.3452   6.661 2.24e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3851 on 30 degrees of freedom
## Multiple R-squared:  0.5966, Adjusted R-squared:  0.5831 
## F-statistic: 44.36 on 1 and 30 DF,  p-value: 2.237e-07

5.3.1 Tasks: write equation, p value, and R2 for log transformed regression model

Look at the transformed model output and write down

  • ___ the transformed model's "word" equation
  • ___ the transformed model's numeric equation
  • ___ p-value
  • ___ whether it is significant relative to alpha = 0.05
  • ___ R2 (also written R^2)

How do the p-value and R2 compare between m1 and m1.log?
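A minimal sketch for pulling these quantities out of the model summaries for comparison:

#R^2 for each model
summary(m1)$r.squared        #untransformed model
summary(m1.log)$r.squared    #log-transformed model

#p-value for the slope in each model
summary(m1)$coefficients["portion.black", "Pr(>|t|)"]
summary(m1.log)$coefficients["portion.black", "Pr(>|t|)"]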





5.4 Look at model diagnostics for transformed model

Let’s see how the transformation impacts that model.

5.4.1 Assess assumption of normality for transformed model

The original model (m1) showed some signs of non-normality.

5.4.1.1 Plot histogram of residuals from logged model

Get residuals with resid() function

m1.log.resid <- resid(m1.log)



Histogram of residuals with hist()

hist(m1.log.resid)

Does this indicate normality of the residuals?





5.4.1.2 Plot qqplot for transformed data

quantile-quantile plot

plot(m1.log, which = 2)





5.4.1.3 Optional: Plot both diagnostics side by side

We can look at these side by side w/the par() command

#Set par()
par(mfrow = c(1,2))

#make histogram of residuals
hist(m1.log.resid)

#make qqplot
plot(m1.log, which = 2)

#re-set par()
par(mfrow = c(1,1))





5.4.2 Assess assumption of Constant Variance

5.4.2.1 Plot Residuals ~ Fitted

plot(m1.log, 
     which=1)

There is no longer an obvious increase in spread across this plot, but the residuals do seem to go down, up, then back down…







5.4.3 Investigate outliers in regression with logged data

5.4.3.1 Plot transformed data with the regression line

#plot data
plot(log(age.years) ~ portion.black, 
     data = dat,
     col = 2,
     pch = 2)

#add regression line
abline(m1.log,
       lwd = 3)





What has happened to the previous "outlier"?





5.4.3.2 Plot residual of former outlier
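A minimal sketch, reusing the approach from section 4.3.1 on the log scale (row 32 is the point flagged by the untransformed model):

#row 32 was the point with the largest residual in the raw-data model
i.out <- 32

#plot transformed data and regression line
plot(log(age.years) ~ portion.black, 
     data = dat,
     col = 2,
     pch = 2)
abline(m1.log, lwd = 3)

#draw the residual of the former outlier as a vertical segment
segments(x0 = dat$portion.black[i.out],
         y0 = fitted(m1.log)[i.out],
         x1 = dat$portion.black[i.out],
         y1 = log(dat$age.years[i.out]),
         col = 4, lwd = 2)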





5.4.3.3 Plot formal outlier diagnostics

Recall:

  • Points with high leverage can have a large impact on regression results
  • If points with high leverage also have large residuals this can be problematic

We can investigate this issue using plot()

plot(m1.log, which = 5)

Recall:

  • Points between the red lines labeled 0.5 are generally considered ok
  • Points between the 0.5 and 1 lines might be problematic
  • Points outside the red lines deserve careful consideration







5.4.3.4 Plot all diagnostics together

If we set par() we can put all of the diagnostics together.

par(mfrow = c(2,2))

#1)plot histogram of residual
m1.log.resid <- resid(m1.log)
hist(m1.log.resid)

#2)plot qqplot
plot(m1.log, which = 2)

#3)Plot residual vs. fitted
plot(m1.log, which = 1)

#4)plot outlier diagnostics
plot(m1.log, which = 5)

#reset par
par(mfrow = c(1,1))





6 Non-linear data

Raw data

plot(age.years ~ portion.black, 
     data = dat)





6.1 What if we drop that potential outlier?

#remove row 32, the potential outlier
dat.nooutlier <- dat[-32,]
plot(age.years ~ portion.black, 
     data = dat.nooutlier)





6.2 Is there curvature to the data?

  • Add “smoother” w/scatter.smooth
  • Note that we can’t use a “~” here, use comma instead
  • Also have to reverse order of terms (portion.black comes first)
scatter.smooth(dat.nooutlier$portion.black, 
               dat.nooutlier$age.years,
               lpars = list(col = 2, lwd =2))

  • Some curvature in the data
  • This indicates some “non-linearity”

6.3 Does log impact the curve?

Not really

scatter.smooth(dat.nooutlier$portion.black, 
               log(dat.nooutlier$age.years),  #log this line
               lpars = list(col = 2, 
                            lwd =2))



6.4 Fit regression model w/ “non-linear term”

We can try to account for this curviness by including another parameter in our model. This term is sometimes called

  • a "non-linear" term
  • or an "x^2" term

It's called an "x^2" term b/c we try to account for the curviness by modeling the x variable (portion.black) AND the x variable squared (portion.black^2). That is, the model becomes age.years = b0 + b1(portion.black) + b2(portion.black^2).

6.4.1 Model w/ x^2 term

IMPORTANT: It's annoying, but we have to put an I(…) around the x^2 term, like this: I(portion.black^2)

m2.x2 <- lm(age.years ~ portion.black +
                 I(portion.black^2),  #Note the I!!!!
     data = dat.nooutlier)


6.4.1.1 Look at summary of model with x^2 term

summary(m2.x2)
## 
## Call:
## lm(formula = age.years ~ portion.black + I(portion.black^2), 
##     data = dat.nooutlier)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1276 -1.0977 -0.1634  0.9544  3.2395 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          0.6431     0.9351   0.688   0.4973  
## portion.black       13.5767     5.6626   2.398   0.0234 *
## I(portion.black^2)  -6.0439     6.9550  -0.869   0.3922  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.443 on 28 degrees of freedom
## Multiple R-squared:  0.5901, Adjusted R-squared:  0.5609 
## F-statistic: 20.16 on 2 and 28 DF,  p-value: 3.774e-06

The x^2 term (listed as "I(portion.black^2)") is not significant. We'll explore what happens when we include this term in the model anyway.
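As an aside, a formal way to ask whether the x^2 term improves the model is to compare nested models with an F-test; a minimal sketch (m2.x1 is a name introduced here for the linear fit to the same no-outlier data):

#fit the linear (no x^2) model to the same no-outlier data
m2.x1 <- lm(age.years ~ portion.black, data = dat.nooutlier)

#F-test comparing the nested models; a large p-value says the
#x^2 term does not significantly improve the fit
anova(m2.x1, m2.x2)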

6.4.1.2 Predictions from model with x^2 term

Don’t worry too much about how this code works

#predictions
m2.x2preds <- fitted(m2.x2)


plot(m2.x2preds ~ dat.nooutlier$age.years)

#add predictions to dataframe
dat.nooutlier$m2.x2preds <- m2.x2preds


#plot RAW data with smoother
scatter.smooth(dat.nooutlier$portion.black, 
               dat.nooutlier$age.years,
               lpars = list(col = 2, 
                            lwd =2))

# ADD line of the non-linear model with x^2
i <- order(dat.nooutlier$portion.black)
points(m2.x2preds ~ portion.black, dat.nooutlier[i,], type = "l", col = 3, lwd = 2, lty = 2)
legend("topleft", legend = c("smoother","x^2 model"),col = c(2:3), lty = c(1:2), cex = 0.8)






7 Investigate R^2

  • R^2 indicates how well the model fits the data
  • The higher the R^2, the more variation in the data the model explains
  • Models can have low p-values ("highly significant") yet not have a high R^2, and therefore not explain much of what is going on in the data (a simulated example is sketched after this list)
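A minimal simulated sketch of that last point (all numbers here are arbitrary): with a large sample and a weak signal, the slope is typically highly significant while R^2 stays low.

#simulate a large sample with a weak signal and lots of noise
set.seed(1)                        #arbitrary seed for reproducibility
x.sim <- runif(1000)
y.sim <- 1*x.sim + rnorm(1000, sd = 2)

m.sim <- lm(y.sim ~ x.sim)

summary(m.sim)$coefficients        #slope p-value is typically very small
summary(m.sim)$r.squared           #but R^2 is low (~0.02)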

7.1 Plot raw data

plot(age.years ~ portion.black, 
     data = dat)
abline(m1)
R2 <- round(summary(m1)$adj.r.squared,2)
R2 <- paste("R^2 = ",R2)
legend("topleft", legend =  R2)





7.2 Plot more variable data

n <- dim(dat)[1]

#Plot raw data
plot(age.years ~ portion.black, 
     data = dat, ylim = c(0,14))
abline(m1)
x.obs <- dat$portion.black
y.preds <- predict(m1)
y.obs <- dat$age.years

y.noise.worse <- ifelse(y.obs > y.preds, c(y.obs+0.5), 
                  c(y.obs-0.5))

#model noise
m1.noise.worse <- lm(y.noise.worse ~ x.obs)

#plot noise
points(y.noise.worse ~ x.obs, 
     data = dat, col = 2)

R2.obs <- round(summary(m1)$adj.r.squared,2)
R2.obs <- paste("R^2 orig = ",R2.obs)

R2.worse <- round(summary(m1.noise.worse)$adj.r.squared,2)
R2.worse <- paste("R^2 = ",R2.worse)


legend("topleft", legend =  c(R2.obs,R2.worse), col = c(1,2),
       pch = 1)

7.3 Plot less variable data

n <- dim(dat)[1]

#Plot raw data
plot(age.years ~ portion.black, 
     data = dat, ylim = c(0,14))
abline(m1)
x.obs <- dat$portion.black
y.preds <- predict(m1)
y.obs <- dat$age.years

y.noise.better <- ifelse(y.obs > y.preds, c(y.obs-0.5), 
                  c(y.obs+0.5))

#model noise
m1.noise.better <- lm(y.noise.better ~ x.obs)

#plot noise
points(y.noise.better ~ x.obs, 
     data = dat, col = 2)

R2.obs <- round(summary(m1)$adj.r.squared,2)
R2.obs <- paste("R^2 = ",R2.obs)

R2.better <- round(summary(m1.noise.better)$adj.r.squared,2)
R2.better <- paste("R^2 = ",R2.better)


legend("topleft", legend =  c(R2.obs,R2.better), col = c(1,2),
       pch = 1)



7.4 Plot perfect data

n <- dim(dat)[1]

#Plot raw data
plot(age.years ~ portion.black, 
     data = dat, ylim = c(0,14))
abline(m1)
x.obs <- dat$portion.black
y.preds <- predict(m1)
y.obs <- dat$age.years

y.perfect <- y.preds

#model noise
m1.perfect <- lm(y.perfect ~ x.obs)

#plot noise
points(y.perfect ~ x.obs, 
     data = dat, col = 2)

R2.obs <- round(summary(m1)$adj.r.squared,2)
R2.obs <- paste("R^2 = ",R2.obs)

R2.perfect <- round(summary(m1.perfect)$adj.r.squared,2)
## Warning in summary.lm(m1.perfect): essentially perfect fit: summary may be
## unreliable
R2.perfect <- paste("R^2 = ",R2.perfect)


legend("topleft", legend =  c(R2.obs,R2.perfect), col = c(1,2),
       pch = 1)