Simple linear regression

  1. Go to http://www.stat.ufl.edu/~winner/datasets.html and find the dataset called “Head Size and Brain Weight”. Open the “Description” and the “Data” files in separate browser tabs. Read the description and then switch to the data tab and press Ctrl + a to highlight the entire dataset and copy it via Ctrl + c (On Mac use: Cmd + a and Cmd + c). Use the following code to import the data into R:
## On PC
## brains <- read.table(file = "clipboard", header = F)

# For some people the following approach also worked last time:
# brains <- readClipboard()

## On Mac
brains <- read.table(pipe(description = "pbpaste"), header = F)
  1. The data frame comes without column names. Make use of the names command to assign the following headers: gender, age.range, head.size and brain.weight
head(brains, n = 3)
names(brains) <- c("gender", "age.range", "head.size", "brain.weight")
head(brains, n = 3) # check whether name change has happened
  1. Examine the structure of the data frame and plot the variable brain.weight against head.size.
## Check the structure
str(brains)
## 'data.frame':    237 obs. of  4 variables:
##  $ gender      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ age.range   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ head.size   : int  4512 3738 4261 3777 4177 3585 3785 3559 3613 3982 ...
##  $ brain.weight: int  1530 1297 1335 1282 1590 1300 1400 1255 1355 1375 ...
plot(brain.weight ~ head.size, data = brains)

  1. Run a simple linear regression model with brain.weight as response variable and head.size as predictor variable.
m1 <- lm(brain.weight ~ head.size, data = brains)
m0 <- lm(brain.weight ~ 1, data = brains)
mean(brains$brain.weight)
## [1] 1282.873
summary(m0)
## 
## Call:
## lm(formula = brain.weight ~ 1, data = brains)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -327.87  -75.87   -2.87   67.13  352.13 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1282.873      7.817   164.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 120.3 on 236 degrees of freedom
anova(m0, m1)
  1. Use par(mfrow = c(2, 2)) to split the plot window into four panels and inspect the model diagnostic plots using plot(yourmodelname) (where ‘yourmodelname’ is obviously the object name that you gave your model). Comment on the underlying model assumptions of variance homogeneity and normality of the residuals and whether there are influential observations.
par(mfrow = c(2, 2), mar = c(3.5, 3, 1.5, 1), cex.lab = 0.9, cex.axis = 0.8, mgp = c(1.9, 0.7, 0)) # see ?par if you are interested in what all those arguments are specifying... 
plot(m1)

The residual vs. fitted values plot indicates no violation of the variance homogeneity criterion and the quantile-quantile plot suggests normality of the residuals. No influential observations were flagged.

  1. Inspect and interpret the model summary output.
summary(m1) # An intercept of 325 doesn't make any sense: no head capsule - no brain!
## 
## Call:
## lm(formula = brain.weight ~ head.size, data = brains)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -175.98  -49.76   -1.76   46.60  242.34 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 325.57342   47.14085   6.906 4.61e-11 ***
## head.size     0.26343    0.01291  20.409  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72.43 on 235 degrees of freedom
## Multiple R-squared:  0.6393, Adjusted R-squared:  0.6378 
## F-statistic: 416.5 on 1 and 235 DF,  p-value: < 2.2e-16

Both, the intercept and slope paramters are statistically significant but does a significant intercept make sense biologically? The slope suggests that with an increase in 1 \(cm^3\) head volume, the brain weight increases by 0.26 g. The adjusted \(R^2\) values implies that model explains about 64% of the variations seen in the raw data.\ We should rerun the model without intercept term to force the model fit through the origin, i.e. there is no brain when head size is zero.

m2 <- lm(brain.weight ~ head.size - 1, data = brains)
summary(m2)
## 
## Call:
## lm(formula = brain.weight ~ head.size - 1, data = brains)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -234.153  -47.988    6.495   56.307  250.163 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## head.size  0.35213    0.00141   249.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.27 on 236 degrees of freedom
## Multiple R-squared:  0.9962, Adjusted R-squared:  0.9962 
## F-statistic: 6.238e+04 on 1 and 236 DF,  p-value: < 2.2e-16
  1. Add the model fit to the plot of the raw data.
plot(brain.weight ~ head.size, data = brains, ylim = c(950, 1750))
abline(m1)
abline(m2, col = "red")

  1. Use the predict function to obtain a brain weight prediction for a head size of 4750 \(cm^3\).
point.pred <- predict(m2, newdata = data.frame(head.size = 4750))
  1. Repeat d. and add the argument interval = "confidence" within the predict function. What does it return in addition to the predicted value?
predict(m2, newdata = data.frame(head.size = 4750), interval = "confidence")
##        fit      lwr      upr
## 1 1672.609 1659.416 1685.802

By specifying interval = "confidence", the predict function returns the fitted value including the upper and lower 95% confidence intervals around it.

  1. This last task is for the keen ones among you: Compute the model predictions along with their 95% confidence intervals across the entire range of head sizes on a very fine grid (the predict function returns a matrix, so create an assignment for the predictions and convert the object into a data frame). Now figure out how to add lines indicating the upper and lower bounds of the confidence intervals to the plot. Use the lines command with the argument lty = 2 (meaning dashed lines). Ok, here is the procedure:
## Create a fine grid of new predictor values
xv <- with(brains, seq(min(head.size), max(head.size), length.out = 100)) # xv is short for x-values (new set of predictor values)

## Get model predictions based on the new x-values
preds <- as.data.frame(predict(m1, newdata = data.frame(head.size = xv), interval = "confidence"))
plot(brain.weight ~ head.size, data = brains, ylab = "Brain weight (g)", xlab = expression(paste("Brain size (cm"^3, ")")))
text(x = 2800, y = 1600, labels = expression(paste(italic("r")^2, " = 0.99 ***")), adj = 0)
lines(xv, preds$fit, col = "white", lwd = 2.5) # Graphical trick: add a thick white regression line first to create a white edge, which makes the following black line more conspicuous!
lines(xv, preds$fit) # the regression line
lines(xv, preds$upr, lty = 2) # upper 95% CI
lines(xv, preds$lwr, lty = 2) # lower 95% CI

For graph enthusiasts, we can fancy things up by using the polygon function to plot the 95% confidence interval as a grey area underlying the points and the regression line.

plot(brain.weight ~ head.size, data = brains, ylab = "Brain weight (g)", xlab = expression(paste("Brain size (cm"^3, ")")))
text(x = 2800, y = 1600, labels = expression(paste(italic("r")^2, " = 0.99 ***")), adj = 0)
polygon(x = c(xv, rev(xv)), y = c(preds$upr, rev(preds$lwr)), col = "grey", border = NA) # this plot over the points, so we'll have to replot those again...
points(brain.weight ~ head.size, data = brains, pch = 21, bg = "black", col = "white", lwd = 0.7, cex = 1.1) # makes black points with tiny white margin to enhance contrast in overlapping points
lines(xv, preds$fit, col = "white", lwd = 2.5) # Graphical trick: add a thick white regression line first to create a white edge, which makes the following black line more conspicuous!
lines(xv, preds$fit) # the regression line