I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 5 in STAT 501 is Multiple Linear Regression. The SAS version of the code below is posted on my blog.

Overview

This lesson extends the concepts of SLR to multiple Linear Regression (MLR).

R Concepts Worth Noting

  • The first step in exploratory data analysis is plotting the data. Create a scatter plot matrix using the pairs function.

  • Create a panel of plots using the grid.arrange function.

Data Management for this Module

library(readr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# iqsize is a tab-delimited data file.
# Performance IQ scores, brain size, height, and weight of n=38 college students. 
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/iqsize.txt'
iqsize <- read.table(url, sep = '\t', header = TRUE)
#head(iqsize)
#summarise(iqsize, n=n())
# babybirds is a tab-delimited data file.
# Venillation, O2, and CO2 for sample of n=120 nestling bank swallows. 
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/babybirds.txt'
babybirds <- read.table(url, sep = '\t', header = TRUE)
#head(babybirds)
#summarise(babybirds, n=n())
# pastry is a tab-delimited data file.
# Taster rating, moisture level, and sweetness level for sample of n=16 pasteries. 
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/pastry.txt'
pastry <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(pastry)
#summarise(pastry, n=n())
# stat_females is a tab-delimited data file.
# Height (inches), mother's height, and fathers height for n=214 female college statistics students. 
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/stat_females.txt'
stat_females <- read.table(url, sep = '\t', header = TRUE)
#head(stat_females)
#summarise(stat_females, n=n())
# hospital_infct is a tab-delimited data file.
# Infection risk, lenghth of hospital stay, average patient age, and number of x-rays at n=113 hospitals.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/hospital_infct.txt'
hospital_infct <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(hospital_infct)
#summarise(hospital_infct, n=n())
# bodyfat is a tab-delimited data file.
# Body fat, triceps thickness, thigh circumference, and midarm circumference for sample of n=20 individuals.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/bodyfat.txt'
bodyfat <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(bodyfat)
#summarise(bodyfat, n=n())

5.1 Example on IQ and Physical Characteristics

Consider the research question, is brain size and body size predictive of intelligence? Data set iqsize contains performance IQ scores PIQ, brain size from MRI scans Size (count/10,000), height Height (inches), and weight weight (pounds) for n=38 college students.

A common way of investigating the relationships among all of the variables is by way of a “scatter plot matrix.” Height and Weight appear to be correlated, and so does Brain with Height and with Weight. PIQ appears to be correlated with Brain, but not so much with Height or Weight.

pairs(iqsize)

Run a regression PIQ ~ Brain + Height + Weight. What can we conclude?
iqsize.lm <- lm(PIQ ~ Brain + Height + Weight, data = iqsize)
summary(iqsize.lm)
## 
## Call:
## lm(formula = PIQ ~ Brain + Height + Weight, data = iqsize)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.74 -12.09  -3.84  14.17  51.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.114e+02  6.297e+01   1.768 0.085979 .  
## Brain        2.060e+00  5.634e-01   3.657 0.000856 ***
## Height      -2.732e+00  1.229e+00  -2.222 0.033034 *  
## Weight       5.599e-04  1.971e-01   0.003 0.997750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.79 on 34 degrees of freedom
## Multiple R-squared:  0.2949, Adjusted R-squared:  0.2327 
## F-statistic: 4.741 on 3 and 34 DF,  p-value: 0.007215

5.2 Example on Underground Air Quality

Consider the research question, do nestling bank swallows, which live in underground burrows, also alter how they breathe? Data set babybirds contains volume of air breathed per minute for each of 6 nestling bank swallows Vent, percentage of oxygen in the air O2, and percentage of carbon dioxide in the air CO2 for n=120 nestling bank swallows.

The scatterplot matrix suggests:
pairs(babybirds)

A reasonable first-order model with two quantitative predictors is \(y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + e_i\) where the independent error terms \(e_i\) follow a normal distribution with mean 0 and equal variance \(\sigma^2\).

From this equation we can conduct hypothesis tests on the parameter estimators, or calculate a confidence interval for a mean response at specified explantory variable values.

5.3 The Multiple Linear Regression Model

The concepts for the MLR model are similar to those of the SLR. One addition is the adjusted \(R^2\) which does not necessarily increase as more predictors are added, can be useful to identify which predictors should be included in a model. \(Adj R^2 = 1 - ((n-1)/(n-p)(1-R^2)\).

In general, the interpretation of a slope in multiple regression can be tricky. Correlations among the predictors can change the slope values dramatically from what they would be in separate simple regressions.

5.4 A Matrix Formulation of the Multiple Regression Model

In matrix notation, \(b=(X'X)^{-1}X'Y\).

5.5 Further Examples

Example 1: Pastry Sweetness Data

Consider the research question, how does moisture content and sweetness of a pastry affect a taster’s rating of the productd? Data set pastry contains taster rating Rating, moisture level Moisture (4 levels), and sweetness level Sweetness (2 levels) for n=16 pasteries.

Here the best way to look at the categorical level is with overlayed scatterplots. There is a linear relationship between rating and moisture and there is also a sweetness difference.

pastry.s2 <- pastry %>% filter(Sweetness == 2)
pastry.s4 <- pastry %>% filter(Sweetness == 4)
ggplot(pastry.s2, aes(y=Rating, x=Moisture)) +
  geom_point(aes(y=pastry.s2$Rating, x=pastry.s2$Moisture), 
             color="red") +
  geom_smooth(method = "lm", data=pastry.s2, se = FALSE, colour="red", 
              linetype=2, size=.5) +
  geom_point(aes(y=pastry.s4$Rating, x=pastry.s4$Moisture), 
             color="blue") +
  geom_smooth(method = "lm", data=pastry.s4, se = FALSE, colour="blue", 
              linetype=2, size=.5)

Here are linear regression models, two simple ones for Moisture and for Sweetness, and one multiple regression.

  • The Moisture coefficient is 4.4250 in both the SLR and MLR, and the Sweetness coefficient is 4.375 in both the SLR and MLR. This only happens because Moisture and Sweetness are not correlated, so the estimated slopes are independent of each other.

  • The R2 for the MLR, .9521, is the sum of the R2 for the two SLR’s (.7964 and .1557), again because Moisture and Sweetness are not correlated.

  • Sweetness is not statistically significant in the SLR (p=0.13), but it is in the MLR. Including both variables in the equation reduced the standard deviation of the residuals (notice the S values). This in turn reduces the standard errors of the coefficients, leading to greater t-values and smaller p-values.
pastry.lm1 <- lm(Rating ~ Moisture, data = pastry)
pastry.lm2 <- lm(Rating ~ Sweetness, data = pastry)
pastry.lm3 <- lm(Rating ~ Moisture + Sweetness, data = pastry)
summary(pastry.lm1)
## 
## Call:
## lm(formula = Rating ~ Moisture, data = pastry)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.475 -4.688 -0.100  4.638  7.525 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   50.775      4.395  11.554 1.52e-08 ***
## Moisture       4.425      0.598   7.399 3.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.349 on 14 degrees of freedom
## Multiple R-squared:  0.7964, Adjusted R-squared:  0.7818 
## F-statistic: 54.75 on 1 and 14 DF,  p-value: 3.356e-06
summary(pastry.lm2)
## 
## Call:
## lm(formula = Rating ~ Sweetness, data = pastry)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.375  -7.312  -0.125   8.688  16.625 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   68.625      8.610   7.970 1.43e-06 ***
## Sweetness      4.375      2.723   1.607     0.13    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.89 on 14 degrees of freedom
## Multiple R-squared:  0.1557, Adjusted R-squared:  0.09539 
## F-statistic: 2.582 on 1 and 14 DF,  p-value: 0.1304
summary(pastry.lm3)
## 
## Call:
## lm(formula = Rating ~ Moisture + Sweetness, data = pastry)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.400 -1.762  0.025  1.587  4.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
## Moisture      4.4250     0.3011  14.695 1.78e-09 ***
## Sweetness     4.3750     0.6733   6.498 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared:  0.9521, Adjusted R-squared:  0.9447 
## F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09

Example 2: Female Stat Students

Consider the research question, are female heights more closely predicted by mother’s height or by father’s height? Data set stat_females contains heights of female college students Height (inches), and mother’s height momheight and father’s height dadheight for n=214 students.

A side-by-side scatterplot shows a moderate positive association with a straight-line pattern and no notable outliers for both momheight and dadheight explanatory variables.

stat_females.plot1 <- ggplot(stat_females, aes(y=Height, x=momheight)) +
  geom_point() + 
  xlab("Student Height (inches)") +
  ggtitle("Mom's Height (inches)")
stat_females.plot2 <- ggplot(stat_females, aes(y=Height, x=dadheight)) +
  geom_point() + 
  xlab("Student Height (inches)") +
  ggtitle("Dad's Height (inches)")
grid.arrange(stat_females.plot1, stat_females.plot2, ncol=2)

Regress Height ~ momheight + dadheight. Evaluate the model assumptions with a residuals plot. It looks about as it should - a random horizontal band of points. Other residual analyses can be done exactly as we did for simple regression. For instance, we might wish to examine a normal probability plot of the residuals. Additional plots to consider are plots of residuals versus each x-variable separately. This might help us identify sources of curvature or non-constant variance.

stat_females.lm <- lm(Height ~ momheight + dadheight, data = stat_females)
stat_females.res <- data.frame(res=resid(stat_females.lm))
stat_females.fit <- data.frame(fit=fitted(stat_females.lm))
stat_females <- cbind(stat_females, stat_females.res, stat_females.fit)
summary(stat_females.lm)
## 
## Call:
## lm(formula = Height ~ momheight + dadheight, data = stat_females)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2748 -1.5562 -0.0372  1.4721  5.7993 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.54725    3.69278   5.023 1.08e-06 ***
## momheight    0.30351    0.05446   5.573 7.61e-08 ***
## dadheight    0.38786    0.04721   8.216 2.10e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.031 on 211 degrees of freedom
## Multiple R-squared:  0.4335, Adjusted R-squared:  0.4281 
## F-statistic: 80.73 on 2 and 211 DF,  p-value: < 2.2e-16
ggplot(data=stat_females, aes(y = res, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  xlab("Fitted Value") +
  ylab("Residual") +
  ggtitle("Residuals vs. Fits Plot")

Example 3: Hospital Data

Consider the research question, what is the best predictor of infection risk in a hospital? Data set hospital_infct contains infection risk InfctRsk, length of stay Stay, average age Age, and number of xrays Xray for n=113 hospitals.

Regress InfctRsk ~ Stay + Age + Xray. Here we just note that the Age coefficient is not statistically significant (p=.3301).

hospital_infct.lm <- lm(InfctRsk ~ Stay + Age + Xray, data = hospital_infct)
summary(hospital_infct.lm)
## 
## Call:
## lm(formula = InfctRsk ~ Stay + Age + Xray, data = hospital_infct)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.77320 -0.73779 -0.03345  0.73308  2.56331 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.001162   1.314724   0.761 0.448003    
## Stay         0.308181   0.059396   5.189 9.88e-07 ***
## Age         -0.023005   0.023516  -0.978 0.330098    
## Xray         0.019661   0.005759   3.414 0.000899 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.085 on 109 degrees of freedom
## Multiple R-squared:  0.363,  Adjusted R-squared:  0.3455 
## F-statistic:  20.7 on 3 and 109 DF,  p-value: 1.087e-10

Example 4: Physiological Measurements Data

Consider the research question, what is the best predictor of percent body fat? Data set bodyfat contains body fat Bodyfat, triceps thickness Triceps, thigh circumference Thigh, and midarm circumference Midarm for a sample of n=20 individuals.

Regress Bodyfat ~ Triceps + Thigh + Midarm.

bodyfat.lm <- lm(Bodyfat ~ Triceps + Thigh + Midarm, data = bodyfat)
summary(bodyfat.lm)
## 
## Call:
## lm(formula = Bodyfat ~ Triceps + Thigh + Midarm, data = bodyfat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7263 -1.6111  0.3923  1.4656  4.1277 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  117.085     99.782   1.173    0.258
## Triceps        4.334      3.016   1.437    0.170
## Thigh         -2.857      2.582  -1.106    0.285
## Midarm        -2.186      1.595  -1.370    0.190
## 
## Residual standard error: 2.48 on 16 degrees of freedom
## Multiple R-squared:  0.8014, Adjusted R-squared:  0.7641 
## F-statistic: 21.52 on 3 and 16 DF,  p-value: 7.343e-06