5 Multiple Linear Regression

I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 5 in STAT 501 is Multiple Linear Regression. The SAS version of the code below is posted on my blog.

Overview

This lesson extends the concepts of SLR to multiple Linear Regression (MLR).

The first step in exploratory data analysis is plotting the data. Create a scatter plot matrix.
The concepts for the MLR model are similar to those of the SLR. One addition is the adjusted \(R^2\) which does not necessarily increase as more predictors are added, can be useful to identify which predictors should be included in a model. \(Adj R^2 = 1 - ((n-1)/(n-p)(1-R^2)\).
In general, the interpretation of a slope in multiple regression can be tricky. Correlations among the predictors can change the slope values dramatically from what they would be in separate simple regressions.

R Concepts Worth Noting

The first step in exploratory data analysis is plotting the data. Create a scatter plot matrix using the pairs function.
Create a panel of plots using the grid.arrange function.

Data Management for this Module

library(readr)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(alr3)

## Warning: package 'alr3' was built under R version 3.4.3

## Loading required package: car

## Warning: package 'car' was built under R version 3.4.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

# iqsize is a tab-delimited data file.
# Performance IQ scores, brain size, height, and weight of n=38 college students. 
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/iqsize.txt'
iqsize <- read.table(url, sep = '\t', header = TRUE)
#head(iqsize)
#summarise(iqsize, n=n())

# babybirds is a tab-delimited data file.
# Venillation, O2, and CO2 for sample of n=120 nestling bank swallows. 
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/babybirds.txt'
babybirds <- read.table(url, sep = '\t', header = TRUE)
#head(babybirds)
#summarise(babybirds, n=n())

# pastry is a tab-delimited data file.
# Taster rating, moisture level, and sweetness level for sample of n=16 pasteries. 
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/pastry.txt'
pastry <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(pastry)
#summarise(pastry, n=n())

# stat_females is a tab-delimited data file.
# Height (inches), mother's height, and fathers height for n=214 female college statistics students. 
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/stat_females.txt'
stat_females <- read.table(url, sep = '\t', header = TRUE)
#head(stat_females)
#summarise(stat_females, n=n())

# hospital_infct is a tab-delimited data file.
# Infection risk, lenghth of hospital stay, average patient age, and number of x-rays at n=113 hospitals.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/hospital_infct.txt'
hospital_infct <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(hospital_infct)
#summarise(hospital_infct, n=n())

# bodyfat is a tab-delimited data file.
# Body fat, triceps thickness, thigh circumference, and midarm circumference for sample of n=20 individuals.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/bodyfat.txt'
bodyfat <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(bodyfat)
#summarise(bodyfat, n=n())

5.1 Example on IQ and Physical Characteristics

Consider the research question, is brain size and body size predictive of intelligence? Data set iqsize contains performance IQ scores PIQ, brain size from MRI scans Size (count/10,000), height Height (inches), and weight weight (pounds) for n=38 college students.

A common way of investigating the relationships among all of the variables is by way of a “scatter plot matrix.” Height and Weight appear to be correlated, and so does Brain with Height and with Weight. PIQ appears to be correlated with Brain, but not so much with Height or Weight.

pairs(iqsize)

Run a regression PIQ ~ Brain + Height + Weight. What can we conclude?

R2=.2949, so 29.49% of the variation in intelligence is reduced by taking into account brain size, height and weight.
R-sq(adj)=.2327. This will come in handy for model selection.
The parameters for Brain (2.06, P=0.0009) and Height (-2.73, P=0.0330) are significantly different from 0, while Weight (5.6E=4, P=0.9978) is not.
The ANOVA F-test (P=0.0072) suggests that the model containing Brain, Height and Weight is more useful in predicting PIQ than not taking into account the three predictors.

iqsize.lm <- lm(PIQ ~ Brain + Height + Weight, data = iqsize)
summary(iqsize.lm)

## 
## Call:
## lm(formula = PIQ ~ Brain + Height + Weight, data = iqsize)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.74 -12.09  -3.84  14.17  51.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.114e+02  6.297e+01   1.768 0.085979 .  
## Brain        2.060e+00  5.634e-01   3.657 0.000856 ***
## Height      -2.732e+00  1.229e+00  -2.222 0.033034 *  
## Weight       5.599e-04  1.971e-01   0.003 0.997750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.79 on 34 degrees of freedom
## Multiple R-squared:  0.2949, Adjusted R-squared:  0.2327 
## F-statistic: 4.741 on 3 and 34 DF,  p-value: 0.007215

5.2 Example on Underground Air Quality

Consider the research question, do nestling bank swallows, which live in underground burrows, also alter how they breathe? Data set babybirds contains volume of air breathed per minute for each of 6 nestling bank swallows Vent, percentage of oxygen in the air O2, and percentage of carbon dioxide in the air CO2 for n=120 nestling bank swallows.

The scatterplot matrix suggests:

Insubstantial relationship between Vent and O2.
Relationship between Vent and CO2 is curved and with increasing error variance.
There is no correlation between the two explanatory variables O2 and CO2.

pairs(babybirds)

A reasonable first-order model with two quantitative predictors is \(y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + e_i\) where the independent error terms \(e_i\) follow a normal distribution with mean 0 and equal variance \(\sigma^2\).

From this equation we can conduct hypothesis tests on the parameter estimators, or calculate a confidence interval for a mean response at specified explantory variable values.

\(R^2\) is .2682, so only 26.82% of the variation in Vent is reduced by taking into account O2 and CO2.
The parameter estimator for O2 (-5.330, p=.408) is not significantly different from zero, but the parameter estimator for CO2 (31.103, p<.0001) is significantly different from 0.
The analysis of variance F-test (p<0.0001) suggests that the model containing O2 and CO2 is more useful in predicting Vent than not taking into account the two predictors.

The mean minute ventilation of all nestling bank swallows whose breathing air is comprised of 15% oxygen and 5% carbon dioxide is 161 with 95% confidence interval (130, 193).

babybirds.lm <- lm(Vent ~ O2 + CO2, data = babybirds)
summary(babybirds.lm)

## 
## Call:
## lm(formula = Vent ~ O2 + CO2, data = babybirds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -356.57  -96.50    8.72   84.68  422.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   85.901    106.006   0.810    0.419    
## O2            -5.330      6.425  -0.830    0.408    
## CO2           31.103      4.789   6.495  2.1e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 157.4 on 117 degrees of freedom
## Multiple R-squared:  0.2682, Adjusted R-squared:  0.2557 
## F-statistic: 21.44 on 2 and 117 DF,  p-value: 1.169e-08

predict(babybirds.lm, data.frame(O2=15, CO2=5), interval="confidence")

##        fit      lwr      upr
## 1 161.4647 129.9396 192.9898

5.3 The Multiple Linear Regression Model

The concepts for the MLR model are similar to those of the SLR. One addition is the adjusted \(R^2\) which does not necessarily increase as more predictors are added, can be useful to identify which predictors should be included in a model. \(Adj R^2 = 1 - ((n-1)/(n-p)(1-R^2)\).

In general, the interpretation of a slope in multiple regression can be tricky. Correlations among the predictors can change the slope values dramatically from what they would be in separate simple regressions.

5.4 A Matrix Formulation of the Multiple Regression Model

In matrix notation, \(b=(X'X)^{-1}X'Y\).

5.5 Further Examples

Example 1: Pastry Sweetness Data

Consider the research question, how does moisture content and sweetness of a pastry affect a taster’s rating of the productd? Data set pastry contains taster rating Rating, moisture level Moisture (4 levels), and sweetness level Sweetness (2 levels) for n=16 pasteries.

Here the best way to look at the categorical level is with overlayed scatterplots. There is a linear relationship between rating and moisture and there is also a sweetness difference.

pastry.s2 <- pastry %>% filter(Sweetness == 2)
pastry.s4 <- pastry %>% filter(Sweetness == 4)
ggplot(pastry.s2, aes(y=Rating, x=Moisture)) +
  geom_point(aes(y=pastry.s2$Rating, x=pastry.s2$Moisture), 
             color="red") +
  geom_smooth(method = "lm", data=pastry.s2, se = FALSE, colour="red", 
              linetype=2, size=.5) +
  geom_point(aes(y=pastry.s4$Rating, x=pastry.s4$Moisture), 
             color="blue") +
  geom_smooth(method = "lm", data=pastry.s4, se = FALSE, colour="blue", 
              linetype=2, size=.5)

Here are linear regression models, two simple ones for Moisture and for Sweetness, and one multiple regression.

The Moisture coefficient is 4.4250 in both the SLR and MLR, and the Sweetness coefficient is 4.375 in both the SLR and MLR. This only happens because Moisture and Sweetness are not correlated, so the estimated slopes are independent of each other.
The R2 for the MLR, .9521, is the sum of the R2 for the two SLR’s (.7964 and .1557), again because Moisture and Sweetness are not correlated.
Sweetness is not statistically significant in the SLR (p=0.13), but it is in the MLR. Including both variables in the equation reduced the standard deviation of the residuals (notice the S values). This in turn reduces the standard errors of the coefficients, leading to greater t-values and smaller p-values.

pastry.lm1 <- lm(Rating ~ Moisture, data = pastry)
pastry.lm2 <- lm(Rating ~ Sweetness, data = pastry)
pastry.lm3 <- lm(Rating ~ Moisture + Sweetness, data = pastry)
summary(pastry.lm1)

## 
## Call:
## lm(formula = Rating ~ Moisture, data = pastry)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.475 -4.688 -0.100  4.638  7.525 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   50.775      4.395  11.554 1.52e-08 ***
## Moisture       4.425      0.598   7.399 3.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.349 on 14 degrees of freedom
## Multiple R-squared:  0.7964, Adjusted R-squared:  0.7818 
## F-statistic: 54.75 on 1 and 14 DF,  p-value: 3.356e-06

summary(pastry.lm2)

## 
## Call:
## lm(formula = Rating ~ Sweetness, data = pastry)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.375  -7.312  -0.125   8.688  16.625 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   68.625      8.610   7.970 1.43e-06 ***
## Sweetness      4.375      2.723   1.607     0.13    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.89 on 14 degrees of freedom
## Multiple R-squared:  0.1557, Adjusted R-squared:  0.09539 
## F-statistic: 2.582 on 1 and 14 DF,  p-value: 0.1304

summary(pastry.lm3)

## 
## Call:
## lm(formula = Rating ~ Moisture + Sweetness, data = pastry)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.400 -1.762  0.025  1.587  4.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
## Moisture      4.4250     0.3011  14.695 1.78e-09 ***
## Sweetness     4.3750     0.6733   6.498 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared:  0.9521, Adjusted R-squared:  0.9447 
## F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09

Example 2: Female Stat Students

Consider the research question, are female heights more closely predicted by mother’s height or by father’s height? Data set stat_females contains heights of female college students Height (inches), and mother’s height momheight and father’s height dadheight for n=214 students.

A side-by-side scatterplot shows a moderate positive association with a straight-line pattern and no notable outliers for both momheight and dadheight explanatory variables.

stat_females.plot1 <- ggplot(stat_females, aes(y=Height, x=momheight)) +
  geom_point() + 
  xlab("Student Height (inches)") +
  ggtitle("Mom's Height (inches)")
stat_females.plot2 <- ggplot(stat_females, aes(y=Height, x=dadheight)) +
  geom_point() + 
  xlab("Student Height (inches)") +
  ggtitle("Dad's Height (inches)")
grid.arrange(stat_females.plot1, stat_females.plot2, ncol=2)

Regress Height ~ momheight + dadheight. Evaluate the model assumptions with a residuals plot. It looks about as it should - a random horizontal band of points. Other residual analyses can be done exactly as we did for simple regression. For instance, we might wish to examine a normal probability plot of the residuals. Additional plots to consider are plots of residuals versus each x-variable separately. This might help us identify sources of curvature or non-constant variance.

stat_females.lm <- lm(Height ~ momheight + dadheight, data = stat_females)
stat_females.res <- data.frame(res=resid(stat_females.lm))
stat_females.fit <- data.frame(fit=fitted(stat_females.lm))
stat_females <- cbind(stat_females, stat_females.res, stat_females.fit)
summary(stat_females.lm)

## 
## Call:
## lm(formula = Height ~ momheight + dadheight, data = stat_females)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2748 -1.5562 -0.0372  1.4721  5.7993 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.54725    3.69278   5.023 1.08e-06 ***
## momheight    0.30351    0.05446   5.573 7.61e-08 ***
## dadheight    0.38786    0.04721   8.216 2.10e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.031 on 211 degrees of freedom
## Multiple R-squared:  0.4335, Adjusted R-squared:  0.4281 
## F-statistic: 80.73 on 2 and 211 DF,  p-value: < 2.2e-16

ggplot(data=stat_females, aes(y = res, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  xlab("Fitted Value") +
  ylab("Residual") +
  ggtitle("Residuals vs. Fits Plot")

Example 3: Hospital Data

Consider the research question, what is the best predictor of infection risk in a hospital? Data set hospital_infct contains infection risk InfctRsk, length of stay Stay, average age Age, and number of xrays Xray for n=113 hospitals.

Regress InfctRsk ~ Stay + Age + Xray. Here we just note that the Age coefficient is not statistically significant (p=.3301).

hospital_infct.lm <- lm(InfctRsk ~ Stay + Age + Xray, data = hospital_infct)
summary(hospital_infct.lm)

## 
## Call:
## lm(formula = InfctRsk ~ Stay + Age + Xray, data = hospital_infct)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.77320 -0.73779 -0.03345  0.73308  2.56331 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.001162   1.314724   0.761 0.448003    
## Stay         0.308181   0.059396   5.189 9.88e-07 ***
## Age         -0.023005   0.023516  -0.978 0.330098    
## Xray         0.019661   0.005759   3.414 0.000899 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.085 on 109 degrees of freedom
## Multiple R-squared:  0.363,  Adjusted R-squared:  0.3455 
## F-statistic:  20.7 on 3 and 109 DF,  p-value: 1.087e-10

Example 4: Physiological Measurements Data

Consider the research question, what is the best predictor of percent body fat? Data set bodyfat contains body fat Bodyfat, triceps thickness Triceps, thigh circumference Thigh, and midarm circumference Midarm for a sample of n=20 individuals.

Regress Bodyfat ~ Triceps + Thigh + Midarm.

bodyfat.lm <- lm(Bodyfat ~ Triceps + Thigh + Midarm, data = bodyfat)
summary(bodyfat.lm)

## 
## Call:
## lm(formula = Bodyfat ~ Triceps + Thigh + Midarm, data = bodyfat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7263 -1.6111  0.3923  1.4656  4.1277 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  117.085     99.782   1.173    0.258
## Triceps        4.334      3.016   1.437    0.170
## Thigh         -2.857      2.582  -1.106    0.285
## Midarm        -2.186      1.595  -1.370    0.190
## 
## Residual standard error: 2.48 on 16 degrees of freedom
## Multiple R-squared:  0.8014, Adjusted R-squared:  0.7641 
## F-statistic: 21.52 on 3 and 16 DF,  p-value: 7.343e-06