I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 5 in STAT 501 is Multiple Linear Regression. The SAS version of the code below is posted on my blog.
This lesson extends the concepts of SLR to multiple Linear Regression (MLR).
The first step in exploratory data analysis is plotting the data. Create a scatter plot matrix.
The concepts for the MLR model are similar to those of the SLR. One addition is the adjusted \(R^2\) which does not necessarily increase as more predictors are added, can be useful to identify which predictors should be included in a model. \(Adj R^2 = 1 - ((n-1)/(n-p)(1-R^2)\).
In general, the interpretation of a slope in multiple regression can be tricky. Correlations among the predictors can change the slope values dramatically from what they would be in separate simple regressions.
The first step in exploratory data analysis is plotting the data. Create a scatter plot matrix using the pairs
function.
Create a panel of plots using the grid.arrange
function.
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# iqsize is a tab-delimited data file.
# Performance IQ scores, brain size, height, and weight of n=38 college students.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/iqsize.txt'
iqsize <- read.table(url, sep = '\t', header = TRUE)
#head(iqsize)
#summarise(iqsize, n=n())
# babybirds is a tab-delimited data file.
# Venillation, O2, and CO2 for sample of n=120 nestling bank swallows.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/babybirds.txt'
babybirds <- read.table(url, sep = '\t', header = TRUE)
#head(babybirds)
#summarise(babybirds, n=n())
# pastry is a tab-delimited data file.
# Taster rating, moisture level, and sweetness level for sample of n=16 pasteries.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/pastry.txt'
pastry <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(pastry)
#summarise(pastry, n=n())
# stat_females is a tab-delimited data file.
# Height (inches), mother's height, and fathers height for n=214 female college statistics students.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/stat_females.txt'
stat_females <- read.table(url, sep = '\t', header = TRUE)
#head(stat_females)
#summarise(stat_females, n=n())
# hospital_infct is a tab-delimited data file.
# Infection risk, lenghth of hospital stay, average patient age, and number of x-rays at n=113 hospitals.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/hospital_infct.txt'
hospital_infct <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(hospital_infct)
#summarise(hospital_infct, n=n())
# bodyfat is a tab-delimited data file.
# Body fat, triceps thickness, thigh circumference, and midarm circumference for sample of n=20 individuals.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/bodyfat.txt'
bodyfat <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(bodyfat)
#summarise(bodyfat, n=n())
Consider the research question, is brain size and body size predictive of intelligence? Data set iqsize
contains performance IQ scores PIQ
, brain size from MRI scans Size
(count/10,000), height Height
(inches), and weight weight
(pounds) for n=38 college students.
A common way of investigating the relationships among all of the variables is by way of a “scatter plot matrix.” Height
and Weight
appear to be correlated, and so does Brain
with Height
and with Weight
. PIQ
appears to be correlated with Brain
, but not so much with Height
or Weight
.
pairs(iqsize)
PIQ
~ Brain
+ Height
+ Weight
. What can we conclude?R2=.2949, so 29.49% of the variation in intelligence is reduced by taking into account brain size, height and weight.
R-sq(adj)=.2327. This will come in handy for model selection.
The parameters for Brain
(2.06, P=0.0009) and Height
(-2.73, P=0.0330) are significantly different from 0, while Weight
(5.6E=4, P=0.9978) is not.
Brain
, Height
and Weight
is more useful in predicting PIQ
than not taking into account the three predictors.
iqsize.lm <- lm(PIQ ~ Brain + Height + Weight, data = iqsize)
summary(iqsize.lm)
##
## Call:
## lm(formula = PIQ ~ Brain + Height + Weight, data = iqsize)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.74 -12.09 -3.84 14.17 51.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.114e+02 6.297e+01 1.768 0.085979 .
## Brain 2.060e+00 5.634e-01 3.657 0.000856 ***
## Height -2.732e+00 1.229e+00 -2.222 0.033034 *
## Weight 5.599e-04 1.971e-01 0.003 0.997750
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.79 on 34 degrees of freedom
## Multiple R-squared: 0.2949, Adjusted R-squared: 0.2327
## F-statistic: 4.741 on 3 and 34 DF, p-value: 0.007215
Consider the research question, do nestling bank swallows, which live in underground burrows, also alter how they breathe? Data set babybirds
contains volume of air breathed per minute for each of 6 nestling bank swallows Vent
, percentage of oxygen in the air O2
, and percentage of carbon dioxide in the air CO2
for n=120 nestling bank swallows.
Vent
and O2
.
Vent
and CO2
is curved and with increasing error variance.
O2
and CO2
.
pairs(babybirds)
A reasonable first-order model with two quantitative predictors is \(y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + e_i\) where the independent error terms \(e_i\) follow a normal distribution with mean 0 and equal variance \(\sigma^2\).
From this equation we can conduct hypothesis tests on the parameter estimators, or calculate a confidence interval for a mean response at specified explantory variable values.
\(R^2\) is .2682, so only 26.82% of the variation in Vent
is reduced by taking into account O2
and CO2
.
The parameter estimator for O2
(-5.330, p=.408) is not significantly different from zero, but the parameter estimator for CO2
(31.103, p<.0001) is significantly different from 0.
The analysis of variance F-test (p<0.0001) suggests that the model containing O2
and CO2
is more useful in predicting Vent
than not taking into account the two predictors.
The mean minute ventilation of all nestling bank swallows whose breathing air is comprised of 15% oxygen and 5% carbon dioxide is 161 with 95% confidence interval (130, 193).
babybirds.lm <- lm(Vent ~ O2 + CO2, data = babybirds)
summary(babybirds.lm)
##
## Call:
## lm(formula = Vent ~ O2 + CO2, data = babybirds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -356.57 -96.50 8.72 84.68 422.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.901 106.006 0.810 0.419
## O2 -5.330 6.425 -0.830 0.408
## CO2 31.103 4.789 6.495 2.1e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.4 on 117 degrees of freedom
## Multiple R-squared: 0.2682, Adjusted R-squared: 0.2557
## F-statistic: 21.44 on 2 and 117 DF, p-value: 1.169e-08
predict(babybirds.lm, data.frame(O2=15, CO2=5), interval="confidence")
## fit lwr upr
## 1 161.4647 129.9396 192.9898
The concepts for the MLR model are similar to those of the SLR. One addition is the adjusted \(R^2\) which does not necessarily increase as more predictors are added, can be useful to identify which predictors should be included in a model. \(Adj R^2 = 1 - ((n-1)/(n-p)(1-R^2)\).
In general, the interpretation of a slope in multiple regression can be tricky. Correlations among the predictors can change the slope values dramatically from what they would be in separate simple regressions.
In matrix notation, \(b=(X'X)^{-1}X'Y\).
Consider the research question, how does moisture content and sweetness of a pastry affect a taster’s rating of the productd? Data set pastry
contains taster rating Rating
, moisture level Moisture
(4 levels), and sweetness level Sweetness
(2 levels) for n=16 pasteries.
Here the best way to look at the categorical level is with overlayed scatterplots. There is a linear relationship between rating and moisture and there is also a sweetness difference.
pastry.s2 <- pastry %>% filter(Sweetness == 2)
pastry.s4 <- pastry %>% filter(Sweetness == 4)
ggplot(pastry.s2, aes(y=Rating, x=Moisture)) +
geom_point(aes(y=pastry.s2$Rating, x=pastry.s2$Moisture),
color="red") +
geom_smooth(method = "lm", data=pastry.s2, se = FALSE, colour="red",
linetype=2, size=.5) +
geom_point(aes(y=pastry.s4$Rating, x=pastry.s4$Moisture),
color="blue") +
geom_smooth(method = "lm", data=pastry.s4, se = FALSE, colour="blue",
linetype=2, size=.5)
Here are linear regression models, two simple ones for Moisture
and for Sweetness
, and one multiple regression.
The Moisture
coefficient is 4.4250 in both the SLR and MLR, and the Sweetness
coefficient is 4.375 in both the SLR and MLR. This only happens because Moisture
and Sweetness
are not correlated, so the estimated slopes are independent of each other.
The R2 for the MLR, .9521, is the sum of the R2 for the two SLR’s (.7964 and .1557), again because Moisture
and Sweetness
are not correlated.
Sweetness
is not statistically significant in the SLR (p=0.13), but it is in the MLR. Including both variables in the equation reduced the standard deviation of the residuals (notice the S values). This in turn reduces the standard errors of the coefficients, leading to greater t-values and smaller p-values.
pastry.lm1 <- lm(Rating ~ Moisture, data = pastry)
pastry.lm2 <- lm(Rating ~ Sweetness, data = pastry)
pastry.lm3 <- lm(Rating ~ Moisture + Sweetness, data = pastry)
summary(pastry.lm1)
##
## Call:
## lm(formula = Rating ~ Moisture, data = pastry)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.475 -4.688 -0.100 4.638 7.525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.775 4.395 11.554 1.52e-08 ***
## Moisture 4.425 0.598 7.399 3.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.349 on 14 degrees of freedom
## Multiple R-squared: 0.7964, Adjusted R-squared: 0.7818
## F-statistic: 54.75 on 1 and 14 DF, p-value: 3.356e-06
summary(pastry.lm2)
##
## Call:
## lm(formula = Rating ~ Sweetness, data = pastry)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.375 -7.312 -0.125 8.688 16.625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.625 8.610 7.970 1.43e-06 ***
## Sweetness 4.375 2.723 1.607 0.13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.89 on 14 degrees of freedom
## Multiple R-squared: 0.1557, Adjusted R-squared: 0.09539
## F-statistic: 2.582 on 1 and 14 DF, p-value: 0.1304
summary(pastry.lm3)
##
## Call:
## lm(formula = Rating ~ Moisture + Sweetness, data = pastry)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.400 -1.762 0.025 1.587 4.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.6500 2.9961 12.566 1.20e-08 ***
## Moisture 4.4250 0.3011 14.695 1.78e-09 ***
## Sweetness 4.3750 0.6733 6.498 2.01e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared: 0.9521, Adjusted R-squared: 0.9447
## F-statistic: 129.1 on 2 and 13 DF, p-value: 2.658e-09
Consider the research question, are female heights more closely predicted by mother’s height or by father’s height? Data set stat_females
contains heights of female college students Height
(inches), and mother’s height momheight
and father’s height dadheight
for n=214 students.
A side-by-side scatterplot shows a moderate positive association with a straight-line pattern and no notable outliers for both momheight
and dadheight
explanatory variables.
stat_females.plot1 <- ggplot(stat_females, aes(y=Height, x=momheight)) +
geom_point() +
xlab("Student Height (inches)") +
ggtitle("Mom's Height (inches)")
stat_females.plot2 <- ggplot(stat_females, aes(y=Height, x=dadheight)) +
geom_point() +
xlab("Student Height (inches)") +
ggtitle("Dad's Height (inches)")
grid.arrange(stat_females.plot1, stat_females.plot2, ncol=2)
Regress Height
~ momheight
+ dadheight
. Evaluate the model assumptions with a residuals plot. It looks about as it should - a random horizontal band of points. Other residual analyses can be done exactly as we did for simple regression. For instance, we might wish to examine a normal probability plot of the residuals. Additional plots to consider are plots of residuals versus each x-variable separately. This might help us identify sources of curvature or non-constant variance.
stat_females.lm <- lm(Height ~ momheight + dadheight, data = stat_females)
stat_females.res <- data.frame(res=resid(stat_females.lm))
stat_females.fit <- data.frame(fit=fitted(stat_females.lm))
stat_females <- cbind(stat_females, stat_females.res, stat_females.fit)
summary(stat_females.lm)
##
## Call:
## lm(formula = Height ~ momheight + dadheight, data = stat_females)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2748 -1.5562 -0.0372 1.4721 5.7993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.54725 3.69278 5.023 1.08e-06 ***
## momheight 0.30351 0.05446 5.573 7.61e-08 ***
## dadheight 0.38786 0.04721 8.216 2.10e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.031 on 211 degrees of freedom
## Multiple R-squared: 0.4335, Adjusted R-squared: 0.4281
## F-statistic: 80.73 on 2 and 211 DF, p-value: < 2.2e-16
ggplot(data=stat_females, aes(y = res, x = fit)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
xlab("Fitted Value") +
ylab("Residual") +
ggtitle("Residuals vs. Fits Plot")
Consider the research question, what is the best predictor of infection risk in a hospital? Data set hospital_infct
contains infection risk InfctRsk
, length of stay Stay
, average age Age
, and number of xrays Xray
for n=113 hospitals.
Regress InfctRsk
~ Stay
+ Age
+ Xray
. Here we just note that the Age
coefficient is not statistically significant (p=.3301).
hospital_infct.lm <- lm(InfctRsk ~ Stay + Age + Xray, data = hospital_infct)
summary(hospital_infct.lm)
##
## Call:
## lm(formula = InfctRsk ~ Stay + Age + Xray, data = hospital_infct)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77320 -0.73779 -0.03345 0.73308 2.56331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.001162 1.314724 0.761 0.448003
## Stay 0.308181 0.059396 5.189 9.88e-07 ***
## Age -0.023005 0.023516 -0.978 0.330098
## Xray 0.019661 0.005759 3.414 0.000899 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.085 on 109 degrees of freedom
## Multiple R-squared: 0.363, Adjusted R-squared: 0.3455
## F-statistic: 20.7 on 3 and 109 DF, p-value: 1.087e-10
Consider the research question, what is the best predictor of percent body fat? Data set bodyfat
contains body fat Bodyfat
, triceps thickness Triceps
, thigh circumference Thigh
, and midarm circumference Midarm
for a sample of n=20 individuals.
Regress Bodyfat
~ Triceps
+ Thigh
+ Midarm
.
bodyfat.lm <- lm(Bodyfat ~ Triceps + Thigh + Midarm, data = bodyfat)
summary(bodyfat.lm)
##
## Call:
## lm(formula = Bodyfat ~ Triceps + Thigh + Midarm, data = bodyfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7263 -1.6111 0.3923 1.4656 4.1277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.085 99.782 1.173 0.258
## Triceps 4.334 3.016 1.437 0.170
## Thigh -2.857 2.582 -1.106 0.285
## Midarm -2.186 1.595 -1.370 0.190
##
## Residual standard error: 2.48 on 16 degrees of freedom
## Multiple R-squared: 0.8014, Adjusted R-squared: 0.7641
## F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06