I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 7 in STAT 501 is MLR Estimation, Prediction & Model Assumptions. The SAS version of the code below is posted on my blog.
This lesson covers using the MLR to create a confidence interval for the mean response value and for the predicted value.
This lesson also covers evaluation of the model assumptions.
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.4.3
library(nortest)
# iqsize is a tab-delimited data file.
# PIQ, brain size, hight, and weight of a sample of n=38 students.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/iqsize.txt'
iqsize <- read.table(url, sep = '\t', header = TRUE)
#head(iqsize)
#summarise(iqsize, n=n())
# alcoholarm is a tab-delimited data file.
# Total lifetime consumption of alcohol and deltoid muscle strength of nondominant arm for n=50 alcoholic men.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/alcoholarm.txt'
alcoholarm <- read.table(url, sep = '\t', header = TRUE)
#head(alcoholarm)
#summarise(alcoholarm, n=n())
# allenstest is a tab-delimited data file.
# scores in Allen Cognitive Level (ACL), vocabulary, abstraction, and symbol-Digit Modalities Test (SDMT) for a sample of n=69 patients.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/allentest.txt'
allenstest <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(allenstest)
#summarise(allenstest, n=n())
# peru is a tab-delimited data file.
# Variables possibly relating to blood pressures of n = 39 Peruvians.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/peru.txt'
peru <- read.table(url, sep = '\t', header = TRUE)
peru$FracLife = peru$Years / peru$Age
#head(peru)
#summarise(peru, n=n())
# anscombe is a fixed-width data file.
#
physical <- read_fwf(
file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/Physical.txt",
skip = 1,
fwf_widths(c(3, 9, 8, 6, 9, 7, 8, 7, 10, 6, 3), c('Obs', 'Sex', 'Height', 'LeftArm', 'RtArm', 'LeftFoot', 'RtFoot', 'LeftHand', 'RtHand', 'HeadCirc', 'nose')))
## Parsed with column specification:
## cols(
## Obs = col_integer(),
## Sex = col_character(),
## Height = col_double(),
## LeftArm = col_double(),
## RtArm = col_double(),
## LeftFoot = col_double(),
## RtFoot = col_double(),
## LeftHand = col_double(),
## RtHand = col_double(),
## HeadCirc = col_double(),
## nose = col_double()
## )
#head(physical)
#summarise(physical, n=n())
The confidence interval for a mean response is \(\hat{y}_h \pm t_{(\alpha/2, n-p)} \times se(\hat{y}_h)\) where \(se(\hat{y}_h) = \sqrt{MSE(X_h^t(X^TX)^{-1}X_h)}\).
Here is an example with the iqsize
data set where we fit the model PIQ ~ Brain + Height.
iqsize.lm <- lm(PIQ ~ Brain + Height, data = iqsize)
predict(iqsize.lm, data.frame(Brain=90, Height=70),
interval="confidence")
## fit lwr upr
## 1 105.6391 98.23722 113.041
“We can be 95% confident that the average performance IQ score of all college students with brain size = 90 and height = 70 is between 98.24 and 113.04 counts per 10,000.”
Note the following from the formula for the confidence interval:The prediction interval for a mean response is \(\hat{y}_h \pm t_{(\alpha/2, n-p)} \times se)\) where \(se = \sqrt{MSE+se(\hat{y}_h)^2}\).
Here is an example with the iqsize
data set where we fit the model PIQ ~ Brain + Height.
iqsize.lm <- lm(PIQ ~ Brain + Height, data = iqsize)
predict(iqsize.lm, data.frame(Brain=90, Height=70),
interval="prediction")
## fit lwr upr
## 1 105.6391 65.34688 145.9314
“We can be 95% confident that the performance IQ score of a college student with brain size = 90 and height = 70 is between 65.35 and 145.93 counts per 10,000.”
Note the following from the formula for the predictions interval:The mean response \(E(Y_i)\) is a linear function of each set of predictors \(X_i\). I.e., \(E(e_i)=0\).
The errors \(e_i\) are independent.
The errors \(e_i\) are normally distributed at each set of predictors \(X_i\).
Consider the iqsize
regression of PIQ ~ Brain Height. Here is a residuals vs fits plot. The average of the residuals ~0, the variation of the residuals ~constant, and there are no excessively outlying points.
iqsize.lm <- lm(PIQ ~ Brain + Height, data = iqsize)
iqsize.res <- data.frame(res=resid(iqsize.lm))
iqsize.fit <- data.frame(fit=fitted(iqsize.lm))
iqsize <- cbind(iqsize, iqsize.res, iqsize.fit)
#summary(iqsize.lm)
ggplot(data=iqsize, aes(y = res, x = fit)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
xlab("Fitted Value") +
ylab("Residual") +
ggtitle("Residuals vs. Fits Plot \n(response is PIQ)")
There is no time (or space) variable in this dataset so the next plot we’ll consider is a residuals vs predictor scatterplot matrix, starting with Brain. The average of the residuals ~0, the variation of the residuals ~constant, and there are no excessively outlying points. Also, there is no strong nonlinear trend in this plot that might suggest a transformation of PIQ or Brain in this model.
ggplot(data=iqsize, aes(y = res, x = Brain)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
xlab("Brain") +
ylab("Residual") +
ggtitle("Residuals vs. Predictor \n(response is PIQ)")
The second residuals vs predictor scatterplot matrix is with Height. The average of the residuals ~0, the variation of the residuals ~constant, and there are no excessively outlying points. Also, there is no strong nonlinear trend in this plot that might suggest a transformation of PIQ or Brain in this model.
ggplot(data=iqsize, aes(y = res, x = Height)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
xlab("Height") +
ylab("Residual") +
ggtitle("Residuals vs. Predictor \n(response is PIQ)")
Next consider a residuals histogram. Although this doesn’t have the ideal bell-shaped appearance, given the small sample size there’s little to suggest violation of the normality assumption.
ggplot(data=iqsize, aes(x = res)) +
geom_histogram(binwidth = 10) +
xlab("Frequency") +
ylab("Residual") +
ggtitle("Residuals Histogram \n(response is PIQ)")
Since the appearance of a histogram can be strongly influenced by the choice of intervals for the bars, to confirm this we can also look at a normal probability plot of the residuals. Again, given the small sample size there’s little to suggest violation of the normality assumption.
qqnorm(iqsize$res); qqline(iqsize$res)
The final plot we’ll consider is a residuals vs omitted predictor plot for the excluded variable Weight
. Since there is no strong linear or simple nonlinear trend in this plot, there is nothing to suggest that Weight might be usefully added to the model.
ggplot(data=iqsize, aes(y = res, x = Weight)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
xlab("Weight") +
ylab("Residual") +
ggtitle("Residuals vs. Predictor \n(response is PIQ)")
There are error normality tests, \(H_0\): normal.
The Anderson-Darling Test measures the area between a fitted line (based on the chosen distribution) and a nonparametric step function (based on the plot points). Again, using the iqsize
regression of PIQ ~ Brain + Height, the Anderson-Darling test p-value is .6857, so do not reject \(H_0\) of normality.
ad.test(iqsize$res)
##
## Anderson-Darling normality test
##
## data: iqsize$res
## A = 0.26206, p-value = 0.6857
The Shapiro-Wilk Test compares its test statistic to a distribution. Using the iqsize
regression of PIQ ~ Brain + Height, the Shapiro-Wilk test p-value is .5764, so do not reject \(H_0\) of normality.
shapiro.test(iqsize$res)
##
## Shapiro-Wilk normality test
##
## data: iqsize$res
## W = 0.976, p-value = 0.5764
The Ryan-Joiner Test compares its test statistic to a distribution. I don’t know of an R implementation of this test.
The Kolmogorov-Smirnov Test compares its test statistic to a distribution. I don’t know of an R implementation of this test.
There are tests of constant variance. Visual test are usually sufficient, but these tests provide an added layer of justification.
Partition the residuals of observations into two groups - one consisting of residuals associated with the lowest predictor values and the other consisting of those belonging to the highest predictor values.
Group the residuals into g groups according to the values of the quantity on the horizontal axis of the residual plot.
The Breusch-Pagan test (also known as the Cook-Weisberg score test) is an alternative to the modified Levene test.
Bartlett’s test is highly sensitive to the normality assumption, so if the residuals do not appear normal (even after transformations), then this test should not be used. Instead, the Levene test is the alternative to Bartlett’s test that is less sensitive to departures from normality.
Data transformations are used in regression to describe curvature and can sometimes be usefully employed to correct for violation of the assumptions in a multiple linear regression model.