7 MLR Estimation, Prediction & Model Assumptions

I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 7 in STAT 501 is MLR Estimation, Prediction & Model Assumptions. The SAS version of the code below is posted on my blog.

Overview

This lesson covers using the MLR to create a confidence interval for the mean response value and for the predicted value.

This lesson also covers evaluation of the model assumptions.

R Concepts Worth Noting

Data Management for this Module

library(readr)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(alr3)

## Warning: package 'alr3' was built under R version 3.4.3

## Loading required package: car

## Warning: package 'car' was built under R version 3.4.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(reshape2)

## Warning: package 'reshape2' was built under R version 3.4.3

library(nortest)

# iqsize is a tab-delimited data file.
# PIQ, brain size, hight, and weight of a sample of n=38 students.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/iqsize.txt'
iqsize <- read.table(url, sep = '\t', header = TRUE)
#head(iqsize)
#summarise(iqsize, n=n())

# alcoholarm is a tab-delimited data file.
# Total lifetime consumption of alcohol and deltoid muscle strength of nondominant arm for n=50 alcoholic men.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/alcoholarm.txt'
alcoholarm <- read.table(url, sep = '\t', header = TRUE)
#head(alcoholarm)
#summarise(alcoholarm, n=n())

# allenstest is a tab-delimited data file.
# scores in Allen Cognitive Level (ACL), vocabulary, abstraction, and symbol-Digit Modalities Test (SDMT) for a sample of n=69 patients.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/allentest.txt'
allenstest <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(allenstest)
#summarise(allenstest, n=n())

# peru is a tab-delimited data file.
# Variables possibly relating to blood pressures of n = 39 Peruvians.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/peru.txt'
peru <- read.table(url, sep = '\t', header = TRUE)
peru$FracLife = peru$Years / peru$Age
#head(peru)
#summarise(peru, n=n())

# anscombe is a fixed-width data file.
# 
physical <- read_fwf(
  file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/Physical.txt", 
  skip = 1,
  fwf_widths(c(3, 9, 8, 6, 9, 7, 8, 7, 10, 6, 3), c('Obs', 'Sex', 'Height', 'LeftArm', 'RtArm', 'LeftFoot', 'RtFoot', 'LeftHand', 'RtHand', 'HeadCirc', 'nose')))

## Parsed with column specification:
## cols(
##   Obs = col_integer(),
##   Sex = col_character(),
##   Height = col_double(),
##   LeftArm = col_double(),
##   RtArm = col_double(),
##   LeftFoot = col_double(),
##   RtFoot = col_double(),
##   LeftHand = col_double(),
##   RtHand = col_double(),
##   HeadCirc = col_double(),
##   nose = col_double()
## )

#head(physical)
#summarise(physical, n=n())

7.1 - Confidence Interval for the Mean Response

The confidence interval for a mean response is \(\hat{y}_h \pm t_{(\alpha/2, n-p)} \times se(\hat{y}_h)\) where \(se(\hat{y}_h) = \sqrt{MSE(X_h^t(X^TX)^{-1}X_h)}\).

Here is an example with the iqsize data set where we fit the model PIQ ~ Brain + Height.

iqsize.lm <- lm(PIQ ~ Brain + Height, data = iqsize)
predict(iqsize.lm, data.frame(Brain=90, Height=70),
        interval="confidence")

##        fit      lwr     upr
## 1 105.6391 98.23722 113.041

“We can be 95% confident that the average performance IQ score of all college students with brain size = 90 and height = 70 is between 98.24 and 113.04 counts per 10,000.”

Note the following from the formula for the confidence interval:

The interval scales with the MSE. Recall the MSE is the natural variation of the response variable around the population regression hyperplane.
The interval decreases with increasing sample size n.
The interval decreases the closer \(X_h\) is to the average of the predictor values.

7.2 Prediction Interval for a New Response

The prediction interval for a mean response is \(\hat{y}_h \pm t_{(\alpha/2, n-p)} \times se)\) where \(se = \sqrt{MSE+se(\hat{y}_h)^2}\).

Here is an example with the iqsize data set where we fit the model PIQ ~ Brain + Height.

iqsize.lm <- lm(PIQ ~ Brain + Height, data = iqsize)
predict(iqsize.lm, data.frame(Brain=90, Height=70),
        interval="prediction")

##        fit      lwr      upr
## 1 105.6391 65.34688 145.9314

“We can be 95% confident that the performance IQ score of a college student with brain size = 90 and height = 70 is between 65.35 and 145.93 counts per 10,000.”

Note the following from the formula for the predictions interval:

The interval is very sensitive to the normality assumption.
While the confidence interval can approach 0 with increasing n, the prediction interval cannot.

7.3 MLR Model Assumptions

The four “LINE” conditions of the SLR also apply to the MLR. The errors \(e_i\) are independent normal random variables with mean zero and constant variance \(\sigma^2\).

The mean response \(E(Y_i)\) is a linear function of each set of predictors \(X_i\). I.e., \(E(e_i)=0\).
The errors \(e_i\) are independent.
The errors \(e_i\) are normally distributed at each set of predictors \(X_i\).
The errors \(e_i\) have equal variances \(\sigma^2\) at each set of predictors \(X_i\).

7.4 Assessing the Model Assumptions

The SLR model assumption assessment methods also apply to MLR:

Residuals vs fitted values plot to visually assess whether:
- the average of the residuals is ~0 at all fitted values (affirms linearity).
- the spread of the residuals is ~constant at all fitted values (affirms equal variances).
- there are no excessively outlying points.
- (violation of any of these three may necessitate remedial action (such as transforming one or more predictors and/or the response variable)
Residuals vs order plot (if observations were collected over time (or space)) and assess whether:
- there is no systematic non-random pattern (affirms independence).
- (violation may necessitate a time series model)
Residuals vs Predictors scatterplot matrix to assess whether:
- the average of the residuals is ~0 at all fitted values (affirms linearity).
- the spread of the residuals is ~constant at all fitted values (affirms equal variances).
- (violation of any of these three may necessitate remedial action (such as transforming one or more predictors and/or the response variable)
Histogram, boxplot, and/or normal probability plot of residuals
- approximate normality (affirms normality).
Residuals vs Omitted Predictors scatterplot matrix to assess whether:
- there are no strong linear or simple nonlinear trends in the plot.
- violation may indicate the predictor in question (or a transformation of the predictor) might be usefully added to the model.
- it can sometimes be helpful to plot functions of predictor variables (e.g. interaction terms consisting of one quantitative predictor multiplied by another quantitative predictor). A strong linear or simple nonlinear trend in the resulting plot may indicate the variable might be usefully added to the model.

Consider the iqsize regression of PIQ ~ Brain Height. Here is a residuals vs fits plot. The average of the residuals ~0, the variation of the residuals ~constant, and there are no excessively outlying points.

iqsize.lm <- lm(PIQ ~ Brain + Height, data = iqsize)
iqsize.res <- data.frame(res=resid(iqsize.lm))
iqsize.fit <- data.frame(fit=fitted(iqsize.lm))
iqsize <- cbind(iqsize, iqsize.res, iqsize.fit)
#summary(iqsize.lm)
ggplot(data=iqsize, aes(y = res, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  xlab("Fitted Value") +
  ylab("Residual") +
  ggtitle("Residuals vs. Fits Plot \n(response is PIQ)")

There is no time (or space) variable in this dataset so the next plot we’ll consider is a residuals vs predictor scatterplot matrix, starting with Brain. The average of the residuals ~0, the variation of the residuals ~constant, and there are no excessively outlying points. Also, there is no strong nonlinear trend in this plot that might suggest a transformation of PIQ or Brain in this model.

ggplot(data=iqsize, aes(y = res, x = Brain)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  xlab("Brain") +
  ylab("Residual") +
  ggtitle("Residuals vs. Predictor \n(response is PIQ)")

The second residuals vs predictor scatterplot matrix is with Height. The average of the residuals ~0, the variation of the residuals ~constant, and there are no excessively outlying points. Also, there is no strong nonlinear trend in this plot that might suggest a transformation of PIQ or Brain in this model.

ggplot(data=iqsize, aes(y = res, x = Height)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  xlab("Height") +
  ylab("Residual") +
  ggtitle("Residuals vs. Predictor \n(response is PIQ)")

Next consider a residuals histogram. Although this doesn’t have the ideal bell-shaped appearance, given the small sample size there’s little to suggest violation of the normality assumption.

ggplot(data=iqsize, aes(x = res)) +
  geom_histogram(binwidth = 10) +
  xlab("Frequency") +
  ylab("Residual") +
  ggtitle("Residuals Histogram \n(response is PIQ)")

Since the appearance of a histogram can be strongly influenced by the choice of intervals for the bars, to confirm this we can also look at a normal probability plot of the residuals. Again, given the small sample size there’s little to suggest violation of the normality assumption.

qqnorm(iqsize$res); qqline(iqsize$res)

The final plot we’ll consider is a residuals vs omitted predictor plot for the excluded variable Weight. Since there is no strong linear or simple nonlinear trend in this plot, there is nothing to suggest that Weight might be usefully added to the model.

ggplot(data=iqsize, aes(y = res, x = Weight)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  xlab("Weight") +
  ylab("Residual") +
  ggtitle("Residuals vs. Predictor \n(response is PIQ)")

7.5 Tests for Error Normality

There are error normality tests, \(H_0\): normal.

Anderson-Darling Test

The Anderson-Darling Test measures the area between a fitted line (based on the chosen distribution) and a nonparametric step function (based on the plot points). Again, using the iqsize regression of PIQ ~ Brain + Height, the Anderson-Darling test p-value is .6857, so do not reject \(H_0\) of normality.

ad.test(iqsize$res)

## 
##  Anderson-Darling normality test
## 
## data:  iqsize$res
## A = 0.26206, p-value = 0.6857

Shapiro-Wilk Test

The Shapiro-Wilk Test compares its test statistic to a distribution. Using the iqsize regression of PIQ ~ Brain + Height, the Shapiro-Wilk test p-value is .5764, so do not reject \(H_0\) of normality.

shapiro.test(iqsize$res)

## 
##  Shapiro-Wilk normality test
## 
## data:  iqsize$res
## W = 0.976, p-value = 0.5764

Ryan-Joiner Test

The Ryan-Joiner Test compares its test statistic to a distribution. I don’t know of an R implementation of this test.

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov Test compares its test statistic to a distribution. I don’t know of an R implementation of this test.

7.6 Tests for Constant Error Variance

There are tests of constant variance. Visual test are usually sufficient, but these tests provide an added layer of justification.

F-test

Partition the residuals of observations into two groups - one consisting of residuals associated with the lowest predictor values and the other consisting of those belonging to the highest predictor values.

Modified Levene (Brown-Forsythe) Test

Group the residuals into g groups according to the values of the quantity on the horizontal axis of the residual plot.

Breusch-Pagan Test

The Breusch-Pagan test (also known as the Cook-Weisberg score test) is an alternative to the modified Levene test.

Bartlett’s Test

Bartlett’s test is highly sensitive to the normality assumption, so if the residuals do not appear normal (even after transformations), then this test should not be used. Instead, the Levene test is the alternative to Bartlett’s test that is less sensitive to departures from normality.

7.7 Data Transformations

Data transformations are used in regression to describe curvature and can sometimes be usefully employed to correct for violation of the assumptions in a multiple linear regression model.