3 SLR Estimation and Prediction

I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 3 in STAT 501 is SLR Estimation and Prediction. The SAS version of the code below is posted on my blog.

Overview

This lesson covers confidence intervals for the mean response \(\bar{y}\) and prediction intervals.

For a given value of \(x\), the confidence interval is the interval estimate for the mean of the response variable \(\bar{y}\).
For a given value of \(x\), the prediction interval is the interval estimate of the response variable \(y\).

R Concepts Worth Noting

Create a confidence interval or prediction interval by fitting a model with lm, then creating a data frame data.frame for the explanatory variable test value, then running the predict function using the argument interval="confidence" or interval="prediction".
Graph the confidence interval and prediction interval in ggplot. ggplot shows the confidence interval with geom_smoth(method=lm, se=TRUE). To show the prediction interval, create a prediction interval and merge the results into the original data set, then plot the lwr and upr values.

Data Management for this Module

library(readr)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(alr3)

## Warning: package 'alr3' was built under R version 3.4.3

## Loading required package: car

## Warning: package 'car' was built under R version 3.4.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

# skincancer is a fixed-width data file.
# Mortality due to skin cancer (number of deaths per 10 million people), and 
# U.S. state latitude and longitude at center of state for n=49 states.
skincancer <- read_fwf(
  file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/skincancer.txt", 
  skip = 1,
  fwf_widths(c(25, 9, 7, 5, 5), c('state','lat','mort','ocean','lon')))

## Parsed with column specification:
## cols(
##   state = col_character(),
##   lat = col_double(),
##   mort = col_integer(),
##   ocean = col_integer(),
##   lon = col_double()
## )

#head(skincancer)
#summarise(skincancer, n=n())

# hospitalinfct_reg1and2 is a tab-delimited data file.
# Infection risk (percent of patients who get an infection) and average length 
# of stay (in days) in a sample of n=58 hospitals in the east and north 
# central U.S.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/hospitalinfct_reg1and2.txt'
hospitalinfct_reg1and2 <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(hospitalinfct_reg1and2)
#summarise(hospitalinfct_reg1and2, n=n())

3.1 Research Questions

\(\hat y_h = b_0 + b_1x_h\) is the mean response \(\mu_Y\) when the predictor value is \(x_h\), and it is the best guess of a new response \(y_{new}\) when the predictor value is \(x_h\).

For example,

The best estimate of the mean mortality rate due to skin cancer for all locations at 40 degrees north latitude is 389.19 - 5.97764(40) = 150 deaths per 10 million people.
The best prediction of the mortality rate due to skin cancer in Chambersburg, Pennsylvania is 389.19 - 5.97764(40) = 150 deaths per 10 million people.

The problem with the answers to the two research questions is that we would have obtained a different answer if we had selected a different random sample. There should be a confidence interval for \(\mu_y\) and a prediction interval for \(y_{new}\).

3.2 Confidence Interval for the Mean Response

The general formula for a confidence interval is always \(CI = \bar{y}_h \pm t_{\alpha/2,n-2} \times SE\). In the case of developing a confidence interval for the mean response, \(SE\) is the standard error of the fit. The formula for \(SE\) is \(SE = \sqrt{MSE \times (1/n + (x_h-\bar{x})^2 / \sum{(x_i-\bar{x})^2})}\). Notice the following:

As \(MSE\) decreases, the width of the interval decreases. I.e., the less the data vary naturally around the unknown population regression line, the tighter the confidence interval.
As \(\alpha\) increases, the width of the interval decreases.
As \(n\) increases, the width of the interval decreases.
The more spread out the predictor values, the larger is \(\sum{(x_i-\bar{x})^2}\), so the width of the interval decreases.
The closer \(x_h\) is to \(\bar{x}\), the smaller is \((x_h-\bar{x})^2\), so the width of the interval decreases.

Here is an example of a confidence interval for the mean response. Data set skincancer records one observation per U.S. state (n=49). The response variable mort is the mortality due to skin cancer (number of deaths per 10 million people) and the explanatory variable lat is the latitude (degrees North) at the center of each of 49 states in the U.S.

We can be 95% confident that the average skin cancer mortality rate of all locations at 40 degrees north is between 144.5617 and 155.6061 deaths per 10 million people.

skincancer.lm <- lm(mort ~ lat, data = skincancer)
newdata <- data.frame(lat=40)
predict(skincancer.lm, newdata, interval="confidence")

##        fit      lwr      upr
## 1 150.0839 144.5617 155.6061

3.3 Prediction Interval for a New Response

In the case of developing a prediction interval for the new response, \(SE\) is the standard error of the fit. The formula for \(SE\) is \(SE = \sqrt{MSE \times (1 + 1/n + (x_h-\bar{x})^2 / \sum{(x_i-\bar{x})^2})}\). The formula is similar to that of the confidence interval with the exception of the new “1” term. Notice the following:

The \((1-\alpha)\) confidence interval for \(\mu_Y\) at \(x_h\) will always be narrower than the corresponding \((1-\alpha)\) prediction interval for \(y_{new}\) at \(x_h\).
By calculating the interval at the sample’s mean of the predictor values (\(x_h=\bar{x})\) and increasing the sample size \(n\), the confidence interval’s standard error can approach 0, but the prediction interval’s standard error cannot get close to 0.

Revisit the skincancer example and request a prediction interval this time.

We can be 95% confident that the skin cancer mortality rate at an individual location at 40 degrees north will be between 111.235 and 188.9329 deaths per 10 million people.

predict(skincancer.lm, newdata, interval="predict")

##        fit     lwr      upr
## 1 150.0839 111.235 188.9329

# ggplot shows the confidence interveal with geom_smoth(method=lm, se=TRUE).
# To show the prediction interval, create a prediction interval and merge
# the results the original data set, then plot the lwr and upr values.
temp_var <- predict(skincancer.lm, interval="prediction")

## Warning in predict.lm(skincancer.lm, interval = "prediction"): predictions on current data refer to _future_ responses

skincancer.2 <- cbind(skincancer, temp_var)
ggplot (skincancer.2, aes(y=mort, x=lat)) +
  geom_point() +
  geom_smooth(method=lm, se=TRUE) +
  geom_line(aes(y=lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y=upr), color = "red", linetype = "dashed") +
  ylab("Mortality (Deaths per 10 million)") +
  xlab("Latitude (at center of state)") +
  ggtitle("Skin Cancer Mortality versus State Latitude")

3.4 Further Example

Consider an example of hospital infection risk. Data set hospitalinfct_reg1and2 is a sample of n=58 hospitals. The response variable InfctRsk is the percent of patients who get an infection and the explanatory varaible Stay is the average length of stay (in days).

hospitalinfct_reg1and2.lm <- lm(InfctRsk ~ Stay, data = hospitalinfct_reg1and2)
summary (hospitalinfct_reg1and2.lm)

## 
## Call:
## lm(formula = InfctRsk ~ Stay, data = hospitalinfct_reg1and2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6145 -0.4660  0.1388  0.4970  2.4310 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.15982    0.95580  -1.213     0.23    
## Stay         0.56887    0.09416   6.041  1.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.024 on 56 degrees of freedom
## Multiple R-squared:  0.3946, Adjusted R-squared:  0.3838 
## F-statistic:  36.5 on 1 and 56 DF,  p-value: 1.302e-07

predict(hospitalinfct_reg1and2.lm, data.frame(Stay=10), interval="confidence")

##        fit      lwr      upr
## 1 4.528846 4.259205 4.798486

predict(hospitalinfct_reg1and2.lm, data.frame(Stay=10), interval="prediction")

##        fit     lwr      upr
## 1 4.528846 2.45891 6.598781

temp_var <- predict(hospitalinfct_reg1and2.lm, interval="prediction")

## Warning in predict.lm(hospitalinfct_reg1and2.lm, interval = "prediction"): predictions on current data refer to _future_ responses

hospitalinfct_reg1and2.2 <- cbind(hospitalinfct_reg1and2, temp_var)
ggplot (hospitalinfct_reg1and2.2, aes(y=InfctRsk, x=Stay)) +
  geom_point() +
  geom_smooth(method=lm, se=TRUE) +
  geom_line(aes(y=lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y=upr), color = "red", linetype = "dashed") +
  ylab("Infection Rate (%)") +
  xlab("Length of Stay (Days)") +
  ggtitle("Hospital Infection Rate versus Length of Stay")

Observations:

We estimate with 95% confidence that in hospitals in which the average length of stay is 10 days, the mean infection risk is between 4.259205 and 4.798486.
We estimate with 95% confidence that for any future hospital where the average length of stay is 10 days, the infection risk is between 2.45891 and 6.598781.
The fit value is the regression line \(y(10) = -1.15982 + 0.56887(10)=4.528846\).
Notice the confidence limits for \(E(Y)\) are close to the regression line.
Notice the prediction limits bracket most of the data.