I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 3 in STAT 501 is SLR Estimation and Prediction. The SAS version of the code below is posted on my blog.

Overview

This lesson covers confidence intervals for the mean response \(\bar{y}\) and prediction intervals.

R Concepts Worth Noting

  • Create a confidence interval or prediction interval by fitting a model with lm, then creating a data frame data.frame for the explanatory variable test value, then running the predict function using the argument interval="confidence" or interval="prediction".

  • Graph the confidence interval and prediction interval in ggplot. ggplot shows the confidence interval with geom_smoth(method=lm, se=TRUE). To show the prediction interval, create a prediction interval and merge the results into the original data set, then plot the lwr and upr values.

Data Management for this Module

library(readr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
# skincancer is a fixed-width data file.
# Mortality due to skin cancer (number of deaths per 10 million people), and 
# U.S. state latitude and longitude at center of state for n=49 states.
skincancer <- read_fwf(
  file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/skincancer.txt", 
  skip = 1,
  fwf_widths(c(25, 9, 7, 5, 5), c('state','lat','mort','ocean','lon')))
## Parsed with column specification:
## cols(
##   state = col_character(),
##   lat = col_double(),
##   mort = col_integer(),
##   ocean = col_integer(),
##   lon = col_double()
## )
#head(skincancer)
#summarise(skincancer, n=n())
# hospitalinfct_reg1and2 is a tab-delimited data file.
# Infection risk (percent of patients who get an infection) and average length 
# of stay (in days) in a sample of n=58 hospitals in the east and north 
# central U.S.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/hospitalinfct_reg1and2.txt'
hospitalinfct_reg1and2 <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(hospitalinfct_reg1and2)
#summarise(hospitalinfct_reg1and2, n=n())

3.1 Research Questions

\(\hat y_h = b_0 + b_1x_h\) is the mean response \(\mu_Y\) when the predictor value is \(x_h\), and it is the best guess of a new response \(y_{new}\) when the predictor value is \(x_h\).

For example,

The problem with the answers to the two research questions is that we would have obtained a different answer if we had selected a different random sample. There should be a confidence interval for \(\mu_y\) and a prediction interval for \(y_{new}\).

3.2 Confidence Interval for the Mean Response

The general formula for a confidence interval is always \(CI = \bar{y}_h \pm t_{\alpha/2,n-2} \times SE\). In the case of developing a confidence interval for the mean response, \(SE\) is the standard error of the fit. The formula for \(SE\) is \(SE = \sqrt{MSE \times (1/n + (x_h-\bar{x})^2 / \sum{(x_i-\bar{x})^2})}\). Notice the following:

Here is an example of a confidence interval for the mean response. Data set skincancer records one observation per U.S. state (n=49). The response variable mort is the mortality due to skin cancer (number of deaths per 10 million people) and the explanatory variable lat is the latitude (degrees North) at the center of each of 49 states in the U.S.

We can be 95% confident that the average skin cancer mortality rate of all locations at 40 degrees north is between 144.5617 and 155.6061 deaths per 10 million people.

skincancer.lm <- lm(mort ~ lat, data = skincancer)
newdata <- data.frame(lat=40)
predict(skincancer.lm, newdata, interval="confidence") 
##        fit      lwr      upr
## 1 150.0839 144.5617 155.6061

3.3 Prediction Interval for a New Response

In the case of developing a prediction interval for the new response, \(SE\) is the standard error of the fit. The formula for \(SE\) is \(SE = \sqrt{MSE \times (1 + 1/n + (x_h-\bar{x})^2 / \sum{(x_i-\bar{x})^2})}\). The formula is similar to that of the confidence interval with the exception of the new “1” term. Notice the following:

Revisit the skincancer example and request a prediction interval this time.

We can be 95% confident that the skin cancer mortality rate at an individual location at 40 degrees north will be between 111.235 and 188.9329 deaths per 10 million people.

predict(skincancer.lm, newdata, interval="predict")
##        fit     lwr      upr
## 1 150.0839 111.235 188.9329
# ggplot shows the confidence interveal with geom_smoth(method=lm, se=TRUE).
# To show the prediction interval, create a prediction interval and merge
# the results the original data set, then plot the lwr and upr values.
temp_var <- predict(skincancer.lm, interval="prediction")
## Warning in predict.lm(skincancer.lm, interval = "prediction"): predictions on current data refer to _future_ responses
skincancer.2 <- cbind(skincancer, temp_var)
ggplot (skincancer.2, aes(y=mort, x=lat)) +
  geom_point() +
  geom_smooth(method=lm, se=TRUE) +
  geom_line(aes(y=lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y=upr), color = "red", linetype = "dashed") +
  ylab("Mortality (Deaths per 10 million)") +
  xlab("Latitude (at center of state)") +
  ggtitle("Skin Cancer Mortality versus State Latitude")

3.4 Further Example

Consider an example of hospital infection risk. Data set hospitalinfct_reg1and2 is a sample of n=58 hospitals. The response variable InfctRsk is the percent of patients who get an infection and the explanatory varaible Stay is the average length of stay (in days).

hospitalinfct_reg1and2.lm <- lm(InfctRsk ~ Stay, data = hospitalinfct_reg1and2)
summary (hospitalinfct_reg1and2.lm)
## 
## Call:
## lm(formula = InfctRsk ~ Stay, data = hospitalinfct_reg1and2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6145 -0.4660  0.1388  0.4970  2.4310 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.15982    0.95580  -1.213     0.23    
## Stay         0.56887    0.09416   6.041  1.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.024 on 56 degrees of freedom
## Multiple R-squared:  0.3946, Adjusted R-squared:  0.3838 
## F-statistic:  36.5 on 1 and 56 DF,  p-value: 1.302e-07
predict(hospitalinfct_reg1and2.lm, data.frame(Stay=10), interval="confidence")
##        fit      lwr      upr
## 1 4.528846 4.259205 4.798486
predict(hospitalinfct_reg1and2.lm, data.frame(Stay=10), interval="prediction")
##        fit     lwr      upr
## 1 4.528846 2.45891 6.598781
temp_var <- predict(hospitalinfct_reg1and2.lm, interval="prediction")
## Warning in predict.lm(hospitalinfct_reg1and2.lm, interval = "prediction"): predictions on current data refer to _future_ responses
hospitalinfct_reg1and2.2 <- cbind(hospitalinfct_reg1and2, temp_var)
ggplot (hospitalinfct_reg1and2.2, aes(y=InfctRsk, x=Stay)) +
  geom_point() +
  geom_smooth(method=lm, se=TRUE) +
  geom_line(aes(y=lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y=upr), color = "red", linetype = "dashed") +
  ylab("Infection Rate (%)") +
  xlab("Length of Stay (Days)") +
  ggtitle("Hospital Infection Rate versus Length of Stay")

Observations: