How do we use a regression model for prediction?

Once the model is fitted we can use it to make predictions on the value of the response variable \(Y\), for any given value of \(x=x_0\).

For this example we will continue to use the {rocket} data.

rocket<-read.csv("https://raw.githubusercontent.com/kitadasmalley/sp21_MATH376LMT/main/data/rocketProp.csv",
                 header=TRUE)
colnames(rocket)<-c("obs", "shear", "age")

# fitting and store the SLR model
mod<-lm(shear~age, rocket)

1. Finding a Predicted Value

We simple input \(x=x_0\) into the fitted equation \[\hat{\beta}_0+\hat{\beta}_1 x_0\]

# create a new data object using the name of the x variable 
# evaluate at age =15
newdata<-data.frame(age=15)

# use the predict function to evaluate the fitted equation at x
predict(mod, newdata)
##        1 
## 2070.518

2. Confidence Interval for the mean response

\[\mu_{Y|x_0}=E[Y|x_0]=\beta_0+\beta_1 x_0\]

The form of a \(100(1-\alpha)%\) confidence interval for the mean response is given by

\[(\hat{\beta}_0+\hat{\beta}_1 x_0)\pm t_{df=n-2, \alpha/2}^* \times \sqrt{MS(Res)\times (\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2})}\]

# confidence interval for x=15
predict(mod, newdata, interval="confidence")
##        fit      lwr      upr
## 1 2070.518 2024.289 2116.748

3. Prediction interval for a new observation

A confidence interval for the mean response does not reflecct the bounds in which we are realisitically expecting to observe a single new observation at \(x=x_0\). A single observation has more variablity the average of observations. Therefore, we account for that by changing the error term.

The form of a \(100(1-\alpha)%\) prediction interval for the mean response is given by

\[(\hat{\beta}_0+\hat{\beta}_1 x_0)\pm t_{df=n-2, \alpha/2}^* \times \sqrt{MS(Res)\times (1+\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2})}\]

# prediction interval for x=15
predict(mod, newdata, interval="predict")
##        fit      lwr      upr
## 1 2070.518 1863.382 2277.655

4. Using ggplot2 and Plotting Confidence and Prediction Bands

1. Create confidence intervals for each x value in the data
confBand<-predict(mod, interval="confidence")
2. Create prediction intervals for each x value in the data
predBand<-predict(mod, interval="predict")
## Warning in predict.lm(mod, interval = "predict"): predictions on current data refer to _future_ responses
3. Format the data into a new dataframe
# rename cols in predBand to be unique from confBand
colnames(predBand)<-c("fit2", "lwr2", "upr2")

# create a new dataframe with the data and new bands
newDF<-cbind(rocket, confBand, predBand)
4. Plot bands
# use tidyverse to make graphic of bands
# install.packages(“tidyverse”)
library(tidyverse)

ggplot(newDF, aes(x=age, y=shear))+
  geom_point()+
  geom_abline(slope=mod$coefficients[2], intercept=mod$coefficients[1],
              color="blue", lty=2, lwd=1)+
  geom_line(aes(y=lwr), color="green", lty=2, lwd=1)+
  geom_line(aes(y=upr), color="green", lty=2, lwd=1)+
  geom_line(aes(y=lwr2), color="red", lty=2, lwd=1)+
  geom_line(aes(y=upr2), color="red", lty=2, lwd=1)+
  theme_bw()

BONUS: We can check these values built into R

Caution: Be careful of order of operations

age<-rocket$age
shear<-rocket$shear

beta_0<-mod$coefficients[1]
beta_0
## (Intercept) 
##    2627.822
beta_1<-mod$coefficients[2]
beta_1
##       age 
## -37.15359
x_0<-15

pointEst<-beta_0+beta_1*x_0
pointEst
## (Intercept) 
##    2070.518
ss_res<-sum((shear-mod$fitted)^2)
n<-length(mod$residuals)
ms_res<-ss_res/(n-2)  

seConf<-sqrt(ms_res*(1 / n + (x_0 - mean(age))^2 / (sum((age - mean(age))^2))))
seConf 
## [1] 22.00456
pointEst+c(-1,1)*qt(0.975, n-2)*seConf
## [1] 2024.289 2116.748
sePred<-sqrt(ms_res*(1+(1 / n) + (x_0 - mean(age))^2 / (sum((age - mean(age))^2))))
sePred 
## [1] 98.59301
pointEst+c(-1,1)*qt(0.975, n-2)*sePred
## [1] 1863.382 2277.655