Textbook: Linear Models with R, 2nd, Julian J. Faraway

Prediction

data(fat,package="faraway")
Dataset: Percentage of Body Fat and Body Measurements

Description

Age, weight, height, and 10 body circumference measurements are recorded for 252 men. Each man’s percentage of body fat was accurately estimated by an underwater weighing technique.

head(fat)
##   brozek siri density age weight height adipos  free neck chest abdom   hip
## 1   12.6 12.3  1.0708  23 154.25  67.75   23.7 134.9 36.2  93.1  85.2  94.5
## 2    6.9  6.1  1.0853  22 173.25  72.25   23.4 161.3 38.5  93.6  83.0  98.7
## 3   24.6 25.3  1.0414  22 154.00  66.25   24.7 116.0 34.0  95.8  87.9  99.2
## 4   10.9 10.4  1.0751  26 184.75  72.25   24.9 164.7 37.4 101.8  86.4 101.2
## 5   27.8 28.7  1.0340  24 184.25  71.25   25.6 133.1 34.4  97.3 100.0 101.9
## 6   20.6 20.9  1.0502  24 210.25  74.75   26.5 167.0 39.0 104.5  94.4 107.8
##   thigh knee ankle biceps forearm wrist
## 1  59.0 37.3  21.9   32.0    27.4  17.1
## 2  58.7 37.3  23.4   30.5    28.9  18.2
## 3  59.6 38.9  24.0   28.8    25.2  16.6
## 4  60.1 37.3  22.8   32.4    29.4  18.2
## 5  63.2 42.2  24.0   32.2    27.7  17.7
## 6  66.0 42.0  25.6   35.7    30.6  18.8
dim(fat) #252 observations on 18 variables.
## [1] 252  18

Fit a linear regression model.

y = brozek: “Percent body fat using Brozek’s equation, 457/Density - 414.2.”

\[ Y=X\beta +\varepsilon, \varepsilon \sim N(0,\sigma^2 I)\]

lmod<-lm(brozek~age+weight+height+neck+chest+abdom+hip+thigh+knee+ankle+biceps+forearm+wrist,data=fat)
summary(lmod)
## 
## Call:
## lm(formula = brozek ~ age + weight + height + neck + chest + 
##     abdom + hip + thigh + knee + ankle + biceps + forearm + wrist, 
##     data = fat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.264  -2.572  -0.097   2.898   9.327 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.29255   16.06992  -0.952  0.34225    
## age           0.05679    0.02996   1.895  0.05929 .  
## weight       -0.08031    0.04958  -1.620  0.10660    
## height       -0.06460    0.08893  -0.726  0.46830    
## neck         -0.43754    0.21533  -2.032  0.04327 *  
## chest        -0.02360    0.09184  -0.257  0.79740    
## abdom         0.88543    0.08008  11.057  < 2e-16 ***
## hip          -0.19842    0.13516  -1.468  0.14341    
## thigh         0.23190    0.13372   1.734  0.08418 .  
## knee         -0.01168    0.22414  -0.052  0.95850    
## ankle         0.16354    0.20514   0.797  0.42614    
## biceps        0.15280    0.15851   0.964  0.33605    
## forearm       0.43049    0.18445   2.334  0.02044 *  
## wrist        -1.47654    0.49552  -2.980  0.00318 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.988 on 238 degrees of freedom
## Multiple R-squared:  0.749,  Adjusted R-squared:  0.7353 
## F-statistic: 54.63 on 13 and 238 DF,  p-value: < 2.2e-16

For individual \(i\), \(y_i = \beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{i(p-1)} + \varepsilon_i = x_i^T \beta + \varepsilon_i\), where \(\varepsilon_i \sim N(0,\sigma^2)\)

new point: \(x_0 = (x_{01},...,x_{0(p-1)})\) where \(x_{0j}=median(x_{1j},...,x_{nj})\)

  1. Predict a future observation (\(y_0 = x_0^T \beta +\varepsilon_0\)): Ex “What would a house with characteristics \(x_0\) rent?” A future observation with the covariate \(x_{0}\) is predicted to be \[\hat{y}_0 = x_0^T \hat{\beta} +\varepsilon_0\] where \(\varepsilon_0\) is unknown but we expect it has mean zero so the point prediction is \[\hat{y}_0 = x_0^T \hat{\beta}\] \[Var(\hat{y}_0)= x_0^T Var(\hat{\beta})x_0 +Var(\varepsilon_0)= x_0^T (X^TX)^{-1}x_0 \sigma^2 + \sigma^2\] The 95% CI for the future response is \[\hat{y}_0\pm t_{n-p}^{(0.025)}\hat{\sigma}\sqrt{1+x_0^T (X^TX)^{-1}x_0}\]
  2. Predict the mean response of a future observation (\(x_0^T \beta\)): Ex “What would a house with characteristics \(x_0\) rent for on average?” The point estimate for the mean response is \(x_0^T \hat{\beta}=\hat{y}_0\). \[Var(x_0^T \beta)= x_0^T Var(\hat{\beta})x_0 = x_0^T (X^TX)^{-1}x_0 \sigma^2 \] The 95% CI for the mean response \(x_0^T \beta\) is \[\hat{y}_0\pm t_{n-p}^{(0.025)}\hat{\sigma}\sqrt{x_0^T (X^TX)^{-1}x_0}\]
x<-model.matrix(lmod)
(x0<-apply(x,2,median))
## (Intercept)         age      weight      height        neck       chest 
##        1.00       43.00      176.50       70.00       38.00       99.65 
##       abdom         hip       thigh        knee       ankle      biceps 
##       90.95       99.30       59.00       38.50       22.80       32.05 
##     forearm       wrist 
##       28.70       18.30
(y0<-sum(x0*coef(lmod)))  # point estimate
## [1] 17.49322
predict(lmod,new=data.frame(t(x0)))
##        1 
## 17.49322
s.lmod<-summary(lmod)
# 95% CI for the prediction
predict.lower<-y0-qt(0.975,s.lmod$df[2])*s.lmod$sigma*sqrt(1+t(x0)%*%s.lmod$cov.unscaled%*%as.matrix(x0)) #lower bound
predict.upper<-y0+qt(0.975,s.lmod$df[2])*s.lmod$sigma*sqrt(1+t(x0)%*%s.lmod$cov.unscaled%*%as.matrix(x0)) #upper bound
c(predict.lower, predict.upper)
## [1]  9.61783 25.36861
predict(lmod,new=data.frame(t(x0)),interval="prediction") # CI for prediction
##        fit     lwr      upr
## 1 17.49322 9.61783 25.36861
# 95% CI for the prediction
p.mean.lower<-y0-qt(0.975,s.lmod$df[2])*s.lmod$sigma*sqrt(t(x0)%*%s.lmod$cov.unscaled%*%as.matrix(x0)) #lower bound
p.mean.upper<-y0+qt(0.975,s.lmod$df[2])*s.lmod$sigma*sqrt(t(x0)%*%s.lmod$cov.unscaled%*%as.matrix(x0)) #upper bound
c(p.mean.lower, p.mean.upper)
## [1] 16.94426 18.04219
predict(lmod,new=data.frame(t(x0)),interval="confidence") # CI for mean response
##        fit      lwr      upr
## 1 17.49322 16.94426 18.04219