Chapter 11 (Faraway book) exercise 4

Take the fat data, and use the percentage of body fat, siri, as the response and the other variables, except brozek and density as potential predictors. Remove every tenth observation from the data for use as a test sample. Use the remaining data as a training sample building the following models:

  1. Linear regression with all predictors
  2. Linear regression with variables selected using AIC
  3. Principal component regression
  4. Partial least squares
  5. Ridge regression

Use the models you find to predict the response in the test sample. Make a report on the performances of the models.

rm(list=ls(all=TRUE)) 
library(faraway)
library(MASS)
## Warning: package 'MASS' was built under R version 3.1.3
library(leaps)
data(fat)
attach(fat)

dim(fat)
## [1] 252  18
names(fat)
##  [1] "brozek"  "siri"    "density" "age"     "weight"  "height"  "adipos" 
##  [8] "free"    "neck"    "chest"   "abdom"   "hip"     "thigh"   "knee"   
## [15] "ankle"   "biceps"  "forearm" "wrist"
test = seq(10,252,by=10) #index of every 10th obs
tr = fat[-test,] #treatment data
te = fat[test,] #test data

This “fat” dataset has 252 observations on 18 variables. We use “siri” as dependent variable, and others except for “brozek” and “density” as independent variables. Also we use the every 10-th observation as test data. So n=252-25=227, p=16 for the full model.

We use Root Mean Square Error(RMSE) to measure the performance in fitting and predicting.

rmse <- function(x,y) sqrt(mean((x-y)^2))

(a) Linear regression with all predictors

g1 <-lm(siri~.-brozek -density, tr)
summary(g1)
## 
## Call:
## lm(formula = siri ~ . - brozek - density, data = tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8314 -0.6722  0.1828  0.9150  6.6619 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.591885   6.448868  -1.953 0.052193 .  
## age           0.007978   0.012320   0.648 0.517983    
## weight        0.362999   0.023314  15.570  < 2e-16 ***
## height        0.049026   0.040315   1.216 0.225315    
## adipos       -0.514032   0.114074  -4.506 1.09e-05 ***
## free         -0.564773   0.014889 -37.933  < 2e-16 ***
## neck          0.016525   0.089863   0.184 0.854272    
## chest         0.120219   0.039590   3.037 0.002694 ** 
## abdom         0.140108   0.042186   3.321 0.001056 ** 
## hip           0.006197   0.056101   0.110 0.912148    
## thigh         0.195057   0.054460   3.582 0.000424 ***
## knee          0.106637   0.093534   1.140 0.255542    
## ankle         0.125118   0.081303   1.539 0.125325    
## biceps        0.096199   0.064656   1.488 0.138278    
## forearm       0.230775   0.073332   3.147 0.001888 ** 
## wrist         0.139279   0.206804   0.673 0.501378    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.55 on 211 degrees of freedom
## Multiple R-squared:  0.9692, Adjusted R-squared:  0.967 
## F-statistic: 442.5 on 15 and 211 DF,  p-value: < 2.2e-16
rmse(g1$fit, tr$siri) #RMSE of the fitted value vs. observed value 
## [1] 1.494315
pred1 <- predict(g1, te)
y10 <- te$siri
rmse(pred1,y10)  #RMSE of the predicted value vs. observed value (every 10th observation)
## [1] 1.131529

The predicted result seems to be pretty good (with RMSE 1.1315288), even better than the overall fitted model (RMSE 1.4943146) .

(b) Linear regression with variables selected using AIC

g2 <- step(g1)
summary(g2)
## 
## Call:
## lm(formula = siri ~ weight + adipos + free + chest + abdom + 
##     thigh + knee + ankle + biceps + forearm, data = tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7926 -0.6839  0.2371  0.8824  6.8655 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.24766    3.94171  -1.839 0.067331 .  
## weight       0.36973    0.01994  18.546  < 2e-16 ***
## adipos      -0.57022    0.09459  -6.028 7.09e-09 ***
## free        -0.55965    0.01400 -39.978  < 2e-16 ***
## chest        0.12099    0.03809   3.176 0.001709 ** 
## abdom        0.15824    0.03908   4.049 7.18e-05 ***
## thigh        0.16140    0.04551   3.546 0.000479 ***
## knee         0.12767    0.08749   1.459 0.145947    
## ankle        0.13817    0.07858   1.758 0.080106 .  
## biceps       0.11222    0.06317   1.777 0.077055 .  
## forearm      0.24281    0.06990   3.474 0.000620 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.544 on 216 degrees of freedom
## Multiple R-squared:  0.9687, Adjusted R-squared:  0.9673 
## F-statistic: 668.6 on 10 and 216 DF,  p-value: < 2.2e-16
rmse(g2$fit, tr$siri) #RMSE of the fitted value vs. observed value 
## [1] 1.505982
pred2 <- predict(g2, te)
rmse(pred2,y10)  #RMSE of the predicted value vs. observed value (10th observation)
## [1] 1.12202

Using AIC, we can get rid of 5 variables, and the predicted value (RMSE 1.1220197) is better than using the full model (RMSE 1.1315288)

(c) Principal component regression

#compute PCA on X variables:
pca <- prcomp(tr[,4:18])
#the square root of eigenvalues:
round(pca$sdev,3)
##  [1] 37.047 14.540  9.511  3.634  3.462  2.544  2.200  1.854  1.577  1.412
## [11]  1.324  1.249  1.037  0.828  0.491
plot(pca$sdev, type="l",ylab="the square root of eigenvalues", xlab="PC number")

We can see that the contribution drops very quickly, most variance can be explained by the first 5 PCs.

matplot(1:15,pca$rot[,1:5],type="l", xlab="Frequency")

Seems that the 1st PC is a linear combination of most the 2nd and 5th variables (weight and free), which agrees with results from previous steps, where the regression coefficients of weight and free are most significant.

# Do regression on PCs:
g3 <- lm(tr$siri ~ pca$x[,1:6])
summary(g3)
## 
## Call:
## lm(formula = tr$siri ~ pca$x[, 1:6])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3999  -0.5813   0.3739   0.9771   7.9487 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     19.180176   0.110619 173.389  < 2e-16 ***
## pca$x[, 1:6]PC1 -0.126730   0.002992 -42.350  < 2e-16 ***
## pca$x[, 1:6]PC2 -0.356175   0.007625 -46.714  < 2e-16 ***
## pca$x[, 1:6]PC3  0.464783   0.011656  39.875  < 2e-16 ***
## pca$x[, 1:6]PC4  0.309275   0.030508  10.138  < 2e-16 ***
## pca$x[, 1:6]PC5 -0.091998   0.032023  -2.873  0.00447 ** 
## pca$x[, 1:6]PC6 -0.222500   0.043574  -5.106 7.11e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.667 on 220 degrees of freedom
## Multiple R-squared:  0.9629, Adjusted R-squared:  0.9618 
## F-statistic: 950.5 on 6 and 220 DF,  p-value: < 2.2e-16

Now do prediction. We need to first center the predictors.

mm = apply(tr[,4:18],2,mean) #the mean of each X variables
tx = as.matrix(sweep(te[,4:18],2,mm)) #centering the predicting data.

nx = tx %*% pca$rot[,1:6]
pred3 = cbind(1,nx) %*% g3$coef

rmse(pred3,y10)
## [1] 1.141839

So the RMSE of PCA using the first 6 PCs is about the same with AIC.

a <- numeric(15)
for (i in 1:15) {
  nx = tx %*% pca$rot[,1:i]
  model4 = lm(tr$siri ~pca$x[,1:i])
  pred4 = cbind(1,nx) %*% model4$coef
  a[i] = rmse(pred4,y10)
}
plot(a, ylab="Test RMSE",xlab="PC number")

which.min(a)
## [1] 7

So when using the first 7 PCs gives the best RMSE.

#cross validataion
library(pls)
## Warning: package 'pls' was built under R version 3.1.3
## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:stats':
## 
##     loadings
trainx = as.matrix(sweep(tr[,4:18],2,mm))
pcrg = pcr(tr$siri~trainx, ncomp=15, validation="CV",grpsize=10)
par(mfrow=c(1,2))
plot(pcrg$validat$adj[1,], xlab="PC number",ylab="CV adj")
plot(MSEP(pcrg)$val[1,,], xlab="PC number", ylab="CV RMS")

#what's the differece between these two plot? Just two alternative ways?

(d) Partial least squares

par(mfrow=c(1,2))
gpls <- plsr(tr$siri~trainx, ncomp=15, validation="CV",grpsize=10)
plot(gpls$validat$adj[1,], xlab="PC number",ylab="CV adj")
plot(MSEP(gpls)$val[1,,], xlab="PC number", ylab="CV RMS")

#seems like the first 4 PC covers most variance, so we'll just use the first 4.

ypred.tr = predict(gpls,  ncomp=4)
rmse(ypred.tr, tr$siri)
## [1] 1.612759
#Testing data 
testx = as.matrix(sweep(te[,4:18],2,mm))
ypred.te = predict(gpls,newdata=testx,ncomp=4)
rmse(ypred.te,te$siri)
## [1] 1.12449

So this method gives reasonably small RMSE.

(e) Ridge regression

par(mfrow=c(1,1))
yc = tr$siri - mean(tr$siri)
gridge = lm.ridge(yc ~trainx, lambda=seq(0,10,1e-4))
matplot(gridge$lambda,t(gridge$coef), type="l",lty=1,xlab=expression(lambda),ylab=expression(hat(beta)),main="the Ridge Trace plot")

#Various automatic selections for lambda are available:
select(gridge)
## modified HKB estimator is 0.1340921 
## modified L-W estimator is 0.4446052 
## smallest value of GCV  at 0.0467
abline(v=0.0467)

#training data
yfit = scale(trainx, center=F, scale=gridge$scales) %*% gridge$coef[,468] + mean(tr$siri)
rmse(yfit,tr$siri )
## [1] 1.494422
#testing 
ypred = scale(testx, center=F, scale=gridge$scales) %*% gridge$coef[,468] + mean(tr$siri)
rmse(ypred,te$siri )
## [1] 1.128089
Summary of different models’ RMSE:

—————|used data | test data

Full model | 1.4943146 | 1.1315288

AIC——- | 1.505982 | 1.1220197

PCA—— | 1.6407477 | 1.1418387

PLS——- | 1.6127595 | 1.1244896

Ridge—– | 1.4944225 | 1.1280885