Take the fat data, and use the percentage of body fat, siri, as the response and the other variables, except brozek and density as potential predictors. Remove every tenth observation from the data for use as a test sample. Use the remaining data as a training sample building the following models:
Use the models you find to predict the response in the test sample. Make a report on the performances of the models.
rm(list=ls(all=TRUE))
library(faraway)
library(MASS)
## Warning: package 'MASS' was built under R version 3.1.3
library(leaps)
data(fat)
attach(fat)
dim(fat)
## [1] 252 18
names(fat)
## [1] "brozek" "siri" "density" "age" "weight" "height" "adipos"
## [8] "free" "neck" "chest" "abdom" "hip" "thigh" "knee"
## [15] "ankle" "biceps" "forearm" "wrist"
test = seq(10,252,by=10) #index of every 10th obs
tr = fat[-test,] #treatment data
te = fat[test,] #test data
This “fat” dataset has 252 observations on 18 variables. We use “siri” as dependent variable, and others except for “brozek” and “density” as independent variables. Also we use the every 10-th observation as test data. So n=252-25=227, p=16 for the full model.
We use Root Mean Square Error(RMSE) to measure the performance in fitting and predicting.
rmse <- function(x,y) sqrt(mean((x-y)^2))
g1 <-lm(siri~.-brozek -density, tr)
summary(g1)
##
## Call:
## lm(formula = siri ~ . - brozek - density, data = tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8314 -0.6722 0.1828 0.9150 6.6619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.591885 6.448868 -1.953 0.052193 .
## age 0.007978 0.012320 0.648 0.517983
## weight 0.362999 0.023314 15.570 < 2e-16 ***
## height 0.049026 0.040315 1.216 0.225315
## adipos -0.514032 0.114074 -4.506 1.09e-05 ***
## free -0.564773 0.014889 -37.933 < 2e-16 ***
## neck 0.016525 0.089863 0.184 0.854272
## chest 0.120219 0.039590 3.037 0.002694 **
## abdom 0.140108 0.042186 3.321 0.001056 **
## hip 0.006197 0.056101 0.110 0.912148
## thigh 0.195057 0.054460 3.582 0.000424 ***
## knee 0.106637 0.093534 1.140 0.255542
## ankle 0.125118 0.081303 1.539 0.125325
## biceps 0.096199 0.064656 1.488 0.138278
## forearm 0.230775 0.073332 3.147 0.001888 **
## wrist 0.139279 0.206804 0.673 0.501378
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.55 on 211 degrees of freedom
## Multiple R-squared: 0.9692, Adjusted R-squared: 0.967
## F-statistic: 442.5 on 15 and 211 DF, p-value: < 2.2e-16
rmse(g1$fit, tr$siri) #RMSE of the fitted value vs. observed value
## [1] 1.494315
pred1 <- predict(g1, te)
y10 <- te$siri
rmse(pred1,y10) #RMSE of the predicted value vs. observed value (every 10th observation)
## [1] 1.131529
The predicted result seems to be pretty good (with RMSE 1.1315288), even better than the overall fitted model (RMSE 1.4943146) .
g2 <- step(g1)
summary(g2)
##
## Call:
## lm(formula = siri ~ weight + adipos + free + chest + abdom +
## thigh + knee + ankle + biceps + forearm, data = tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7926 -0.6839 0.2371 0.8824 6.8655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.24766 3.94171 -1.839 0.067331 .
## weight 0.36973 0.01994 18.546 < 2e-16 ***
## adipos -0.57022 0.09459 -6.028 7.09e-09 ***
## free -0.55965 0.01400 -39.978 < 2e-16 ***
## chest 0.12099 0.03809 3.176 0.001709 **
## abdom 0.15824 0.03908 4.049 7.18e-05 ***
## thigh 0.16140 0.04551 3.546 0.000479 ***
## knee 0.12767 0.08749 1.459 0.145947
## ankle 0.13817 0.07858 1.758 0.080106 .
## biceps 0.11222 0.06317 1.777 0.077055 .
## forearm 0.24281 0.06990 3.474 0.000620 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.544 on 216 degrees of freedom
## Multiple R-squared: 0.9687, Adjusted R-squared: 0.9673
## F-statistic: 668.6 on 10 and 216 DF, p-value: < 2.2e-16
rmse(g2$fit, tr$siri) #RMSE of the fitted value vs. observed value
## [1] 1.505982
pred2 <- predict(g2, te)
rmse(pred2,y10) #RMSE of the predicted value vs. observed value (10th observation)
## [1] 1.12202
Using AIC, we can get rid of 5 variables, and the predicted value (RMSE 1.1220197) is better than using the full model (RMSE 1.1315288)
#compute PCA on X variables:
pca <- prcomp(tr[,4:18])
#the square root of eigenvalues:
round(pca$sdev,3)
## [1] 37.047 14.540 9.511 3.634 3.462 2.544 2.200 1.854 1.577 1.412
## [11] 1.324 1.249 1.037 0.828 0.491
plot(pca$sdev, type="l",ylab="the square root of eigenvalues", xlab="PC number")
We can see that the contribution drops very quickly, most variance can be explained by the first 5 PCs.
matplot(1:15,pca$rot[,1:5],type="l", xlab="Frequency")
Seems that the 1st PC is a linear combination of most the 2nd and 5th variables (weight and free), which agrees with results from previous steps, where the regression coefficients of weight and free are most significant.
# Do regression on PCs:
g3 <- lm(tr$siri ~ pca$x[,1:6])
summary(g3)
##
## Call:
## lm(formula = tr$siri ~ pca$x[, 1:6])
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3999 -0.5813 0.3739 0.9771 7.9487
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.180176 0.110619 173.389 < 2e-16 ***
## pca$x[, 1:6]PC1 -0.126730 0.002992 -42.350 < 2e-16 ***
## pca$x[, 1:6]PC2 -0.356175 0.007625 -46.714 < 2e-16 ***
## pca$x[, 1:6]PC3 0.464783 0.011656 39.875 < 2e-16 ***
## pca$x[, 1:6]PC4 0.309275 0.030508 10.138 < 2e-16 ***
## pca$x[, 1:6]PC5 -0.091998 0.032023 -2.873 0.00447 **
## pca$x[, 1:6]PC6 -0.222500 0.043574 -5.106 7.11e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.667 on 220 degrees of freedom
## Multiple R-squared: 0.9629, Adjusted R-squared: 0.9618
## F-statistic: 950.5 on 6 and 220 DF, p-value: < 2.2e-16
Now do prediction. We need to first center the predictors.
mm = apply(tr[,4:18],2,mean) #the mean of each X variables
tx = as.matrix(sweep(te[,4:18],2,mm)) #centering the predicting data.
nx = tx %*% pca$rot[,1:6]
pred3 = cbind(1,nx) %*% g3$coef
rmse(pred3,y10)
## [1] 1.141839
So the RMSE of PCA using the first 6 PCs is about the same with AIC.
a <- numeric(15)
for (i in 1:15) {
nx = tx %*% pca$rot[,1:i]
model4 = lm(tr$siri ~pca$x[,1:i])
pred4 = cbind(1,nx) %*% model4$coef
a[i] = rmse(pred4,y10)
}
plot(a, ylab="Test RMSE",xlab="PC number")
which.min(a)
## [1] 7
So when using the first 7 PCs gives the best RMSE.
#cross validataion
library(pls)
## Warning: package 'pls' was built under R version 3.1.3
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:stats':
##
## loadings
trainx = as.matrix(sweep(tr[,4:18],2,mm))
pcrg = pcr(tr$siri~trainx, ncomp=15, validation="CV",grpsize=10)
par(mfrow=c(1,2))
plot(pcrg$validat$adj[1,], xlab="PC number",ylab="CV adj")
plot(MSEP(pcrg)$val[1,,], xlab="PC number", ylab="CV RMS")
#what's the differece between these two plot? Just two alternative ways?
par(mfrow=c(1,2))
gpls <- plsr(tr$siri~trainx, ncomp=15, validation="CV",grpsize=10)
plot(gpls$validat$adj[1,], xlab="PC number",ylab="CV adj")
plot(MSEP(gpls)$val[1,,], xlab="PC number", ylab="CV RMS")
#seems like the first 4 PC covers most variance, so we'll just use the first 4.
ypred.tr = predict(gpls, ncomp=4)
rmse(ypred.tr, tr$siri)
## [1] 1.612759
#Testing data
testx = as.matrix(sweep(te[,4:18],2,mm))
ypred.te = predict(gpls,newdata=testx,ncomp=4)
rmse(ypred.te,te$siri)
## [1] 1.12449
So this method gives reasonably small RMSE.
par(mfrow=c(1,1))
yc = tr$siri - mean(tr$siri)
gridge = lm.ridge(yc ~trainx, lambda=seq(0,10,1e-4))
matplot(gridge$lambda,t(gridge$coef), type="l",lty=1,xlab=expression(lambda),ylab=expression(hat(beta)),main="the Ridge Trace plot")
#Various automatic selections for lambda are available:
select(gridge)
## modified HKB estimator is 0.1340921
## modified L-W estimator is 0.4446052
## smallest value of GCV at 0.0467
abline(v=0.0467)
#training data
yfit = scale(trainx, center=F, scale=gridge$scales) %*% gridge$coef[,468] + mean(tr$siri)
rmse(yfit,tr$siri )
## [1] 1.494422
#testing
ypred = scale(testx, center=F, scale=gridge$scales) %*% gridge$coef[,468] + mean(tr$siri)
rmse(ypred,te$siri )
## [1] 1.128089
—————|used data | test data
Full model | 1.4943146 | 1.1315288
AIC——- | 1.505982 | 1.1220197
PCA—— | 1.6407477 | 1.1418387
PLS——- | 1.6127595 | 1.1244896
Ridge—– | 1.4944225 | 1.1280885