The problems on this assignment require you to use GAM - boosted GAM models to predict a response variable from a given set of covariate variables. Answer all questions specified on the problem but if you see something interesting and want to do more analysis please report it as well. Don’t forget to include discussion and to make your plots look nice and presentable. Submit your .rmd file with the knitted PDF (or knitted Word Document saved as a PDF). If you are still having trouble with .rmd, let us know and we will help you, but we are going to start requiring this from here on out.
library(TH.data)
library(GGally)
library(ggplot2)
library(mgcv)
library(dplyr)
library(tidyr)
library(mboost)
library(gamair)
set.seed(12345)
Consider the body fat data introduced in Chapter 9 (bodyfat data from TH.data package).
Explore the data graphically. What variables do you think need to be included for predicting bodyfat? (Hint: Are there correlated predictors - ggpairs()).
First we will us the ggpairs correlation matrix plot to examine the data. There are several highly correlated variables that could be removed from the data with correlation coefficients above .95. These variables include: anthro3a to anthro3b & anthro4, anthro3b to anthroc & anthro4, and anthroc to anthro4. We will create a matrix and remove the variables with high correlation. The printout of the kept and removed variables is shown below. Additionally, hipcirc is correlated to waist circ by the correlation coefficient and conceptually. However, the variables used the the part b GAM model include both which was the determining factor in raising the maximum coefficient threshold to 0.95. We will also examine the variables from upper and lower triangle selection with correlated variables with a coefficient of 0.8 and above removed. These subsets will be used later in this problem for model comparisons.
a <- bodyfat[,!names(bodyfat) %in% nms]
names(a)
## [1] "anthro3b" "anthro4"
names(body.lower)
## [1] "age" "waistcirc" "elbowbreadth" "kneebreadth"
## [5] "anthro3a" "DEXfat"
names(body.upper)
## [1] "age" "hipcirc" "elbowbreadth" "kneebreadth"
## [5] "anthro4" "DEXfat"
Fit a generalised additive model assuming normal errors using function gam- the following code.
bodyfat_gam <- gam(DEXfat~ s(age) + s(waistcirc) + s(hipcirc) + s(elbowbreadth) + s(kneebreadth)+ s(anthro3a) + s(anthro3c), data = bodyfat)
Answer - The summary shows that age, elbowbreadth, and anthro3c are not significant at the 0.05 level. The variables age, waistcirc, elbowbreadth and anthro3a all have linear relationships as shown in the plots and the summary of the model by the dgrees of freedom all of which indicate a degree of freedom of one. The hipcirc variable as shown by the plot and by the degrees of freedom nearing 2, can be explained by a second order smooth function, kneebreadth by an eight order smooth function, and anthro3c by a seventh order smooth function. The model has a moderate GCV and high AIC which could possibily be adjusted by variable selection and using smoothing functions on the variables aformentioned. The degrees of freedom being close to 21 indicates that we need to impose smoothing terms as well. The degrees of freedom should equal the number of smoothing terms to ensure an optimum model. The adjusted \(R^2\) does indicate that the model explains most of the variance, but as stated previously the model can improve.
bodyfat_gam <- gam(DEXfat~ s(age) + s(waistcirc) + s(hipcirc) + s(elbowbreadth) + s(kneebreadth)+ s(anthro3a) + s(anthro3c), data = body.fat)
summary(bodyfat_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## DEXfat ~ s(age) + s(waistcirc) + s(hipcirc) + s(elbowbreadth) +
## s(kneebreadth) + s(anthro3a) + s(anthro3c)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.7828 0.2822 109.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 1.0000 1.0000 1.039 0.313017
## s(waistcirc) 0.9999 0.9999 12.122 0.001031 **
## s(hipcirc) 1.7772 2.2361 10.542 9.24e-05 ***
## s(elbowbreadth) 1.0000 1.0000 0.001 0.972029
## s(kneebreadth) 7.8804 8.0317 7.589 9.16e-07 ***
## s(anthro3a) 0.9999 0.9999 14.040 0.000455 ***
## s(anthro3c) 7.0638 8.0350 1.895 0.082240 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 63/64
## R-sq.(adj) = 0.954 Deviance explained = 96.7%
## GCV = 8.1472 Scale est. = 5.6547 n = 71
par(mfrow=c(2,4))
plot(bodyfat_gam,select=1)
plot(bodyfat_gam,select=2)
plot(bodyfat_gam,select=3)
plot(bodyfat_gam,select=4)
plot(bodyfat_gam,select=5)
plot(bodyfat_gam,select=6)
plot(bodyfat_gam,select=7)
bodyfat_gam$gcv.ubre
## GCV.Cp
## 8.147249
AIC(bodyfat_gam)
## [1] 344.0106
summary(bodyfat_gam)$r.sq
## [1] 0.9536278
sum(summary(bodyfat_gam)$edf)
## [1] 20.72114
Now remove insignificant variables and remove smoothing for some variables. Report summary, plot, GCV, AIC, adj-R2. (Fit the following model as well as another one you come up with on your own, justifying the variables and smoothing you use).
bodyfat_gam2 <- gam(DEXfat~ waistcirc + s(hipcirc) + s(kneebreadth)+ anthro3a + s(anthro3c), data = bodyfat)
bodyfat_gam2 <- gam(DEXfat~ waistcirc + s(hipcirc) + s(kneebreadth)+ anthro3a + s(anthro3c), data = bodyfat)
summary(bodyfat_gam2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## DEXfat ~ waistcirc + s(hipcirc) + s(kneebreadth) + anthro3a +
## s(anthro3c)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.19588 7.12570 -1.852 0.069897 .
## waistcirc 0.19654 0.05425 3.623 0.000676 ***
## anthro3a 6.92774 1.63128 4.247 9.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(hipcirc) 1.610 2.010 10.910 0.000103 ***
## s(kneebreadth) 8.793 8.970 6.780 2.48e-06 ***
## s(anthro3c) 7.117 8.103 2.126 0.048737 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.954 Deviance explained = 96.7%
## GCV = 7.9464 Scale est. = 5.6498 n = 71
gam2.gcv <- round(bodyfat_gam2$gcv.ubre,3)
gam2.AIC <- round(bodyfat_gam2$aic,3)
gam2.rsq <- round(summary(bodyfat_gam2)$r.sq,3)
gam2.df <- round(sum(summary(bodyfat_gam2)$edf),3)
gam2.stat <- c( gam2.gcv,gam2.AIC,gam2.rsq,gam2.df)
par(mfrow=c(1,3))
plot(bodyfat_gam2,select=1)
plot(bodyfat_gam2,select=2)
plot(bodyfat_gam2,select=3)
In this model we will use the triangle selecion from the first part of the problem where we removed correlated variables based upon the upper triange. The waistcirc, kneebreadth, and anthro3a variables are all given smoothing functions as determined by our previous analysis.
bodyfat_gam3 <- gam(DEXfat~ s(waistcirc) + elbowbreadth + s(kneebreadth)+ s(anthro3a), data = body.lower)
summary(bodyfat_gam3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## DEXfat ~ s(waistcirc) + elbowbreadth + s(kneebreadth) + s(anthro3a)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6149 6.0356 6.564 1.57e-08 ***
## elbowbreadth -1.3570 0.9257 -1.466 0.148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(waistcirc) 1.000 1.000 66.688 3.10e-12 ***
## s(kneebreadth) 8.884 8.995 7.389 9.25e-08 ***
## s(anthro3a) 1.000 1.000 53.600 2.24e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.927 Deviance explained = 93.9%
## GCV = 10.937 Scale est. = 8.9522 n = 71
gam3.gcv <- round(bodyfat_gam3$gcv.ubre,3)
gam3.AIC <- round(bodyfat_gam3$aic,3)
gam3.rsq <- round(summary(bodyfat_gam3)$r.sq,3)
gam3.df <- round(sum(summary(bodyfat_gam3)$edf),3)
gam3.stat <- c(gam3.gcv,gam3.AIC,gam3.rsq,gam3.df)
par(mfrow=c(1,3))
plot(bodyfat_gam3,select=1)
plot(bodyfat_gam3,select=2)
plot(bodyfat_gam3,select=3)
In this model we will use the lower triangle selection variables from our correlation matrix. The variables hipcirc, kneebreadth, and anthro4 will be used with smoothing parameters.
bodyfat_gam4 <- gam(DEXfat~ s(hipcirc) + s(kneebreadth)+ s(anthro4), data = body.upper)
summary(bodyfat_gam4)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## DEXfat ~ s(hipcirc) + s(kneebreadth) + s(anthro4)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.7828 0.3405 90.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(hipcirc) 1.779 2.237 26.298 8.75e-10 ***
## s(kneebreadth) 8.711 8.963 4.683 5.73e-05 ***
## s(anthro4) 1.839 2.304 32.181 1.48e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.932 Deviance explained = 94.4%
## GCV = 10.137 Scale est. = 8.2338 n = 71
gam4.gcv <- round(bodyfat_gam4$gcv.ubre,3)
gam4.AIC <- round(bodyfat_gam4$aic,3)
gam4.rsq <- round(summary(bodyfat_gam)$r.sq,3)
gam4.df <- round(sum(summary(bodyfat_gam4)$edf),3)
gam4.stat <- c(gam4.gcv,gam4.AIC,gam4.rsq,gam4.df)
par(mfrow=c(1,3))
plot(bodyfat_gam4,select=1)
plot(bodyfat_gam4,select=2)
plot(bodyfat_gam4,select=3)
#plot(bodyfat_gam4,select=4)
Answer - Judging by the below table it is apparent that keeping the variables that are highly correlated (waistcirc & hipcirc with anthro3c) reduces the GCV score by 2, lowers the AIC by 20, has relatively the same adjusted \(R^2\) value and has a higher degree of freedom than the upper triangle model. Thus we should keep these correlated variables rather than removing them from the model.
gams.table <- data.frame('Model.1' = gam2.stat,'Lower.Triangle'=gam3.stat,'Upper.Triangle'=gam4.stat)
row.names(gams.table) <- c('GCV','AIC','Adj.Rsq','DF')
gams.table
## Model.1 Lower.Triangle Upper.Triangle
## GCV 7.946 10.937 10.137
## AIC 343.256 370.665 365.070
## Adj.Rsq 0.954 0.927 0.954
## DF 17.520 10.884 12.330
Again fit an additive model to the body fat data, but this time for a log-transformed response. Compare the four models, which one is more appropriate? (Hint: use AIC, Adj-R2, residual plots, etc. to compare models).
Answer: The log transformation of the response variable greatly reduces the GCV and AIC of the model. It deos not do any better at explainng variance but the covariates are noticably more smooth than previous models. Interestingly, the kneebreadth variable is no longer significant at the 0.05 level.
bodyfat_gam5 <- gam(log(DEXfat)~ waistcirc + s(hipcirc) + s(kneebreadth)+ anthro3a + s(anthro3c), data = bodyfat)
summary(bodyfat_gam5)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(DEXfat) ~ waistcirc + s(hipcirc) + s(kneebreadth) + anthro3a +
## s(anthro3c)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.139779 0.237083 9.025 1.8e-12 ***
## waistcirc 0.004418 0.001806 2.447 0.017610 *
## anthro3a 0.215488 0.054600 3.947 0.000226 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(hipcirc) 2.909 3.616 11.828 8.8e-07 ***
## s(kneebreadth) 2.325 2.962 2.027 0.128320
## s(anthro3c) 7.358 8.263 4.678 0.000144 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.952 Deviance explained = 96.2%
## GCV = 0.0088137 Scale est. = 0.006878 n = 71
gam5.gcv <- round(bodyfat_gam5$gcv.ubre,3)
gam5.AIC <- round(bodyfat_gam5$aic,3)
gam5.rsq <- round(summary(bodyfat_gam5)$r.sq,3)
gam5.df <- round(sum(summary(bodyfat_gam5)$edf),3)
gam5.stat <- c( gam5.gcv,gam5.AIC,gam5.rsq,gam5.df)
par(mfrow=c(1,3))
plot(bodyfat_gam5,select=1)
plot(bodyfat_gam5,select=2)
plot(bodyfat_gam5,select=3)
gams.table <- cbind(gams.table,'Log.Trans'=gam5.stat)
gams.table
## Model.1 Lower.Triangle Upper.Triangle Log.Trans
## GCV 7.946 10.937 10.137 0.009
## AIC 343.256 370.665 365.070 -136.470
## Adj.Rsq 0.954 0.927 0.954 0.952
## DF 17.520 10.884 12.330 12.593
Fit generalised additive model that underwent AIC-based variable selection (fitted using function gamboost()). What variable was removed by using AIC? bodyfat_boost <- gamboost(DEXfat~., data = bodyfat) bodyfat_aic <- AIC(bodyfat_boost) bf_gam <- bodyfat_boost[mstop(bodyfat_aic)]
Answer: The only variable removed with the GAM Gradient Boost function is age, which was removed on all previous models. Although the AIC has drastically been reduced it is very hard to interpret the results. Judging by the output of the plots, all of the variables have been smoothed to the point of appearing linear aside from elbowbreadth.
bodyfat_boost <- gamboost(DEXfat~., data = bodyfat)
bodyfat_aic <- AIC(bodyfat_boost)
bf_gam <- bodyfat_boost[mstop(bodyfat_aic)]
summary(bodyfat_boost)
##
## Model-based Boosting
##
## Call:
## gamboost(formula = DEXfat ~ ., data = bodyfat)
##
##
## Squared Error (Regression)
##
## Loss function: (y - f)^2
##
##
## Number of boosting iterations: mstop = 51
## Step size: 0.1
## Offset: 30.78282
## Number of baselearners: 9
##
## Selection frequencies:
## bbs(kneebreadth, df = dfbase) bbs(anthro3b, df = dfbase)
## 0.35294118 0.17647059
## bbs(hipcirc, df = dfbase) bbs(anthro3a, df = dfbase)
## 0.13725490 0.11764706
## bbs(anthro3c, df = dfbase) bbs(waistcirc, df = dfbase)
## 0.09803922 0.07843137
## bbs(elbowbreadth, df = dfbase) bbs(anthro4, df = dfbase)
## 0.01960784 0.01960784
extract(bf_gam,what='variable.names')
## bbs(waistcirc, df = dfbase) bbs(hipcirc, df = dfbase)
## "waistcirc" "hipcirc"
## bbs(elbowbreadth, df = dfbase) bbs(kneebreadth, df = dfbase)
## "elbowbreadth" "kneebreadth"
## bbs(anthro3a, df = dfbase) bbs(anthro3b, df = dfbase)
## "anthro3a" "anthro3b"
## bbs(anthro3c, df = dfbase) bbs(anthro4, df = dfbase)
## "anthro3c" "anthro4"
bodyfat_aic
## [1] 3.268173
## Optimal number of boosting iterations: 51
## Degrees of freedom (for mstop = 51): 7.637287
par(mfrow=c(2,4))
plot(bodyfat_boost,ask=F)
Fit a logistic additive model to the glaucoma data. (Here use family = “binomial”). Which covariates should enter the model and what is their influence on the probability of suffering from glaucoma? (Hint: since there are many covariates use gamboost() to fit the GAM model.)
Answer: Below the selection frequencies show the variables used in the model and their influence on the probability of suffering from glaucoma. Using 10 fold cross validation we also see the prediction error of this model to be about 0.1452 which is higher than the recursive partitioning functions used in last week’s assignment.
data(GlaucomaM)
glauc_boost <- gamboost(Class~., data = GlaucomaM,family=Binomial())
summary(glauc_boost)
##
## Model-based Boosting
##
## Call:
## gamboost(formula = Class ~ ., data = GlaucomaM, family = Binomial())
##
##
## Negative Binomial Likelihood (logit link)
##
## Loss function: {
## f <- pmin(abs(f), 36) * sign(f)
## p <- exp(f)/(exp(f) + exp(-f))
## y <- (y + 1)/2
## -y * log(p) - (1 - y) * log(1 - p)
## }
##
##
## Number of boosting iterations: mstop = 100
## Step size: 0.1
## Offset: 0
## Number of baselearners: 62
##
## Selection frequencies:
## bbs(tmi, df = dfbase) bbs(mhcg, df = dfbase) bbs(vars, df = dfbase)
## 0.17 0.11 0.11
## bbs(mhci, df = dfbase) bbs(hvc, df = dfbase) bbs(vass, df = dfbase)
## 0.10 0.08 0.08
## bbs(as, df = dfbase) bbs(vari, df = dfbase) bbs(mv, df = dfbase)
## 0.07 0.06 0.04
## bbs(abrs, df = dfbase) bbs(mhcn, df = dfbase) bbs(phcn, df = dfbase)
## 0.03 0.03 0.03
## bbs(mdn, df = dfbase) bbs(phci, df = dfbase) bbs(hic, df = dfbase)
## 0.03 0.02 0.01
## bbs(phcg, df = dfbase) bbs(mdi, df = dfbase) bbs(tms, df = dfbase)
## 0.01 0.01 0.01
n_folds <- 10
folds_i <- sample(rep(1:n_folds, length.out = length(GlaucomaM)))
cv_error <- matrix(NA, nrow = n_folds, ncol = length(GlaucomaM))
for (k in 1:n_folds) {
test_i <- which(folds_i == k)
train <- GlaucomaM[-test_i, ]
test <- GlaucomaM[test_i, ]
glauc.boost <- gamboost(Class~., data = train,family=Binomial())
pred <- predict(glauc.boost,test,type='response')
conf <- as.data.frame(table("actual"=test$Class,"predicted"=pred>0.5))
cv_error[k, ] <- round(1 - (sum(conf[conf$predicted==TRUE,]$Freq)) / nrow(test), 3)
}
glauc.boost.error <- colMeans(cv_error)[1]
glauc.boost.error
## [1] 0.1452
Investigate the use of different types of scatterplot smoothers on the Hubble data from Chapter 6. (Hint: Look at chapter 10.3.1 (Olympic 1500m Times) in the Handbook).
Answer: The code is straight forward and there is not much to interpret. It is of note that to properly plot the predicted values of the GAM and quadratic model you have to order the values of x and the predicted values of y to generate a clean plot.
data(hubble)
hub_lm <- lm(y~x,data=hubble)
hub_lm2 <- lm(y~x+I(x^2),data=hubble)
hub_gam <- gam(y~s(x,bs='cr'),data=hubble)
hubble_lowess <- lowess(hubble$x,hubble$y)
lm2.pred <- predict(hub_lm2)
gam.pred <- predict(hub_gam)
plot(y ~ x, data = hubble, main = 'Scatterplot of Hubble X Y Data to a Linear Model')
abline(hub_lm)
plot(y ~ x, data = hubble,main='Scatterplot of Hubble X Y Data to a Loess Smooth Function')
lines(hubble_lowess, lty = 2)
plot(y ~ x, data = hubble, main = 'Scatterplot of Hubble X Y Data to a Quadratic Model')
lines(hubble$x[order(hubble$x)], lm2.pred[order(lm2.pred)])
plot(y ~ x, data = hubble, main = 'Scatterplot of Hubble X Y Data to a GAM Model')
lines(hubble$x[order(hubble$x)], gam.pred[order(gam.pred)])