Generalized Additive Models and Spline Models

The problems on this assignment require you to use GAM and boosted GAM models to predict a response variable from a given set of covariates. Answer all questions specified in the problem, but if you see something interesting and want to do more analysis, please report it as well. Don’t forget to include discussion and to make your plots look nice and presentable. Submit your .rmd file with the knitted PDF (or knitted Word document saved as a PDF). If you are still having trouble with .rmd, let us know and we will help you, but we are going to start requiring this from here on out.

library(TH.data)
library(GGally)
library(ggplot2)
library(mgcv)
library(dplyr)
library(tidyr)
library(mboost)
library(gamair)
set.seed(12345)

Question 1

Consider the body fat data introduced in Chapter 9 (bodyfat data from TH.data package).

Part a.

Explore the data graphically. What variables do you think need to be included for predicting bodyfat? (Hint: Are there correlated predictors - ggpairs()).

First we will use the ggpairs() correlation matrix plot to examine the data. There are several highly correlated variables, with correlation coefficients above 0.95, that could be removed from the data. These pairs include: anthro3a with anthro3b and anthro4, anthro3b with anthro3c and anthro4, and anthro3c with anthro4. We will create a correlation matrix and remove the variables with high correlation; the printout of the kept and removed variables is shown below. Additionally, hipcirc is correlated with waistcirc, both by the correlation coefficient and conceptually. However, the variables used in the part b GAM model include both, which was the determining factor in raising the maximum correlation threshold to 0.95. We will also examine the variables produced by upper- and lower-triangle selection, with correlated variables with a coefficient of 0.8 and above removed. These subsets will be used later in this problem for model comparisons.

Analysis of Highly Correlated Variables

data(bodyfat)

ggpairs(bodyfat)

#removal of highly correlated variables
#create an object class containing the DEXfat variable and remove it from the data set
DEXfat <- bodyfat$DEXfat
fat <- bodyfat[,!names(bodyfat) %in% 'DEXfat']


#create a correlation matrix of the data set; zero out the lower triangle and the diagonal so each correlated pair is considered only once
tmp <- cor(fat)
tmp[lower.tri(tmp)] <- 0
diag(tmp) <- 0

#remove any variables with an absolute correlation of 0.95 or above (the threshold chosen so the part b model variables are retained) and add the DEXfat variable back
body.fat <- fat[,!apply(tmp,2,function(x) any(abs(x) > 0.95))]
body.fat <- cbind(body.fat,DEXfat)

#lower-triangle removal: zero the lower triangle, then drop variables with an absolute correlation of 0.8 or above
tmp2 <- cor(fat)
tmp2[lower.tri(tmp2)] <- 0
diag(tmp2) <- 0
body.lower <- fat[,!apply(tmp2,2,function(x) any(abs(x) > 0.8))]
body.lower <- cbind(body.lower,DEXfat)

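#upper-triangle removal: zero the upper triangle, then drop variables with an absolute correlation of 0.8 or above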
tmp3 <- cor(fat)
tmp3[upper.tri(tmp3)] <- 0
diag(tmp3) <- 0
body.upper <- fat[,!apply(tmp3,2,function(x) any(abs(x) > 0.8))]
body.upper <- cbind(body.upper,DEXfat)
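
Checking the Pairs With Correlations Above 0.95

As a quick check of the pairs listed above (not part of the original output), the following sketch prints every variable pair whose absolute correlation exceeds 0.95, using the matrix tmp in which each pair appears only once.

#list the variable pairs with an absolute correlation above 0.95
high <- which(abs(tmp) > 0.95, arr.ind = TRUE)
data.frame(var1 = rownames(tmp)[high[, 1]],
           var2 = colnames(tmp)[high[, 2]],
           r = round(tmp[high], 3))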

Examining the Variables Kept After Removing Highly Correlated Variables

nms <-names(body.fat)

nms
## [1] "age"          "waistcirc"    "hipcirc"      "elbowbreadth"
## [5] "kneebreadth"  "anthro3a"     "anthro3c"     "DEXfat"

Variables Removed from Original Data

a <- bodyfat[,!names(bodyfat) %in% nms]
names(a)
## [1] "anthro3b" "anthro4"

Variables for Lower Triangle Removal & Correlation Removal Above 0.8

names(body.lower)
## [1] "age"          "waistcirc"    "elbowbreadth" "kneebreadth" 
## [5] "anthro3a"     "DEXfat"

Variables for Upper Triangle Removal & Correlation Removal Above 0.8

names(body.upper)
## [1] "age"          "hipcirc"      "elbowbreadth" "kneebreadth" 
## [5] "anthro4"      "DEXfat"

Part b.

Fit a generalised additive model assuming normal errors using the gam() function and the following code.

bodyfat_gam <- gam(DEXfat~ s(age) + s(waistcirc) + s(hipcirc) + s(elbowbreadth) + s(kneebreadth)+ s(anthro3a) + s(anthro3c), data = bodyfat)

  • Assess the summary() and plot() of the model. Are all covariates informative? Should all covariates be smoothed or should some be included as a linear effect?
  • Report GCV (minimized generalized cross-validation score), AIC, adj-R2, and total model degrees of freedom.
  • Use the gam.check() function to look at the diagnostic plots (a minimal call is sketched after the summary statistics below). Does it appear that the normality assumption is violated?
  • Write a discussion covering, but not limited to, the above points.

Answer - The summary shows that age, elbowbreadth, and anthro3c are not significant at the 0.05 level. The variables age, waistcirc, elbowbreadth, and anthro3a all appear to have linear relationships, as shown in the plots and by their effective degrees of freedom in the model summary, all of which are approximately one. The hipcirc variable, with effective degrees of freedom near 2, is explained by a mildly nonlinear smooth, while kneebreadth uses roughly 8 and anthro3c roughly 7 effective degrees of freedom. The model has a moderate GCV and a high AIC, which could possibly be improved by variable selection and by smoothing only the variables that show nonlinear effects. The total effective degrees of freedom of nearly 21 also indicates that the fit is more complex than it needs to be; terms whose effective degrees of freedom are close to one are better entered as linear effects rather than smooths. The adjusted \(R^2\) does indicate that the model explains most of the variance, but as stated previously the model can be improved.

Fit the Model & Print the Summary

bodyfat_gam <- gam(DEXfat~ s(age) + s(waistcirc) + s(hipcirc) + s(elbowbreadth) + s(kneebreadth)+ s(anthro3a) + s(anthro3c), data = body.fat)
summary(bodyfat_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DEXfat ~ s(age) + s(waistcirc) + s(hipcirc) + s(elbowbreadth) + 
##     s(kneebreadth) + s(anthro3a) + s(anthro3c)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.7828     0.2822   109.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df      F  p-value    
## s(age)          1.0000 1.0000  1.039 0.313017    
## s(waistcirc)    0.9999 0.9999 12.122 0.001031 ** 
## s(hipcirc)      1.7772 2.2361 10.542 9.24e-05 ***
## s(elbowbreadth) 1.0000 1.0000  0.001 0.972029    
## s(kneebreadth)  7.8804 8.0317  7.589 9.16e-07 ***
## s(anthro3a)     0.9999 0.9999 14.040 0.000455 ***
## s(anthro3c)     7.0638 8.0350  1.895 0.082240 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 63/64
## R-sq.(adj) =  0.954   Deviance explained = 96.7%
## GCV = 8.1472  Scale est. = 5.6547    n = 71

Plot the Model

par(mfrow=c(2,4))
plot(bodyfat_gam,select=1)
plot(bodyfat_gam,select=2)
plot(bodyfat_gam,select=3)
plot(bodyfat_gam,select=4)
plot(bodyfat_gam,select=5)
plot(bodyfat_gam,select=6)
plot(bodyfat_gam,select=7)

Minimized Generalized Cross-Validation Score

bodyfat_gam$gcv.ubre
##   GCV.Cp 
## 8.147249

AIC

AIC(bodyfat_gam)
## [1] 344.0106

Adjusted \(R^2\)

summary(bodyfat_gam)$r.sq
## [1] 0.9536278

Total Model Degrees of Freedom

sum(summary(bodyfat_gam)$edf)
## [1] 20.72114
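
Diagnostic Plots With gam.check()

The assignment also asks for gam.check() diagnostics, which are not included in the chunks above. A minimal call is sketched here: it produces the QQ plot of residuals, residuals against the linear predictor, a residual histogram, and response against fitted values, which together address the normality assumption.

#diagnostic plots and basis-dimension checks for the fitted GAM
par(mfrow=c(2,2))
gam.check(bodyfat_gam)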

Part c

Now remove insignificant variables and remove smoothing for some variables. Report summary, plot, GCV, AIC, adj-R2. (Fit the following model as well as another one you come up with on your own, justifying the variables and smoothing you use).

bodyfat_gam2 <- gam(DEXfat~ waistcirc + s(hipcirc) + s(kneebreadth)+ anthro3a + s(anthro3c), data = bodyfat)

Create the First Model and Print the Summary

bodyfat_gam2 <- gam(DEXfat~ waistcirc + s(hipcirc) + s(kneebreadth)+ anthro3a + s(anthro3c), data = bodyfat)
summary(bodyfat_gam2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DEXfat ~ waistcirc + s(hipcirc) + s(kneebreadth) + anthro3a + 
##     s(anthro3c)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.19588    7.12570  -1.852 0.069897 .  
## waistcirc     0.19654    0.05425   3.623 0.000676 ***
## anthro3a      6.92774    1.63128   4.247 9.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df      F  p-value    
## s(hipcirc)     1.610  2.010 10.910 0.000103 ***
## s(kneebreadth) 8.793  8.970  6.780 2.48e-06 ***
## s(anthro3c)    7.117  8.103  2.126 0.048737 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.954   Deviance explained = 96.7%
## GCV = 7.9464  Scale est. = 5.6498    n = 71
gam2.gcv <- round(bodyfat_gam2$gcv.ubre,3)
gam2.AIC <- round(bodyfat_gam2$aic,3)
gam2.rsq <- round(summary(bodyfat_gam2)$r.sq,3)
gam2.df <- round(sum(summary(bodyfat_gam2)$edf),3)
gam2.stat <- c( gam2.gcv,gam2.AIC,gam2.rsq,gam2.df)

Plot the Model

par(mfrow=c(1,3))
plot(bodyfat_gam2,select=1)
plot(bodyfat_gam2,select=2)
plot(bodyfat_gam2,select=3)

Create a Second Model With Lower Triangle Selection

In this model we will use the lower-triangle selection from the first part of the problem, where we removed correlated variables after zeroing the lower triangle of the correlation matrix. The waistcirc, kneebreadth, and anthro3a variables are given smoothing functions, as determined by our previous analysis, while elbowbreadth enters as a linear effect.

bodyfat_gam3 <- gam(DEXfat~ s(waistcirc) + elbowbreadth + s(kneebreadth)+ s(anthro3a), data = body.lower)
summary(bodyfat_gam3)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DEXfat ~ s(waistcirc) + elbowbreadth + s(kneebreadth) + s(anthro3a)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   39.6149     6.0356   6.564 1.57e-08 ***
## elbowbreadth  -1.3570     0.9257  -1.466    0.148    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df      F  p-value    
## s(waistcirc)   1.000  1.000 66.688 3.10e-12 ***
## s(kneebreadth) 8.884  8.995  7.389 9.25e-08 ***
## s(anthro3a)    1.000  1.000 53.600 2.24e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.927   Deviance explained = 93.9%
## GCV = 10.937  Scale est. = 8.9522    n = 71
gam3.gcv <- round(bodyfat_gam3$gcv.ubre,3)
gam3.AIC <- round(bodyfat_gam3$aic,3)
gam3.rsq <- round(summary(bodyfat_gam3)$r.sq,3)
gam3.df <- round(sum(summary(bodyfat_gam3)$edf),3)
gam3.stat <- c(gam3.gcv,gam3.AIC,gam3.rsq,gam3.df)

Plot the Model

par(mfrow=c(1,3))
plot(bodyfat_gam3,select=1)
plot(bodyfat_gam3,select=2)
plot(bodyfat_gam3,select=3)

Create a Third Model With Upper Triangle Selection

In this model we will use the upper-triangle selection variables from our correlation matrix. The variables hipcirc, kneebreadth, and anthro4 will be used with smoothing functions.

bodyfat_gam4 <- gam(DEXfat~ s(hipcirc) + s(kneebreadth)+ s(anthro4), data = body.upper)
summary(bodyfat_gam4)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DEXfat ~ s(hipcirc) + s(kneebreadth) + s(anthro4)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.7828     0.3405   90.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df      F  p-value    
## s(hipcirc)     1.779  2.237 26.298 8.75e-10 ***
## s(kneebreadth) 8.711  8.963  4.683 5.73e-05 ***
## s(anthro4)     1.839  2.304 32.181 1.48e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.932   Deviance explained = 94.4%
## GCV = 10.137  Scale est. = 8.2338    n = 71
gam4.gcv <- round(bodyfat_gam4$gcv.ubre,3)
gam4.AIC <- round(bodyfat_gam4$aic,3)
gam4.rsq <- round(summary(bodyfat_gam4)$r.sq,3)
gam4.df <- round(sum(summary(bodyfat_gam4)$edf),3)
gam4.stat <- c(gam4.gcv,gam4.AIC,gam4.rsq,gam4.df)

Plot the Model

par(mfrow=c(1,3))
plot(bodyfat_gam4,select=1)
plot(bodyfat_gam4,select=2)
plot(bodyfat_gam4,select=3)

Table Of GAM Statistics

Answer - Judging by the table below, keeping the highly correlated variables (waistcirc and hipcirc together with anthro3c) reduces the GCV by roughly 2 and the AIC by roughly 20 compared with the upper-triangle model, gives a slightly higher adjusted \(R^2\), and uses more effective degrees of freedom. Thus we should keep these correlated variables rather than removing them from the model.

gams.table <- data.frame('Model.1' = gam2.stat,'Lower.Triangle'=gam3.stat,'Upper.Triangle'=gam4.stat)
row.names(gams.table) <- c('GCV','AIC','Adj.Rsq','DF')
gams.table
##         Model.1 Lower.Triangle Upper.Triangle
## GCV       7.946         10.937         10.137
## AIC     343.256        370.665        365.070
## Adj.Rsq   0.954          0.927          0.932
## DF       17.520         10.884         12.330

Part d

Again fit an additive model to the body fat data, but this time for a log-transformed response. Compare the four models, which one is more appropriate? (Hint: use AIC, Adj-R2, residual plots, etc. to compare models).

Answer: The log transformation of the response variable produces a much smaller GCV and AIC, but both are now computed on the log scale and so are not directly comparable to the models fitted to the raw response. The model does not do any better at explaining variance, but the estimated smooths are noticeably smoother than in the previous models. Interestingly, the kneebreadth smooth is no longer significant at the 0.05 level.

Create the Model With a Log-Transformed Response

bodyfat_gam5 <- gam(log(DEXfat)~ waistcirc + s(hipcirc) + s(kneebreadth)+ anthro3a + s(anthro3c), data = bodyfat)
summary(bodyfat_gam5)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(DEXfat) ~ waistcirc + s(hipcirc) + s(kneebreadth) + anthro3a + 
##     s(anthro3c)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.139779   0.237083   9.025  1.8e-12 ***
## waistcirc   0.004418   0.001806   2.447 0.017610 *  
## anthro3a    0.215488   0.054600   3.947 0.000226 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df      F  p-value    
## s(hipcirc)     2.909  3.616 11.828  8.8e-07 ***
## s(kneebreadth) 2.325  2.962  2.027 0.128320    
## s(anthro3c)    7.358  8.263  4.678 0.000144 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.952   Deviance explained = 96.2%
## GCV = 0.0088137  Scale est. = 0.006878  n = 71
gam5.gcv <- round(bodyfat_gam5$gcv.ubre,3)
gam5.AIC <- round(bodyfat_gam5$aic,3)
gam5.rsq <- round(summary(bodyfat_gam5)$r.sq,3)
gam5.df <- round(sum(summary(bodyfat_gam5)$edf),3)
gam5.stat <- c( gam5.gcv,gam5.AIC,gam5.rsq,gam5.df)

Plot the Model

par(mfrow=c(1,3))
plot(bodyfat_gam5,select=1)
plot(bodyfat_gam5,select=2)
plot(bodyfat_gam5,select=3)
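
Comparing the Log-Transformed Model With the Earlier Models

For a side-by-side comparison (not part of the original output), the part c table can be extended with the log-response fit; this is a minimal sketch, and the column name Log.Response is introduced here only for illustration. Since its GCV and AIC are on the log scale, the adjusted R-squared and the residual plots are the more comparable quantities.

#append the log-response model's statistics to the comparison table from part c
gams.table$Log.Response <- gam5.stat
gams.table

#residuals against fitted values for the raw-response and log-response fits
par(mfrow=c(1,2))
plot(fitted(bodyfat_gam2), residuals(bodyfat_gam2),
     xlab='Fitted values', ylab='Residuals', main='DEXfat response')
abline(h=0, lty=2)
plot(fitted(bodyfat_gam5), residuals(bodyfat_gam5),
     xlab='Fitted values', ylab='Residuals', main='log(DEXfat) response')
abline(h=0, lty=2)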

Part e

Fit a generalised additive model that underwent AIC-based variable selection (fitted using the function gamboost()). What variable was removed by using AIC?

bodyfat_boost <- gamboost(DEXfat~., data = bodyfat)
bodyfat_aic <- AIC(bodyfat_boost)
bf_gam <- bodyfat_boost[mstop(bodyfat_aic)]

Answer: The only variable removed by the boosted GAM with AIC-based stopping is age, which was also dropped or found insignificant in the previous models. Although the AIC is drastically reduced, the boosted model is harder to interpret. Judging by the output of the plots, all of the variables have been smoothed to the point of appearing nearly linear, aside from elbowbreadth.

Use the gamboost Function to Generate a Generalized Additive Model With the Lowest AIC - Print the Summary

bodyfat_boost <- gamboost(DEXfat~., data = bodyfat) 
bodyfat_aic <- AIC(bodyfat_boost)
bf_gam <- bodyfat_boost[mstop(bodyfat_aic)]

summary(bodyfat_boost)
## 
##   Model-based Boosting
## 
## Call:
## gamboost(formula = DEXfat ~ ., data = bodyfat)
## 
## 
##   Squared Error (Regression) 
## 
## Loss function: (y - f)^2 
##  
## 
## Number of boosting iterations: mstop = 51 
## Step size:  0.1 
## Offset:  30.78282 
## Number of baselearners:  9 
## 
## Selection frequencies:
##  bbs(kneebreadth, df = dfbase)     bbs(anthro3b, df = dfbase) 
##                     0.35294118                     0.17647059 
##      bbs(hipcirc, df = dfbase)     bbs(anthro3a, df = dfbase) 
##                     0.13725490                     0.11764706 
##     bbs(anthro3c, df = dfbase)    bbs(waistcirc, df = dfbase) 
##                     0.09803922                     0.07843137 
## bbs(elbowbreadth, df = dfbase)      bbs(anthro4, df = dfbase) 
##                     0.01960784                     0.01960784

Extract the Variable Names From the Gradient Boosted GAM

extract(bf_gam,what='variable.names')
##    bbs(waistcirc, df = dfbase)      bbs(hipcirc, df = dfbase) 
##                    "waistcirc"                      "hipcirc" 
## bbs(elbowbreadth, df = dfbase)  bbs(kneebreadth, df = dfbase) 
##                 "elbowbreadth"                  "kneebreadth" 
##     bbs(anthro3a, df = dfbase)     bbs(anthro3b, df = dfbase) 
##                     "anthro3a"                     "anthro3b" 
##     bbs(anthro3c, df = dfbase)      bbs(anthro4, df = dfbase) 
##                     "anthro3c"                      "anthro4"
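
Report the Corrected AIC and the Optimal Number of Boosting Iterations

As a check on the AIC-based stopping (not shown in the original output), the corrected AIC object and its optimal stopping iteration can be printed directly; the summary above already reflects the reduced number of iterations (mstop = 51).

#print the corrected AIC and the iteration at which it is minimized
bodyfat_aic
mstop(bodyfat_aic)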

Plot the Model

par(mfrow=c(2,4))
plot(bodyfat_boost,ask=F)

Question 2

Fit a logistic additive model to the glaucoma data. (Here use family = “binomial”). Which covariates should enter the model and what is their influence on the probability of suffering from glaucoma? (Hint: since there are many covariates use gamboost() to fit the GAM model.)

Build the Model and Print the Summary

Answer: The selection frequencies below show which variables enter the model and how often each is selected by the boosting algorithm; the corresponding partial effect plots would show the direction of their influence on the probability of suffering from glaucoma. Using 10-fold cross-validation (a sketch is given after the model summary), we estimate the prediction error of this model to be about 0.1452, which is higher than that of the recursive partitioning methods used in last week’s assignment.

data(GlaucomaM)

glauc_boost <- gamboost(Class~., data = GlaucomaM,family=Binomial())

summary(glauc_boost)
## 
##   Model-based Boosting
## 
## Call:
## gamboost(formula = Class ~ ., data = GlaucomaM, family = Binomial())
## 
## 
##   Negative Binomial Likelihood (logit link) 
## 
## Loss function: { 
##      f <- pmin(abs(f), 36) * sign(f) 
##      p <- exp(f)/(exp(f) + exp(-f)) 
##      y <- (y + 1)/2 
##      -y * log(p) - (1 - y) * log(1 - p) 
##  } 
##  
## 
## Number of boosting iterations: mstop = 100 
## Step size:  0.1 
## Offset:  0 
## Number of baselearners:  62 
## 
## Selection frequencies:
##  bbs(tmi, df = dfbase) bbs(mhcg, df = dfbase) bbs(vars, df = dfbase) 
##                   0.17                   0.11                   0.11 
## bbs(mhci, df = dfbase)  bbs(hvc, df = dfbase) bbs(vass, df = dfbase) 
##                   0.10                   0.08                   0.08 
##   bbs(as, df = dfbase) bbs(vari, df = dfbase)   bbs(mv, df = dfbase) 
##                   0.07                   0.06                   0.04 
## bbs(abrs, df = dfbase) bbs(mhcn, df = dfbase) bbs(phcn, df = dfbase) 
##                   0.03                   0.03                   0.03 
##  bbs(mdn, df = dfbase) bbs(phci, df = dfbase)  bbs(hic, df = dfbase) 
##                   0.03                   0.02                   0.01 
## bbs(phcg, df = dfbase)  bbs(mdi, df = dfbase)  bbs(tms, df = dfbase) 
##                   0.01                   0.01                   0.01
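
Estimating the Prediction Error With 10-Fold Cross-Validation

The cross-validation code is not included in the chunk above; the following is a minimal sketch of how a misclassification rate such as the 0.1452 quoted earlier could be estimated. The fold construction and the use of predict(type = 'class') are assumptions, not part of the original analysis.

#manual 10-fold cross-validation of the boosted logistic GAM
set.seed(12345)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(GlaucomaM)))
cv.err <- sapply(1:k, function(i) {
  train <- GlaucomaM[folds != i, ]
  test  <- GlaucomaM[folds == i, ]
  fit <- gamboost(Class ~ ., data = train, family = Binomial())
  pred <- predict(fit, newdata = test, type = 'class')
  mean(as.character(pred) != as.character(test$Class))
})
mean(cv.err)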

Question 3

Investigate the use of different types of scatterplot smoothers on the Hubble data from Chapter 6. (Hint: Look at chapter 10.3.1 (Olympic 1500m Times) in the Handbook).

Answer: The code is straightforward and there is not much to interpret. Note that to plot the fitted values of the quadratic and GAM models cleanly, both the x values and the corresponding fitted values must be ordered by x before drawing the lines.

Create the Smoothing Functions

data(hubble)
hub_lm <- lm(y~x,data=hubble)
hub_lm2 <- lm(y~x+I(x^2),data=hubble)
hub_gam <- gam(y~s(x,bs='cr'),data=hubble)
hubble_lowess <- lowess(hubble$x,hubble$y)
lm2.pred <- predict(hub_lm2)
gam.pred <- predict(hub_gam)

Plot of a Linear Model on the Hubble Data

plot(y ~ x, data = hubble, main = 'Scatterplot of Hubble X Y Data to a Linear Model')
abline(hub_lm)

Plot of a Loess Smooth Function on the Hubble Data

plot(y ~ x, data = hubble,main='Scatterplot of Hubble X Y Data to a Loess Smooth Function')
lines(hubble_lowess, lty = 2)

Plot of a Quadratic Model on the Hubble Data

plot(y ~ x, data = hubble, main = 'Scatterplot of Hubble X Y Data to a Quadratic Model')
lines(hubble$x[order(hubble$x)], lm2.pred[order(hubble$x)])

Plot of a Generalized Additive Model on the Hubble Data

plot(y ~ x, data = hubble, main = 'Scatterplot of Hubble X Y Data to a GAM Model')
lines(hubble$x[order(hubble$x)], gam.pred[order(hubble$x)])
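
Plot of All Four Smoothers on the Hubble Data

As a final view (not part of the original output), the four smoothers can be overlaid on a single scatterplot for easier visual comparison.

#overlay the linear, lowess, quadratic, and GAM fits on one scatterplot
ord <- order(hubble$x)
plot(y ~ x, data = hubble, main = 'Hubble Data: Comparison of Smoothers')
abline(hub_lm, lty = 1)
lines(hubble_lowess, lty = 2)
lines(hubble$x[ord], lm2.pred[ord], lty = 3)
lines(hubble$x[ord], gam.pred[ord], lty = 4)
legend('topleft', legend = c('linear', 'lowess', 'quadratic', 'GAM'), lty = 1:4, bty = 'n')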