Homework 6 - STAT 601

##Kevin Kuipers (Completed byself) ##October 2, 2018

##Problem 1 Consider the body fat data introduced in Chapter 9 ( data from package).

a) Explore the data graphically. 
What variables do you think need to be included for predicting bodyfat? 
(Hint: Are there correlated predictors).

First I will load the needed libraries in order to explore the data. After that I will use a paris plot and ggpairs plot to look at the relatioship between all variables. When fitting a model you do not want the covariates to have a strong relationship between themselves.

It appears there is a positive relationship between DEXfat and many of the covariates. However, elbowbreadth does not seem to have much of a relationship with a correlation coefficient at 0.354. Even though the rest of the variables show a positive relationship for DEXfat it appears that some of the covariates are very strongly correlated amongst themselves. For example anthro3a and anthro4 are highly correlated. This is not conducive to building a good model when some of the covariates have a high correlation amoungst themseleves.

##Problem 1 B)

b) Fit a generalised additive model assuming normal errors using the following code. 


     bodyfat_gam <- gam(DEXfat~ s(age) + s(waistcirc) + s(hipcirc) + 
              s(elbowbreadth) + s(kneebreadth)+ s(anthro3a) +
              s(anthro3c), data = bodyfat)
  
  
    - Assess the \textbf{summary()} and \textbf{plot()} 
    of the model (don't need GGPLOT). 
    Are all covariates informative? Should all covariates be smoothed 
    or should some be included as a linear effect? 
    
    - Report GCV, AIC, adj-R$^2$, and total model degrees of freedom. 
    
    - Use \textbf{gam.check()} function to look at the diagnostic plot. 
    Does it appear that the normality assumption is violated? 
    
    - Write a discussion on all of the above points.

1st Model

I will create the model listed above and output the summary and graphs

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DEXfat ~ s(age) + s(waistcirc) + s(hipcirc) + s(elbowbreadth) + 
##     s(kneebreadth) + s(anthro3a) + s(anthro3c)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.7828     0.2847   108.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df      F  p-value    
## s(age)          1.000  1.000  0.956 0.332964    
## s(waistcirc)    1.000  1.000 10.821 0.001844 ** 
## s(hipcirc)      1.775  2.235  9.917 0.000152 ***
## s(elbowbreadth) 1.000  1.000  0.001 0.972242    
## s(kneebreadth)  8.754  8.960  6.180 3.59e-06 ***
## s(anthro3a)     1.000  1.000 12.966 0.000725 ***
## s(anthro3c)     7.042  8.041  1.798 0.100242    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.953   Deviance explained = 96.7%
## GCV = 8.4354  Scale est. = 5.7538    n = 71

##Summary and Plot Response:

According to the summary of the model not all variables are significant. age & elbowbreadth are not significant at all. antrho3c is not even significant at the 0.05 level. Therefore, I would suggest to remove these 3 covariates from the model. From the plots I would suggest that hipcirc and kneebreadth need smoothing. kneebreadth only needs alittle smoothing with possibly a 2nd degree polynomial. Whereas hipcirc appears to need a possibly a 8th degree polynomial since it contains roughly 8 changes. anthro3c possibly needs smoothing with a 6th degree polynomial. The rest of the variables appear to have a linear affect.

##GCV, AIC, adj-R\(^2\), and total model degrees of freedom

I will create a dataframe with all these values so that way they are in a table format.

##             CGV     AIC    adj.R2 degrees.of.freedom
## GCV.Cp 8.435412 345.708 0.9528156           21.57091

##Response: The adj.R^2 is very high, indicating that the model can explain much of the variance. However, the AIC is also high meaning that the model might need some smoothing especially when putting this in the context that the degrees of freedom are pretty low between 20 and 21. The CGV score is not high but it does suggest that the model could use some tuning.

Diagnostic Plot - gam.check()

Now I will use the gam.check function and look at the diagnostic plot

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 41 iterations.
## The RMS GCV score gradient at convergence was 2.767255e-07 .
## The Hessian was positive definite.
## Model rank =  64 / 64 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                   k'  edf k-index p-value  
## s(age)          9.00 1.00    0.81   0.065 .
## s(waistcirc)    9.00 1.00    0.94   0.255  
## s(hipcirc)      9.00 1.78    1.02   0.505  
## s(elbowbreadth) 9.00 1.00    0.81   0.035 *
## s(kneebreadth)  9.00 8.75    1.08   0.685  
## s(anthro3a)     9.00 1.00    1.09   0.705  
## s(anthro3c)     9.00 7.04    0.89   0.145  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##Repsonse

Overall the model does not look too horrible. The Histogram of the residuals does have a tail. The Resids vs linear pred. has a good spread and is cigar shaped but there does seem to be quite some variance. The summary also suggests that we should remove elbowbreadth and age. In addition, if the edf is close to k then the diminsions of those vairbales need to be increased. So in the case case kneebreadgth and antrho3c are between 7 and 8 which is very close to k’ which is 9. Therefore, there is some evidence of paramater tuning to fit an even better model.

##Problem 1 Part C)

c) Now remove insignificant variables and remove smoothing for some variables. 
Report the summary, plot, GCV, AIC,         adj-R$^2$.
  
  
    bodyfat_gam2 <- gam(DEXfat~ waistcirc + s(hipcirc) + 
                 s(kneebreadth)+ anthro3a +
                 s(anthro3c), data = bodyfat)

##2nd Model

I will fit a model according the parameters listed above. DEXfat is explained by waistcirc, hipcirc, kneebreadth, antrho3a and anthro3c. splines/smoothers will be added to kneebreadth, hipcirc, and antrho3c. The summary and plots of the model will be outputted

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DEXfat ~ waistcirc + s(hipcirc) + s(kneebreadth) + anthro3a + 
##     s(anthro3c)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.19588    7.12570  -1.852 0.069897 .  
## waistcirc     0.19654    0.05425   3.623 0.000676 ***
## anthro3a      6.92774    1.63128   4.247 9.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df      F  p-value    
## s(hipcirc)     1.610  2.010 10.910 0.000103 ***
## s(kneebreadth) 8.793  8.970  6.780 2.48e-06 ***
## s(anthro3c)    7.117  8.103  2.126 0.048737 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.954   Deviance explained = 96.7%
## GCV = 7.9464  Scale est. = 5.6498    n = 71

##Response: The smoothing of these variables remains consistent with what I stated for the previous model.

##GCV, AIC, adj-R\(^2\), and total model degrees of freedom for bodyfat_gam2

Now I will look that statistics of the second model (bodyfat_gam2)

##        CGV_mod2 AIC_mod2 adj.R2_mod2 degrees.of.freedom_mod2
## GCV.Cp 7.946447 343.2562   0.9536683                17.52001

Diagnostic Plot - gam.check() for bodyfat_gam2

I will output the diagnostic plots of bodfat_game2

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 24 iterations.
## The RMS GCV score gradient at convergence was 0.0001386163 .
## The Hessian was positive definite.
## Model rank =  30 / 30 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                  k'  edf k-index p-value
## s(hipcirc)     9.00 1.61    1.01    0.49
## s(kneebreadth) 9.00 8.79    1.06    0.61
## s(anthro3c)    9.00 7.12    0.91    0.16

##My Response:

It does not seem to make whole lot of difference when removing the smoothers and removing some of the variables like age and elbowbreadth. The AIC and CGV only decreased very slightly and the adj.R^2 values are very close. The degrees of freedom is lower in the second model.

##Problem 1 Part D)

d) Again fit an additive model to the body fat data, 
but this time for a log-transformed response. 
Compare the three models, which one is more appropriate? 
(Hint: use Adj-R$^2$, residual plots, etc. to compare models).

##3rd Model

Creating a 3rd model with all the same parameters as the 2nd model but taking the log of DEXfat. I will print a summary and plots of the model

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(DEXfat) ~ waistcirc + s(hipcirc) + s(kneebreadth) + anthro3a + 
##     s(anthro3c)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.139779   0.237083   9.025  1.8e-12 ***
## waistcirc   0.004418   0.001806   2.447 0.017610 *  
## anthro3a    0.215488   0.054600   3.947 0.000226 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df      F  p-value    
## s(hipcirc)     2.909  3.616 11.828  8.8e-07 ***
## s(kneebreadth) 2.325  2.962  2.027 0.128320    
## s(anthro3c)    7.358  8.263  4.678 0.000144 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.952   Deviance explained = 96.2%
## GCV = 0.0088137  Scale est. = 0.006878  n = 71

##Response It appears some of the variables need less smoothing.

##GCV, AIC, adj-R\(^2\), and total model degrees of freedom for bodyfat_gam3

I will output the statistics of the 3rd model bodyfat_gam3

##           CGV_mod3 AIC_mod3 adj.R2_mod3 degrees.of.freedom_mod3
## GCV.Cp 0.008813659  -136.47   0.9522733                12.59274

##Outputting CGV, AIC, adj-R2, and degrees of freedom all 3 models

Now I will combine all 3 models statistics into one table for comparing models.

##My Repsonse:

It appears that the log transformation of the response variable–DEXfat–greatly reduces the CGV score and the AIC value. The degrees of freedom are also reduced. However, the adj.R2 value for explaining the variance of the data does not seem to increase or decrease from among all 3 models.

##Problem 1 E)

e) Fit a generalised additive model that underwent AIC-based variable selection 
(fitted using function gamboost() function). 
What variable was removed by using AIC? 
 
   bodyfat_boost <- gamboost(DEXfat~., data = bodyfat)
   bodyfat_aic <- AIC(bodyfat_boost)
   bf_gam <- bodyfat_boost[mstop(bodyfat_aic)]

I will fit a generalised additive model that undergoes AIC-based vairable selection using the gamboost() command and this will include all the variables in the data set. I will print a summary of the bodyfat_boost and will plot

## 
##   Model-based Boosting
## 
## Call:
## gamboost(formula = DEXfat ~ ., data = bodyfat)
## 
## 
##   Squared Error (Regression) 
## 
## Loss function: (y - f)^2 
##  
## 
## Number of boosting iterations: mstop = 51 
## Step size:  0.1 
## Offset:  30.78282 
## Number of baselearners:  9 
## 
## Selection frequencies:
##  bbs(kneebreadth, df = dfbase)     bbs(anthro3b, df = dfbase) 
##                     0.35294118                     0.17647059 
##      bbs(hipcirc, df = dfbase)     bbs(anthro3a, df = dfbase) 
##                     0.13725490                     0.11764706 
##     bbs(anthro3c, df = dfbase)    bbs(waistcirc, df = dfbase) 
##                     0.09803922                     0.07843137 
## bbs(elbowbreadth, df = dfbase)      bbs(anthro4, df = dfbase) 
##                     0.01960784                     0.01960784

##AIC of gamboost model

## [1] 3.268173
## Optimal number of boosting iterations: 51 
## Degrees of freedom (for mstop = 51): 7.637287

##All variable names from original data set

##  [1] "age"          "DEXfat"       "waistcirc"    "hipcirc"     
##  [5] "elbowbreadth" "kneebreadth"  "anthro3a"     "anthro3b"    
##  [9] "anthro3c"     "anthro4"

##Variables used under AIC-variable selection

##  bbs(kneebreadth, df = dfbase)     bbs(anthro3b, df = dfbase) 
##                     0.35294118                     0.17647059 
##      bbs(hipcirc, df = dfbase)     bbs(anthro3a, df = dfbase) 
##                     0.13725490                     0.11764706 
##     bbs(anthro3c, df = dfbase)    bbs(waistcirc, df = dfbase) 
##                     0.09803922                     0.07843137 
## bbs(elbowbreadth, df = dfbase)      bbs(anthro4, df = dfbase) 
##                     0.01960784                     0.01960784

##My Response:

It appears that the only variable that was removed for predicting DEXfat was age. The rest of the variables that we manually left out due to high correlation coefficients were left in. elbowbreadth was even left in which did not have a significant correlation coefficient.

##Problem 2

Fit a logistic additive model to the glaucoma data. (Here use family = “binomial”). Which covariates should enter the model and how is their influence on the probability of suffering from glaucoma? (Hint: since there are many covariates, use to fit the GAM model.)

##Fitted Model and Summary

I will use the gamboost function on the GlaucomaM Data set for predicting Gluacoma class with all other variables.

## 
##   Model-based Boosting
## 
## Call:
## gamboost(formula = Class ~ ., data = GlaucomaM, family = Binomial())
## 
## 
##   Negative Binomial Likelihood (logit link) 
## 
## Loss function: { 
##      f <- pmin(abs(f), 36) * sign(f) 
##      p <- exp(f)/(exp(f) + exp(-f)) 
##      y <- (y + 1)/2 
##      -y * log(p) - (1 - y) * log(1 - p) 
##  } 
##  
## 
## Number of boosting iterations: mstop = 100 
## Step size:  0.1 
## Offset:  0 
## Number of baselearners:  62 
## 
## Selection frequencies:
##  bbs(tmi, df = dfbase) bbs(mhcg, df = dfbase) bbs(vars, df = dfbase) 
##                   0.17                   0.11                   0.11 
## bbs(mhci, df = dfbase)  bbs(hvc, df = dfbase) bbs(vass, df = dfbase) 
##                   0.10                   0.08                   0.08 
##   bbs(as, df = dfbase) bbs(vari, df = dfbase)   bbs(mv, df = dfbase) 
##                   0.07                   0.06                   0.04 
## bbs(abrs, df = dfbase) bbs(mhcn, df = dfbase) bbs(phcn, df = dfbase) 
##                   0.03                   0.03                   0.03 
##  bbs(mdn, df = dfbase) bbs(phci, df = dfbase)  bbs(hic, df = dfbase) 
##                   0.03                   0.02                   0.01 
## bbs(phcg, df = dfbase)  bbs(mdi, df = dfbase)  bbs(tms, df = dfbase) 
##                   0.01                   0.01                   0.01

##Plots of the model

##10 Fold Cross Validation & Error Rate

Using the similar method from the previous assignment of looping through the data and applying 10 fold cross validation. The error rate will also be outputted.

## [1] 0.1619048

##Response:

It appears that this error rate is higher than the random forest model used in the last assignment and is higher than the adaboost function that I used in last assignment.

##Problem 3

Investigate the use of different types of scatterplot smoothers on the Hubble data from Chapter 6. (Hint: follow the example on men1500m data scattersmoothers page 199 of Handbook).

Loading the library and dataset hubble and creating a linear regression model so plot the scatter plots with the following models:

##Fitting a Linear Regression Model

##Applying a lowess() command for Smoothing

##Fitting a gam Model

##Fitting a Quadratic Model

##Ploting All Models

In order to fit the the lines correctly displayed on the scatter plots for the predictive models for quadratic and gam, their values need to be in ascending order when doing the plot() command

##Final Response:

It looks like all models fit reasonable well. The data does contain some outliers which is throwing some of the fits off. I would say that the lowess smother tends to fit this data the best.