Chapter 7

Problem 6.

6. In this exercise, you will further analyze the Wage data set considered throughout this chapter. (a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

# Using the caret package to perform cross-validation with polynomials of various degrees
library(ISLR)    # Wage and College data sets
library(caret)   # also attaches ggplot2, used for the plots below
set.seed(123)
fit_ctrl <- trainControl(method = "cv", number = 10)

cv_mse <- rep(0,10)

for (i in 1:10) {
  poly_model <- train(y = Wage$wage,
                      x = poly(Wage$age, i),
                      method = "lm",
                      metric = "RMSE",
                      trControl = fit_ctrl)
  cv_mse[i] <- poly_model$results$RMSE
}

plot(cv_mse, type = "b", xlab = "Degree", ylab = "CV RMSE",
     main = "Root mean squared error for different degree polynomials")
points(x = 3,y=cv_mse[3],col="green",pch = 19,cex=1)

cv_mse
##  [1] 40.89461 39.88213 39.77288 39.81674 39.83530 39.78988 39.88233 39.89711
##  [9] 39.83430 39.85711

A) Cross-validation shows that the lowest RMSE is obtained with a degree-3 polynomial of age.
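
Rather than reading the minimum off the plot, the chosen degree can be confirmed directly from the `cv_mse` vector computed above:

which.min(cv_mse)   # index (= degree) with the lowest CV RMSE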

Now performing ANOVA on the sequence of nested models with increasing polynomial degree of age. For each comparison of adjacent models:

\(H_0\): the simpler (lower-degree) fit is sufficient. \(H_a\): the simpler fit is not sufficient and the higher-degree term is needed.

fit.1 = lm(wage ~ age ,data=Wage)
fit.2 = lm(wage ~ poly(age ,2) ,data=Wage)
fit.3 = lm(wage ~ poly(age ,3) ,data=Wage)
fit.4 = lm(wage ~ poly(age ,4) ,data=Wage)
fit.5 = lm(wage ~ poly(age ,5) ,data=Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
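
Each F statistic in this table compares two adjacent fits: the drop in RSS from adding one more degree, divided by the residual mean square of the largest (degree-5) model. As a check, using the numbers above for the cubic row (with \(n = 3000\)):

\[
F = \frac{(\mathrm{RSS}_2 - \mathrm{RSS}_3)/1}{\mathrm{RSS}_5/(n-6)} = \frac{4793430 - 4777674}{4770322/2994} \approx 9.89,
\]

which matches the F value reported for Model 3.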

The ANOVA shows that the quadratic and cubic terms are both highly significant, the p-value for the quartic term (Model 4 vs. Model 3) is right at 0.05, and the degree-5 term is clearly not significant. So a cubic (or at most quartic) fit is sufficient, which is consistent with cross-validation, where the lowest RMSE occurred at degree 3.
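
Because poly() generates orthogonal polynomials, the same p-values can also be read off the coefficient t-statistics of the largest fit, since each squared t-statistic equals the corresponding F statistic from the nested ANOVA (a quick sanity check, using the degree-5 fit already computed):

coef(summary(fit.5))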

I have plotted the cross-validation result by fitting a cubic model, along with a degree-4 polynomial for comparison.

ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.3, color = "darkgrey") + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 3)) + 
  labs(title = "Wage as a function of age - polynomial (d = 3) regression",
       subtitle = "Predicting 'wage' with a cubic polynomial of 'age'")

ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.3, color = "darkgrey") + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 4)) + 
  labs(title = "Wage as a function of age - polynomial (d = 4) regression",
       subtitle = "Predicting 'wage' with a quartic polynomial of 'age'")
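
For a side-by-side view, the two fits can also be overlaid on a single plot (a small sketch using the same `Wage` data; the legend labels are arbitrary):

ggplot(Wage, aes(x = age, y = wage)) +
  geom_point(alpha = 0.3, color = "darkgrey") +
  geom_smooth(aes(color = "degree 3"), method = "lm", formula = y ~ poly(x, 3), se = FALSE) +
  geom_smooth(aes(color = "degree 4"), method = "lm", formula = y ~ poly(x, 4), se = FALSE) +
  labs(title = "Cubic vs. quartic polynomial fits of wage on age", color = "Fit")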

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

set.seed(123)
# CV RMSE for step functions with 2 to 20 cuts; index the vector by the number of cuts
cv_step <- rep(NA, 20)
for (i in 2:20) {
  model_step <- train(y = Wage$wage,
                      x = data.frame(age_cut = cut(Wage$age, i)),
                      method = "lm",
                      metric = "RMSE",
                      trControl = fit_ctrl)
  cv_step[i] <- model_step$results$RMSE
}

plot(cv_step, type = "b", xlab = "Number of cuts", ylab = "Step function CV RMSE")
points(x = 12, y = cv_step[12], col = "green", pch = 19, cex = 1)

A) Cross-validation shows that the lowest RMSE occurs at 12 cuts of age.
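
As before, the chosen number of cuts can be confirmed directly from the `cv_step` vector (which is indexed by the number of cuts):

which.min(cv_step)   # number of cuts with the lowest CV RMSE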

agelims <- range(Wage$age)
age_grid <- seq(from = agelims[1], to = agelims[2])

# Fit the chosen step function with 12 cuts
step_fit <- lm(wage ~ cut(age, 12), data = Wage)

# age_grid spans the same range as Wage$age, so cut() reproduces the same bins at predict time
step_preds <- predict(step_fit, newdata = list(age = age_grid), se = TRUE)

se.bands <- cbind(step_preds$fit + 2 * step_preds$se.fit,
                  step_preds$fit - 2 * step_preds$se.fit)
plot(Wage$age, Wage$wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Step function using 12 cuts")
lines(age_grid, step_preds$fit, lwd = 2, col = "#2c7fb8")
matlines(age_grid, se.bands, lwd = 1, col = "blue", lty = 3)

Problem 10.

10. This question relates to the College data set. (a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

# Training and test data set prep
library(caTools)   # sample.split
library(leaps)     # regsubsets
set.seed(1)

myCollege = College
sample = sample.split(myCollege$Outstate, SplitRatio = 0.80)
college_train = subset(myCollege, sample == TRUE) 
college_test = subset(myCollege, sample == FALSE)

# Forward stepwise selection on the training set using all 17 predictors.
col.fwd = regsubsets(Outstate~., data=college_train, nvmax=17, method="forward")
col.summary = summary(col.fwd)
# Some Statistical metrics.
which.min(col.summary$cp)    #lower is better.
## [1] 14
which.min(col.summary$bic)   #lower is better.
## [1] 8
which.max(col.summary$adjr2) #higher the better.
## [1] 15
par(mfrow=c(1,3))

plot(col.summary$cp, type = "b", xlab = "Model#", ylab = "Cp", main = "Mallows' Cp")
points(x = 14, y = col.summary$cp[14], col = "green", pch = 19, cex = 1)
text(12, 200, "low at Model#14",
     cex = .8)
plot(col.summary$bic, type="b", xlab="Model#", ylab="BIC",main="BIC")
points(x =8,y=col.summary$bic[8],col="green",pch = 19,cex=1)
text(13, -700, "low at Model#8",
     cex = .8)

plot(col.summary$adjr2, type="b", xlab="Model#", ylab="AdjR-square",main="Adj R Squared")
points(x =15,y=col.summary$adjr2[15],col="green",pch = 19,cex=1)
text(13, 0.7, "high at Model#15",
     cex = .8)

par(mfrow=c(1,1))
coef(col.fwd,8)
##   (Intercept)    PrivateYes        Accept        Enroll    Room.Board 
## -2613.2528603  2827.3425089     0.4583878    -0.9646202     0.8690132 
##           PhD   perc.alumni        Expend     Grad.Rate 
##    30.1308505    60.6662833     0.2068679    22.3151954

A) Mallows' Cp and BIC show little improvement beyond the 8-variable model, and adjusted \(R^2\) likewise improves only marginally after 8 variables. So I select the model with 8 variables (the simpler model is preferred).
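
The predictors entering the selected 8-variable model can also be pulled out programmatically (a small convenience sketch; `sel_vars` is just an illustrative name), which is handy for writing the GAM formula in part (b):

sel_vars <- names(coef(col.fwd, 8))[-1]   # drop the intercept; 'PrivateYes' is the dummy for the factor Private
sel_vars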

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

# Fit a GAM with smoothing splines on the quantitative predictors selected in (a)
library(gam)

col_gam = gam(Outstate~Private+
               s(Accept,4)+
               s(Enroll,3)+
               s(Room.Board)+
               s(PhD,3)+
               s(perc.alumni,4)+
               s(Expend,4)+
               s(Grad.Rate,3), data=college_train)
par(mfrow=c(2,3))
plot(col_gam, col="blue", se=T)

A) Some predictors, such as Accept and Expend, show strongly non-linear behavior. Out-of-state tuition stays roughly flat as the percentage of faculty with PhDs rises from 30% to 60%, then increases beyond 60%. The number of applications accepted (Accept) has a non-linear, increasing effect on out-of-state tuition. Instructional expenditure per student (Expend) shows a slight decreasing trend beyond a certain point. perc.alumni has an almost linear relationship with Outstate.

(c) Evaluate the model obtained on the test set, and explain the results obtained.

# GAM predictions using the test data set
gam_preds = predict(col_gam,newdata = college_test)
gam_test_mse = mean((college_test$Outstate - gam_preds)^2)

gam_rss <- sum((college_test$Outstate - gam_preds)^2) #residual sum of squares
tss_gam <- sum((college_test$Outstate - mean(college_test$Outstate))^2)#total sum of squares

gam_test_r2 <- 1-(gam_rss/tss_gam)
gam_test_r2
## [1] 0.7878314
#comparing against a plain linear model
col_lm <- lm(Outstate~Private+Accept+Enroll+Room.Board+PhD+perc.alumni+Expend+Grad.Rate, data=college_train) 
lm_preds = predict(col_lm,newdata = college_test)
lm_test_mse = mean((college_test$Outstate - lm_preds)^2)

lm_rss <- sum((college_test$Outstate - lm_preds)^2) #residual sum of squares

lm_test_r2 <- 1-(lm_rss/tss_gam)
lm_test_r2
## [1] 0.7433491

A) Comparing the two metrics, test MSE and test \(R^2\), the GAM does better than the linear model, which also suggests non-linearity in some of the predictors. The statistics are:

GAM test MSE \(\approx 4.05 \times 10^{6}\), which is less than the linear model's test MSE \(\approx 4.90 \times 10^{6}\).

GAM test \(R^2 = 0.7878\), which is greater than the linear model's test \(R^2 = 0.7433\).
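
Because MSE is in squared dollars of tuition, the comparison may be easier to read on the RMSE scale; taking square roots of the MSEs already computed above, the GAM's typical test error is roughly $200 lower than the linear model's:

sqrt(gam_test_mse)   # GAM test RMSE (about $2.0k, from the MSE above)
sqrt(lm_test_mse)    # linear model test RMSE (about $2.2k)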

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(col_gam)
## 
## Call: gam(formula = Outstate ~ Private + s(Accept, 4) + s(Enroll, 3) + 
##     s(Room.Board) + s(PhD, 3) + s(perc.alumni, 4) + s(Expend, 
##     4) + s(Grad.Rate, 3), data = college_train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -6558.2 -1023.1   106.6  1160.9  7958.4 
## 
## (Dispersion Parameter for gaussian family taken to be 3160951)
## 
##     Null Deviance: 9578775148 on 620 degrees of freedom
## Residual Deviance: 1877603286 on 593.9995 degrees of freedom
## AIC: 11084.84 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 2557796574 2557796574 809.186 < 2.2e-16 ***
## s(Accept, 4)        1  731826879  731826879 231.521 < 2.2e-16 ***
## s(Enroll, 3)        1  139797656  139797656  44.227 6.642e-11 ***
## s(Room.Board)       1 1179051601 1179051601 373.005 < 2.2e-16 ***
## s(PhD, 3)           1  482735849  482735849 152.719 < 2.2e-16 ***
## s(perc.alumni, 4)   1  487912630  487912630 154.356 < 2.2e-16 ***
## s(Expend, 4)        1  672178145  672178145 212.651 < 2.2e-16 ***
## s(Grad.Rate, 3)     1   64106292   64106292  20.281 8.052e-06 ***
## Residuals         594 1877603286    3160951                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df Npar F     Pr(F)    
## (Intercept)                                   
## Private                                       
## s(Accept, 4)            3  9.923 2.164e-06 ***
## s(Enroll, 3)            2  5.690  0.003567 ** 
## s(Room.Board)           3  1.682  0.169714    
## s(PhD, 3)               2  1.471  0.230513    
## s(perc.alumni, 4)       3  0.571  0.634481    
## s(Expend, 4)            3 37.030 < 2.2e-16 ***
## s(Grad.Rate, 3)         2  1.196  0.303181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A) To assess non-linearity, we can run summary() on the fitted GAM. Looking at the "Anova for Nonparametric Effects" table with \(\alpha = 0.05\), only Accept, Enroll, and Expend show significant evidence of a non-linear relationship with the response when modeled with smoothing splines.

The other smoothed predictors, Room.Board, PhD, perc.alumni, and Grad.Rate, show no significant evidence of non-linearity.

We can now refit the GAM with this in mind, keeping smooth terms only for the predictors with evidence of non-linearity and entering the rest linearly:

col_gam_v2 = gam(Outstate~Private+
               s(Accept,4)+
               s(Enroll,3)+
               Room.Board+
               PhD+
               perc.alumni+
               s(Expend,4)+
               Grad.Rate,data=college_train)

gam_preds2 = predict(col_gam_v2,newdata = college_test)
gam_test_mse2 = mean((college_test$Outstate - gam_preds2)^2)

gam_rss2 <- sum((college_test$Outstate - gam_preds2)^2) #residual sum of squares


gam2_test_r2 <- 1-(gam_rss2/tss_gam)
gam2_test_r2
## [1] 0.7839389

After refitting, the test \(R^2\) is essentially unchanged (0.7839 vs. 0.7878), so replacing the remaining smooth terms with linear effects costs almost nothing in predictive power.
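
As a further check (a sketch using the two fits above), the reduced and full GAMs can be compared with an approximate F test on the training data; a non-significant result supports dropping the extra smooth terms:

anova(col_gam_v2, col_gam, test = "F")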