Assignment 6: Ch. 7, Problems #6 and #10

Problem #6

6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(ISLR)   # Wage data set
library(boot)   # cv.glm() for cross-validation
attach(Wage)
plot(age, wage)

Here is a scatterplot of wage against age. Next we perform polynomial regression and select the degree by 10-fold cross-validation.

set.seed(35)
cv.error <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.error[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]  # 10-fold CV error
}

plot(1:10, cv.error, pch = 19, type = "b", xlab = "Polynomial degree", ylab = "Estimate of prediction error")

According to the plot, the prediction error stops decreasing appreciably after degree 4, so a fourth-degree polynomial is a reasonable choice.
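As a quick programmatic check of the plot, we can locate the global minimum of the estimated error (a minimal sketch using the cv.error vector computed above; the exact minimizer can shift with the random seed):

which.min(cv.error)   # degree attaining the smallest CV error
round(cv.error, 1)    # inspect how flat the curve is past degree 4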

Now we perform an ANOVA, using F-tests on a sequence of nested models, to assess whether each additional polynomial degree significantly improves the fit.

d0 <- lm(wage ~ 1, data = Wage)
d1 <- lm(wage ~ poly(age, 1), data = Wage)
d2 <- lm(wage ~ poly(age, 2), data = Wage)
d3 <- lm(wage ~ poly(age, 3), data = Wage)
d4 <- lm(wage ~ poly(age, 4), data = Wage)
d5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(d0, d1, d2, d3, d4, d5)
## Analysis of Variance Table
## 
## Model 1: wage ~ 1
## Model 2: wage ~ poly(age, 1)
## Model 3: wage ~ poly(age, 2)
## Model 4: wage ~ poly(age, 3)
## Model 5: wage ~ poly(age, 4)
## Model 6: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2999 5222086                                    
## 2   2998 5022216  1    199870 125.4443 < 2.2e-16 ***
## 3   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 4   2996 4777674  1     15756   9.8888  0.001679 ** 
## 5   2995 4771604  1      6070   3.8098  0.051046 .  
## 6   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA shows that the linear, quadratic, and cubic terms each significantly improve on the model of one degree lower, and the quartic term is borderline (p ≈ 0.051), while the fifth-degree term adds nothing significant (p ≈ 0.37). This is consistent with the cross-validation result: a polynomial of degree 3 or 4 is adequate, and we use degree 4 below.
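A useful cross-check (a small sketch using the fits above): because poly() constructs orthogonal polynomials, the sequential F-statistics in the ANOVA table equal the squares of the coefficient t-statistics from the largest model, so the same conclusions can be read directly from the degree-5 fit.

summary(d5)$coefficients   # t for the degree-5 term, squared, is about 0.805, matching F above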

fit.4 <- glm(wage ~ poly(age, 4), data = Wage)  # refit the chosen degree (glm.fit still holds the degree-10 fit from the loop)
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit.4, newdata = list(age = age.grid), se.fit = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
plot(age, wage, xlim = agelims, cex = 0.5, col = "grey")
title("Polynomial fit using degree 4")
lines(age.grid, preds$fit, lwd = 2, col = "red")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

set.seed(35)
cv.error.20 <- rep(NA, 19)
for (i in 2:20) {
  Wage$age.cut <- cut(Wage$age, i)              # i intervals
  step.fit <- glm(wage ~ age.cut, data = Wage)
  cv.error.20[i - 1] <- cv.glm(Wage, step.fit, K = 10)$delta[1]
}
cv.error.20
##  [1] 1733.994 1683.108 1634.919 1631.032 1623.000 1613.990 1603.455 1612.874
##  [9] 1604.741 1599.980 1605.530 1602.442 1608.302 1604.775 1596.984 1601.898
## [17] 1602.817 1604.728 1604.396
plot(2:20, cv.error.20, type = "b", xlab = "Number of intervals", ylab = "CV error")

The plot shows that the CV error flattens out once age is divided into about 8 intervals. Note that cut(age, 8) produces 8 intervals (7 interior break points), which enter the model as 7 dummy variables plus the intercept. We therefore use 8 intervals.
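For reference, the global minimum can be located programmatically (a small sketch; recall the loop stored the error for i intervals at position i - 1, so we add 1 back). From the errors printed above, the global minimum actually falls at 16 intervals, but the improvement over 8 is marginal, so we keep the simpler model.

which.min(cv.error.20) + 1   # number of intervals at the global CV minimum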

step.fit <- glm(wage ~ cut(age, 8), data = Wage)
preds2 <- predict(step.fit, newdata = list(age = age.grid), se.fit = TRUE)
se.bands2 <- cbind(preds2$fit + 2 * preds2$se.fit, preds2$fit - 2 * preds2$se.fit)
plot(age, wage, xlim = agelims, cex = 0.5, col = "darkgrey")
title("Step function using 8 intervals")
lines(age.grid, preds2$fit, lwd = 2, col = "red")
matlines(age.grid, se.bands2, lwd = 1, col = "blue", lty = 3)

Problem #10

10. This question relates to the College data set.

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

First we split the data into training and test sets.

library(caTools)   # sample.split()
set.seed(35)
# 80/20 split based on the response vector
college_sample <- sample.split(College$Outstate, SplitRatio = 0.80)
college_train <- subset(College, college_sample == TRUE)
college_test  <- subset(College, college_sample == FALSE)

Now we perform forward stepwise selection to identify a good subset of predictors.

library(leaps)
fit.fwd <- regsubsets(Outstate~., data=college_train, nvmax=17, method="forward")
fit.summary <- summary(fit.fwd)
fit.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = college_train, nvmax = 17, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
##           PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 )  " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 3  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 9  ( 1 )  "*"        " "  "*"    " "    " "       " "       " "        
## 10  ( 1 ) "*"        " "  "*"    " "    " "       " "       "*"        
## 11  ( 1 ) "*"        "*"  "*"    " "    " "       " "       "*"        
## 12  ( 1 ) "*"        "*"  "*"    " "    "*"       " "       "*"        
## 13  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       "*"        
## 14  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       "*"        
## 15  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       "*"        
## 16  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       "*"        
## 17  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
##           P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 )  " "         " "        " "   " "      " " " "      " "      
## 2  ( 1 )  " "         " "        " "   " "      " " " "      " "      
## 3  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 6  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 7  ( 1 )  " "         "*"        " "   "*"      "*" " "      " "      
## 8  ( 1 )  " "         "*"        " "   "*"      "*" "*"      " "      
## 9  ( 1 )  " "         "*"        " "   "*"      "*" "*"      " "      
## 10  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 11  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 12  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 13  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 14  ( 1 ) " "         "*"        "*"   "*"      "*" "*"      " "      
## 15  ( 1 ) " "         "*"        "*"   "*"      "*" "*"      "*"      
## 16  ( 1 ) "*"         "*"        "*"   "*"      "*" "*"      "*"      
## 17  ( 1 ) "*"         "*"        "*"   "*"      "*" "*"      "*"      
##           perc.alumni Expend Grad.Rate
## 1  ( 1 )  " "         "*"    " "      
## 2  ( 1 )  " "         "*"    " "      
## 3  ( 1 )  " "         "*"    " "      
## 4  ( 1 )  "*"         "*"    " "      
## 5  ( 1 )  "*"         "*"    " "      
## 6  ( 1 )  "*"         "*"    "*"      
## 7  ( 1 )  "*"         "*"    "*"      
## 8  ( 1 )  "*"         "*"    "*"      
## 9  ( 1 )  "*"         "*"    "*"      
## 10  ( 1 ) "*"         "*"    "*"      
## 11  ( 1 ) "*"         "*"    "*"      
## 12  ( 1 ) "*"         "*"    "*"      
## 13  ( 1 ) "*"         "*"    "*"      
## 14  ( 1 ) "*"         "*"    "*"      
## 15  ( 1 ) "*"         "*"    "*"      
## 16  ( 1 ) "*"         "*"    "*"      
## 17  ( 1 ) "*"         "*"    "*"
par(mfrow = c(2, 2))
plot(1:17, fit.summary$cp, xlab = "Number of variables", ylab = "Cp", main = "Cp", type = "b")
plot(1:17, fit.summary$bic, xlab = "Number of variables", ylab = "BIC", main = "BIC", type = "b")
plot(1:17, fit.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R-squared", main = "Adjusted R-squared", type = "b")

Based on the plots of Cp, BIC, and adjusted R-squared, the fit does not improve substantially once more than about 7 variables are included, so we settle on the 7-variable model.
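To make this less of a visual judgment, we can also check where each criterion is optimized (a minimal sketch; the exact optima depend on the train/test split and seed):

which.min(fit.summary$cp)     # model size minimizing Cp
which.min(fit.summary$bic)    # model size minimizing BIC
which.max(fit.summary$adjr2)  # model size maximizing adjusted R-squared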

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

coef(fit.fwd,7)
##   (Intercept)    PrivateYes    Room.Board      Personal           PhD 
## -3032.6157467  2635.2052422     0.9011922    -0.3334100    36.4534917 
##   perc.alumni        Expend     Grad.Rate 
##    42.8333114     0.2509074    30.9359870

Guided by these variables, we fit a GAM with smoothing splines on the quantitative predictors (note that the formula below uses F.Undergrad in place of Personal).

library(gam)   # gam() and s() for smoothing splines
GAM.1 <- gam(Outstate ~ Private + s(F.Undergrad, 4) + s(Room.Board, 4) + s(PhD, 4) +
               s(perc.alumni, 2) + s(Expend, 4) + s(Grad.Rate, 5), data = college_train)

par(mfrow = c(3, 3))
plot(GAM.1, se = TRUE, col = "red")   # partial effect of each term with +/- 2 SE bands

Based on these plots, out-of-state tuition increases with room and board costs, and it also increases with the percentage of alumni who donate.

(c) Evaluate the model obtained on the test set, and explain the results obtained.

preds <- predict(GAM.1,newdata = college_test)
mse <- mean((college_test$Outstate - preds)^2)
mse
## [1] 3248632

The test-set mean squared error is 3,248,632. This is easier to interpret on the scale of the response, as computed below.
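Converting the MSE to a root mean squared error and a test-set R-squared (a minimal sketch using the objects defined above; the RMSE follows directly from the MSE just reported):

sqrt(mse)   # RMSE: about 1,802 dollars of out-of-state tuition
1 - mse / mean((college_test$Outstate - mean(college_test$Outstate))^2)   # test-set R-squared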

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

From the plots above, the partial effects of the Expend and Grad.Rate variables look non-linear. We can test this with an ANOVA comparing a GAM in which these two variables enter linearly against one in which they enter as smoothing splines:

GAM.2 <- gam(Outstate ~ Private + s(F.Undergrad, 4) + s(Room.Board, 4) + s(PhD, 4) +
               s(perc.alumni, 2) + Expend + Grad.Rate, data = college_train)
GAM.3 <- gam(Outstate ~ Private + s(F.Undergrad, 4) + s(Room.Board, 4) + s(PhD, 4) +
               s(perc.alumni, 2) + s(Expend, 4) + s(Grad.Rate, 4), data = college_train)
anova(GAM.2, GAM.3, test = "F")
## Analysis of Deviance Table
## 
## Model 1: Outstate ~ Private + s(F.Undergrad, 4) + s(Room.Board, 4) + s(PhD, 
##     4) + s(perc.alumni, 2) + Expend + Grad.Rate
## Model 2: Outstate ~ Private + s(F.Undergrad, 4) + s(Room.Board, 4) + s(PhD, 
##     4) + s(perc.alumni, 2) + s(Expend, 4) + s(Grad.Rate, 4)
##   Resid. Df Resid. Dev     Df  Deviance      F    Pr(>F)    
## 1       603 2338817604                                      
## 2       597 2059730017 5.9997 279087587 13.483 2.264e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This result shows that the model with smoothing splines (4 degrees of freedom) for Expend and Grad.Rate improves significantly on the model in which these variables enter linearly (F ≈ 13.5, p < 0.001), so there is evidence of a non-linear relationship between these two predictors and out-of-state tuition.
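The same question can also be approached through the fitted model itself: summary() on a gam object prints an "Anova for Nonparametric Effects" table with approximate F-tests for the non-linear component of each smooth term (a sketch; output omitted here):

summary(GAM.1)   # approximate F-tests for the non-linear part of each s() term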