(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
# Using CV to select the optimal polynomial degree
library(ISLR)
library(boot)
attach(Wage)
set.seed(5)
all.deltas=rep(NA,10)
for (i in 1:10) {
  glm.fit=glm(wage~poly(age,i),data=Wage)
  # delta[2] is the bias-corrected estimate of the CV error
  all.deltas[i]=cv.glm(Wage,glm.fit,K=10)$delta[2]
}
plot(1:10,all.deltas,xlab="Degree",ylab="CV Error",type="s",pch=20,lwd=2,ylim=c(1590,1700))
title("Optimal Degree")
min.point=min(all.deltas)
sd.points=sd(all.deltas)
deg.min=which.min(all.deltas)
points(deg.min,all.deltas[deg.min],col="red",cex=2)
abline(h=min.point+0.2*sd.points,col="red",lty="dashed")
abline(h=min.point-0.2*sd.points,col="red",lty="dashed")
K-fold cross-validation with K=10 selects degree 7 as the optimal polynomial degree. Next we compare this with hypothesis testing using ANOVA.
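As a quick sanity check on reading the CV curve (a small sketch using only objects computed above):
# Tabulate the CV errors; the minimum should appear at degree 7
data.frame(degree=1:10,cv.error=round(all.deltas,1))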
# ANOVA
wage.fit1=lm(wage~poly(age,1),data=Wage)
wage.fit2=lm(wage~poly(age,2),data=Wage)
wage.fit3=lm(wage~poly(age,3),data=Wage)
wage.fit4=lm(wage~poly(age,4),data=Wage)
wage.fit5=lm(wage~poly(age,5),data=Wage)
wage.fit6=lm(wage~poly(age,6),data=Wage)
wage.fit7=lm(wage~poly(age,7),data=Wage)
wage.fit8=lm(wage~poly(age,8),data=Wage)
wage.fit9=lm(wage~poly(age,9),data=Wage)
wage.fit10=lm(wage~poly(age,10),data=Wage)
anova(wage.fit1,wage.fit2,wage.fit3,wage.fit4,wage.fit5,wage.fit6,wage.fit7,wage.fit8,wage.fit9,wage.fit10)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
## Model 8: wage ~ poly(age, 8)
## Model 9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.7638 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.9005 0.001669 **
## 4 2995 4771604 1 6070 3.8143 0.050909 .
## 5 2994 4770322 1 1283 0.8059 0.369398
## 6 2993 4766389 1 3932 2.4709 0.116074
## 7 2992 4763834 1 2555 1.6057 0.205199
## 8 2991 4763707 1 127 0.0796 0.777865
## 9 2990 4756703 1 7004 4.4014 0.035994 *
## 10 2989 4756701 1 3 0.0017 0.967529
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With ANOVA, nothing above degree 3 is clearly significant: the degree-4 term is borderline (p ≈ 0.051), and the lone significant term at degree 9 is most likely noise. A cubic is therefore a reasonable choice, and it is considerably simpler than the degree 7 selected by CV. We now plot the degree-3 polynomial fit.
# Polynomial fit of degree 3
plot(wage~age,data=Wage,col="lightblue")
title("Degree-3 Polynomial Fit")
agelims=range(age)
age.grid=seq(from=agelims[1],to=agelims[2])
wage.fit3=lm(wage~poly(age,3),data=Wage)
pred.wage3=predict(wage.fit3,data.frame(age=age.grid))
lines(age.grid,pred.wage3,col="hotpink",lwd=2)
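For a visual comparison (an optional sketch I am adding, not part of the original answer), the degree-7 fit chosen by CV can be overlaid on the same plot:
# Overlay the CV-selected degree-7 fit; wage.fit.cv is my own name
wage.fit.cv=lm(wage~poly(age,7),data=Wage)
pred.wage.cv=predict(wage.fit.cv,data.frame(age=age.grid))
lines(age.grid,pred.wage.cv,col="darkblue",lwd=2,lty=2)
legend("topright",legend=c("Degree 3 (ANOVA)","Degree 7 (CV)"),col=c("hotpink","darkblue"),lwd=2,lty=c(1,2))
If the two curves are close over most of the age range, that supports preferring the simpler cubic.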
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
# K-fold cross-validation to choose the optimal number of cuts
all.cvs=rep(NA,10)
# cut() needs at least two intervals, so the loop starts at 2
for (i in 2:10) {
  Wage$age.cut=cut(Wage$age,i)
  lm.fit=glm(wage~age.cut,data=Wage)
  all.cvs[i]=cv.glm(Wage,lm.fit,K=10)$delta[2]
}
plot(2:10,all.cvs[-1],xlab="Number of Cuts",ylab="CV Error", type="s",pch=20,lwd=2)
title("Optimal Cuts")
deg.min2=which.min(all.cvs)
points(deg.min2,all.cvs[deg.min2],col="red",cex=2)
K-fold CV selects 8 as the optimal number of cuts.
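As in part (a), a small sketch to make the selection explicit (again using only objects already computed):
# CV error by number of cuts; which.min() ignores the NA at index 1
data.frame(cuts=2:10,cv.error=round(all.cvs[-1],1))
which.min(all.cvs)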
# Plot with 8 cuts
wage.cuts=glm(wage~cut(age,8),data=Wage)
agelims=range(age)
age.grid=seq(from=agelims[1],to=agelims[2])
pred.wage.cuts=predict(wage.cuts,data.frame(age=age.grid))
plot(wage~age,data=Wage,col="lightblue")
title("Plot with 8 Cuts")
lines(age.grid,pred.wage.cuts,col="hotpink",lwd=2)
detach(Wage)
library(leaps)
attach(College)
set.seed(5)
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward step-wise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
train=sample(1:nrow(College),nrow(College)/2)
test=-train
# regsubsets() defaults to nvmax=8, so subsets of up to 8 predictors are considered
Outstate.fit=regsubsets(Outstate~.,data=College,subset=train,method='forward')
Outstate.fit.summary=summary(Outstate.fit)
Outstate.fit.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train,
## method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " "*"
## 8 ( 1 ) "*" " " "*" " " " " " " "*"
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " " " " "
## 6 ( 1 ) " " "*" " " " " " " "*" " "
## 7 ( 1 ) " " "*" " " " " " " "*" " "
## 8 ( 1 ) " " "*" " " " " " " "*" " "
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" "*"
## 4 ( 1 ) " " "*" "*"
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
# Plots of Cp, BIC, and adjusted R2 to see which number of variables is best
par(mfrow=c(1,3))
plot(Outstate.fit.summary$cp, xlab='Number of Variables',ylab = 'Cp', type='s')
title("CP")
min.cp=min(Outstate.fit.summary$cp)
std.cp=sd(Outstate.fit.summary$cp)
abline(h=min.cp+0.2*std.cp, col='red', lty=2)
plot(Outstate.fit.summary$bic, xlab='Number of Variables',ylab = 'BIC', type='s')
title("BIC")
min.bic=min(Outstate.fit.summary$bic)
std.bic=sd(Outstate.fit.summary$bic)
abline(h=min.bic+0.2*std.bic, col='red', lty=2)
plot(Outstate.fit.summary$adjr2, xlab='Number of Variables',ylab = 'Adj R2', type='s')
title("Adj R2")
max.adjr2=max(Outstate.fit.summary$adjr2)
std.adjr2=sd(Outstate.fit.summary$adjr2)
# Adjusted R2 is maximized, so the tolerance line sits below the maximum
abline(h=max.adjr2-0.2*std.adjr2, col='red', lty=2)
The optimal number of variables appears to be 6: BIC is smallest there, and past 6 variables Cp barely decreases and adjusted R2 barely increases.
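To back up the visual read, a quick sketch of where each criterion actually turns (these are standard fields of the regsubsets summary):
# Model size at the optimum of each criterion
which.min(Outstate.fit.summary$cp)
which.min(Outstate.fit.summary$bic)
which.max(Outstate.fit.summary$adjr2)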
coef(Outstate.fit,id=6)
## (Intercept) PrivateYes Room.Board Terminal perc.alumni
## -3966.1313289 2577.3747926 0.8314032 36.7025504 59.6240457
## Expend Grad.Rate
## 0.2557118 33.6868606
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20.2
gam.gam=gam(Outstate~Private+s(Room.Board,5)+ s(Terminal, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data = College, subset = train)
par(mfrow=c(2,3))
plot(gam.gam,se=TRUE, col="magenta")
All of the selected variables show a positive relationship with out-of-state tuition. Terminal appears to have the weakest effect: its curve stays roughly flat until about 80% and only rises after that. The relationship for Expend looks clearly non-linear.
(c) Evaluate the model obtained on the test set, and explain the results obtained.
Ctest=College[test,]
gam.gam.preds=predict(gam.gam,Ctest)
gam.gam.err=mean((Ctest$Outstate-gam.gam.preds)^2)
gam.gam.err
## [1] 3602189
gam.gam.tss=mean((Ctest$Outstate-mean(Ctest$Outstate))^2)
test.r2=1-gam.gam.err/gam.gam.tss
test.r2
## [1] 0.7697607
The test MSE for this model is 3602189 and the test \(R^2\) is about 0.7698. In other words, roughly 77% of the variance in out-of-state tuition is explained by the six predictors we used: Private, Room.Board, Terminal, perc.alumni, Expend, and Grad.Rate.
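For context (a sketch I am adding; lm.fit6 and lm.preds6 are my own names), a plain linear model with the same six predictors on the same split gives a baseline against which the GAM's test error can be judged:
# Baseline: ordinary least squares with the six selected predictors
lm.fit6=lm(Outstate~Private+Room.Board+Terminal+perc.alumni+Expend+Grad.Rate,data=College,subset=train)
lm.preds6=predict(lm.fit6,Ctest)
mean((Ctest$Outstate-lm.preds6)^2)                 # test MSE
1-mean((Ctest$Outstate-lm.preds6)^2)/gam.gam.tss   # test R^2
If the GAM's test \(R^2\) exceeds this baseline, the smooth terms are earning their extra flexibility.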
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.gam)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 5) + s(Terminal,
## 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5),
## data = College, subset = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6943.58 -1034.14 51.71 1192.87 4315.41
##
## (Dispersion Parameter for gaussian family taken to be 3383760)
##
## Null Deviance: 6473143184 on 387 degrees of freedom
## Residual Deviance: 1221537676 on 361.0001 degrees of freedom
## AIC: 6962.496
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1853064551 1853064551 547.635 < 2.2e-16 ***
## s(Room.Board, 5) 1 1326119858 1326119858 391.907 < 2.2e-16 ***
## s(Terminal, 5) 1 510550222 510550222 150.882 < 2.2e-16 ***
## s(perc.alumni, 5) 1 380369558 380369558 112.410 < 2.2e-16 ***
## s(Expend, 5) 1 477706594 477706594 141.176 < 2.2e-16 ***
## s(Grad.Rate, 5) 1 105543744 105543744 31.191 4.607e-08 ***
## Residuals 361 1221537676 3383760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, 5) 4 1.7907 0.1301
## s(Terminal, 5) 4 1.8909 0.1114
## s(perc.alumni, 5) 4 0.4726 0.7559
## s(Expend, 5) 4 12.5702 1.386e-09 ***
## s(Grad.Rate, 5) 4 1.4919 0.2041
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Anova for Nonparametric Effects table shows strong evidence of a non-linear relationship for Expend (p ≈ 1.4e-09), which matches its plot in part (b). For the other smoothed variables the non-parametric terms are not significant, so a linear relationship with out-of-state tuition appears adequate.
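To confirm this more directly, one could compare the fitted GAM against a version in which Expend enters linearly (a sketch; gam.lin.expend is my own name):
# Same GAM except Expend enters linearly; an F-test compares the two fits
gam.lin.expend=gam(Outstate~Private+s(Room.Board,5)+s(Terminal,5)+s(perc.alumni,5)+Expend+s(Grad.Rate,5),data=College,subset=train)
anova(gam.lin.expend,gam.gam,test="F")
A small p-value here would again point to a non-linear effect of Expend.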