library(ISLR)
library(boot)
set.seed(550)
poly.mse=c()
for(degree in 1:7){
poly.fit=glm(wage~poly(age,degree,raw=T),data=Wage)
mse=cv.glm(poly.fit,data = Wage,K=10)$delta[1]
poly.mse=c(poly.mse,mse)
}
plot(poly.mse,xlab='Degree of Polynomial',ylab='Cross Validation Error',type='l')
x=which.min(poly.mse)
points(x,poly.mse[x],pch=20,cex=2,col='red')
According to the graph above the best choice for of polynomial is the 4th degree. The differences between degree 4 and all the degrees higher is extremely slight so choosing the 4th degree is a good choice so that we do not overfit the model as well, this aligns with ANOVA.
set.seed(45)
step.mse=c()
for(br in 2:10){
Wage.model=model.frame(wage~cut(age,br),data=Wage)
names(Wage.model)=c('wage','age')
step.fit=glm(wage~age,data=Wage.model)
mse=cv.glm(step.fit,data = Wage.model,K=10)$delta[1]
step.mse=c(step.mse,mse)
}
plot(step.mse,xlab='Degree of Polynomial',ylab='Cross Validation Error',type='l')
x=which.min(step.mse)
points(x,step.mse[x],pch=20,cex=2,col='red')
By looking at the graph we can see that the best degree of polynomial for the step function is a 7th degree polynomial.
library(leaps)
set.seed(25)
train = sample(1:nrow(College),500)
test = -train
forward=regsubsets(Outstate~.,data=College,method = 'forward',nvmax = 17)
plot(1/nrow(College)*summary(forward)$rss,type='l',xlab='Number of Predictors',ylab='Train MSE Score',xaxt='n')
axis(side=1,at=seq(1,17,2),labels = seq(1,17,2))
By looking at the plot we can see that it starts to level out at roughly 7 predictors. Again, we do not want to overfit the model.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
gam.fit=gam(Outstate~Private+s(Room.Board)+s(Personal)+s(PhD)+s(perc.alumni)+s(Expend)+s(Grad.Rate),data=College[train,])
par(mfrow = c(2, 3))
plot(gam.fit,se=T,col='blue')
It seems that there is a linear effect on perc.alumni and room.board, phd, personal, and grad seem non-linear and the others seem somewhat logarithmic but it is hard to tell.
gam.pred=predict(gam.fit,College[test,])
gam.mse=mean((College[test,'Outstate']-gam.pred)^2)
gam.mse
## [1] 3394629
gam.tss = mean((College[test,'Outstate'] - mean(College[test,'Outstate']))^2)
test.rss = 1 - gam.mse/gam.tss
test.rss
## [1] 0.7609632
Using the test data set actually gives better prediction than the training set. This is due to the MSE being lower on the test set. We also see that 76% of the variance is explained by the model.
summary(gam.fit)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(Personal) +
## s(PhD) + s(perc.alumni) + s(Expend) + s(Grad.Rate), data = College[train,
## ])
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7263.54 -1065.70 -15.66 1242.89 7380.57
##
## (Dispersion Parameter for gaussian family taken to be 3463301)
##
## Null Deviance: 8551607948 on 499 degrees of freedom
## Residual Deviance: 1641603752 on 473.9997 degrees of freedom
## AIC: 8975.105
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2345446286 2345446286 677.229 < 2.2e-16 ***
## s(Room.Board) 1 1830860020 1830860020 528.646 < 2.2e-16 ***
## s(Personal) 1 40714962 40714962 11.756 0.0006594 ***
## s(PhD) 1 596069792 596069792 172.110 < 2.2e-16 ***
## s(perc.alumni) 1 257133836 257133836 74.245 < 2.2e-16 ***
## s(Expend) 1 708778950 708778950 204.654 < 2.2e-16 ***
## s(Grad.Rate) 1 80798374 80798374 23.330 1.845e-06 ***
## Residuals 474 1641603752 3463301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board) 3 1.5551 0.19951
## s(Personal) 3 1.9590 0.11929
## s(PhD) 3 1.2622 0.28676
## s(perc.alumni) 3 0.9197 0.43115
## s(Expend) 3 28.2764 < 2e-16 ***
## s(Grad.Rate) 3 2.3563 0.07114 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary (shown above) shows that there is no significant evidence that a non-lenear relationship for Grad.Rate, PHD, Personal, Expend, and Room.Board exists.