Assignment6

Question 6a

library(ISLR)
library(boot)

set.seed(550)
poly.mse=c()
for(degree in 1:7){
  poly.fit=glm(wage~poly(age,degree,raw=T),data=Wage)
  mse=cv.glm(poly.fit,data = Wage,K=10)$delta[1]
  poly.mse=c(poly.mse,mse)
}

plot(poly.mse,xlab='Degree of Polynomial',ylab='Cross Validation Error',type='l')
x=which.min(poly.mse)
points(x,poly.mse[x],pch=20,cex=2,col='red')

According to the graph above the best choice for of polynomial is the 4th degree. The differences between degree 4 and all the degrees higher is extremely slight so choosing the 4th degree is a good choice so that we do not overfit the model as well, this aligns with ANOVA.

Question 6b

set.seed(45)
step.mse=c()
for(br in 2:10){
  Wage.model=model.frame(wage~cut(age,br),data=Wage)
  names(Wage.model)=c('wage','age')
  
  step.fit=glm(wage~age,data=Wage.model)
  mse=cv.glm(step.fit,data = Wage.model,K=10)$delta[1]
  step.mse=c(step.mse,mse)
}

plot(step.mse,xlab='Degree of Polynomial',ylab='Cross Validation Error',type='l')
x=which.min(step.mse)
points(x,step.mse[x],pch=20,cex=2,col='red')

By looking at the graph we can see that the best degree of polynomial for the step function is a 7th degree polynomial.

Question 10a

library(leaps)
set.seed(25)
train = sample(1:nrow(College),500)
test = -train

forward=regsubsets(Outstate~.,data=College,method = 'forward',nvmax = 17)

plot(1/nrow(College)*summary(forward)$rss,type='l',xlab='Number of Predictors',ylab='Train MSE Score',xaxt='n')
axis(side=1,at=seq(1,17,2),labels = seq(1,17,2))

By looking at the plot we can see that it starts to level out at roughly 7 predictors. Again, we do not want to overfit the model.

Question 10b

library(gam)

## Loading required package: splines

## Loading required package: foreach

## Loaded gam 1.20

gam.fit=gam(Outstate~Private+s(Room.Board)+s(Personal)+s(PhD)+s(perc.alumni)+s(Expend)+s(Grad.Rate),data=College[train,])
par(mfrow = c(2, 3))
plot(gam.fit,se=T,col='blue')

It seems that there is a linear effect on perc.alumni and room.board, phd, personal, and grad seem non-linear and the others seem somewhat logarithmic but it is hard to tell.

Question 10c

gam.pred=predict(gam.fit,College[test,])
gam.mse=mean((College[test,'Outstate']-gam.pred)^2)
gam.mse

## [1] 3394629

gam.tss = mean((College[test,'Outstate'] - mean(College[test,'Outstate']))^2)
test.rss = 1 - gam.mse/gam.tss
test.rss

## [1] 0.7609632

Using the test data set actually gives better prediction than the training set. This is due to the MSE being lower on the test set. We also see that 76% of the variance is explained by the model.

Question 10d

summary(gam.fit)

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(Personal) + 
##     s(PhD) + s(perc.alumni) + s(Expend) + s(Grad.Rate), data = College[train, 
##     ])
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7263.54 -1065.70   -15.66  1242.89  7380.57 
## 
## (Dispersion Parameter for gaussian family taken to be 3463301)
## 
##     Null Deviance: 8551607948 on 499 degrees of freedom
## Residual Deviance: 1641603752 on 473.9997 degrees of freedom
## AIC: 8975.105 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 2345446286 2345446286 677.229 < 2.2e-16 ***
## s(Room.Board)    1 1830860020 1830860020 528.646 < 2.2e-16 ***
## s(Personal)      1   40714962   40714962  11.756 0.0006594 ***
## s(PhD)           1  596069792  596069792 172.110 < 2.2e-16 ***
## s(perc.alumni)   1  257133836  257133836  74.245 < 2.2e-16 ***
## s(Expend)        1  708778950  708778950 204.654 < 2.2e-16 ***
## s(Grad.Rate)     1   80798374   80798374  23.330 1.845e-06 ***
## Residuals      474 1641603752    3463301                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df  Npar F   Pr(F)    
## (Intercept)                               
## Private                                   
## s(Room.Board)        3  1.5551 0.19951    
## s(Personal)          3  1.9590 0.11929    
## s(PhD)               3  1.2622 0.28676    
## s(perc.alumni)       3  0.9197 0.43115    
## s(Expend)            3 28.2764 < 2e-16 ***
## s(Grad.Rate)         3  2.3563 0.07114 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary (shown above) shows that there is no significant evidence that a non-lenear relationship for Grad.Rate, PHD, Personal, Expend, and Room.Board exists.