library(ISLR)
head(Wage)
##        year age           maritl     race       education             region
## 231655 2006  18 1. Never Married 1. White    1. < HS Grad 2. Middle Atlantic
## 86582  2004  24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003  45       2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003  43       2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443  2005  50      4. Divorced 1. White      2. HS Grad 2. Middle Atlantic
## 376662 2008  54       2. Married 1. White 4. College Grad 2. Middle Atlantic
##              jobclass         health health_ins  logwage      wage
## 231655  1. Industrial      1. <=Good      2. No 4.318063  75.04315
## 86582  2. Information 2. >=Very Good      2. No 4.255273  70.47602
## 161300  1. Industrial      1. <=Good     1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good     1. Yes 5.041393 154.68529
## 11443  2. Information      1. <=Good     1. Yes 4.318063  75.04315
## 376662 2. Information 2. >=Very Good     1. Yes 4.845098 127.11574
attach(Wage)

6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

When selecting the optimal degree, d, for the polynomial, cross-validation is minimized at d = 9. A ninth-degree polynomial seems excessively flexible, and the error curve is nearly flat after the first few degrees, so we restrict attention to degrees 1 through 5, where d = 5 gives the lowest CV error. The ANOVA below tells a similar story: the quadratic and cubic terms are highly significant, the quartic term is borderline (p ≈ 0.051), and the quintic term is not significant (p ≈ 0.37), so hypothesis testing points to a degree of 3 or 4, slightly lower than the cross-validated choice.


library(boot)
set.seed(1)

cv.err = rep(0, 10)
for (i in 1:10){
  # fit a degree-i polynomial and record its 10-fold CV error
  glm.wage = glm(wage ~ poly(age, i), data = Wage)
  cv.err[i] = cv.glm(Wage, glm.wage, K = 10)$delta[1]
}

plot(1:10, cv.err, pch = 19, type = 'b', xlab = 'degree', ylab = 'error')

which.min(cv.err)
## [1] 9
which.min(cv.err[1:5])
## [1] 5
min.deg = 5
min.glm = glm(wage ~ poly(age, min.deg), data = Wage)
plot(age, wage)

# overlay the degree-5 fit on a grid spanning the observed ages
age.range = range(age)
age.predict = seq(from = age.range[1], to = age.range[2], length.out = 100)
wage.predict = predict(min.glm, newdata = list(age = age.predict))
lines(age.predict, wage.predict, col = 'blue')
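If desired, approximate 95% pointwise confidence bands can be added around this fit. A minimal sketch, using predict()'s se.fit argument and the ±2·SE convention from the chapter's labs (pred is an illustrative name):

# approximate 95% pointwise bands around the degree-5 fit
pred = predict(min.glm, newdata = list(age = age.predict), se.fit = TRUE)
lines(age.predict, pred$fit + 2 * pred$se.fit, col = 'blue', lty = 2)
lines(age.predict, pred$fit - 2 * pred$se.fit, col = 'blue', lty = 2)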

lm.1 = lm(wage ~ age, data = Wage)
lm.2 = lm(wage ~ poly(age, 2), data = Wage)
lm.3 = lm(wage ~ poly(age, 3), data = Wage)
lm.4 = lm(wage ~ poly(age, 4), data = Wage)
lm.5 = lm(wage ~ poly(age, 5), data = Wage)
anova(lm.1, lm.2, lm.3, lm.4, lm.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
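As a cross-check: because poly() generates orthogonal polynomials, the same conclusions can be read off the coefficient t-tests of the largest model, since each squared t-statistic equals the corresponding F-statistic above.

# squared t-statistics here reproduce the ANOVA F-statistics above
coef(summary(lm.5))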

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

After performing cross-validation, the optimal number of cuts appears to be 7. Note that the first entry of step.err corresponds to 2 cuts, so the error curve is plotted against 2:10.

set.seed(1)
step.err = rep(0, 9)
for (i in 2:10){
  # build a data frame whose predictor is age binned into i intervals
  wage.cut = model.frame(wage ~ cut(age, i), data = Wage)
  names(wage.cut) = c('wage', 'age')

  wage.glm = glm(wage ~ age, data = wage.cut)
  # 10-fold CV error; step.err[1] corresponds to 2 cuts
  step.err[i - 1] = cv.glm(wage.cut, wage.glm, K = 10)$delta[1]
}
plot(2:10, step.err, pch = 20, type = 'b', xlab = 'number of cuts', ylab = 'error')
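The exercise also asks for a plot of the fit itself. A minimal sketch with the chosen 7 cuts (cut.fit and age.grid are illustrative names); cut() recomputes its breaks on the prediction grid, which is safe here because the grid spans the same age range as the data:

cut.fit = glm(wage ~ cut(age, 7), data = Wage)
age.grid = seq(from = min(age), to = max(age), length.out = 100)
plot(age, wage, col = 'darkgrey')
lines(age.grid, predict(cut.fit, newdata = list(age = age.grid)), col = 'blue', lwd = 2)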

10. This question relates to the College data set.

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

Looking at the graph, the training MSE falls steadily as predictors are added; since regsubsets() considers at most eight variables by default (nvmax = 8), we take the 8-predictor model as a satisfactory choice.

set.seed(1)
# 70/30 train/test split
index = sample(1:nrow(College), 0.7 * nrow(College))
college.train = College[index, ]
college.test = College[-index, ]
library(leaps)
slt.for = regsubsets(Outstate ~ ., data = college.train, method = 'forward')
plot(summary(slt.for)$rss / nrow(college.train), type = 'b',
     xlab = 'number of predictors', ylab = 'training MSE')

which(summary(slt.for)$which[8,])
## (Intercept)  PrivateYes  Room.Board    Personal         PhD    Terminal 
##           1           2          10          12          13          14 
## perc.alumni      Expend   Grad.Rate 
##          16          17          18
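The coefficient estimates for this 8-variable model can be pulled out with coef(), which for regsubsets objects takes the model size as its second argument:

# coefficient estimates for the selected 8-predictor model
coef(slt.for, 8)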

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

Looking at the plots, most of the fitted functions appear fairly linear and increasing, while others, such as Personal, show a decreasing relationship with out-of-state tuition.

library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
college.gam = gam(Outstate ~ Private + s(Room.Board) + s(Personal) + s(PhD) + s(Terminal) + 
                    s(perc.alumni) + s(Expend) + s(Grad.Rate), data = college.train)
par(mfrow = c(3,3))
plot(college.gam, se=T, col='orange')
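Each s() term above uses the gam package's default of four degrees of freedom (visible as 3 nonparametric df in the summary in part (d)). The smoothness can also be set explicitly; a sketch with purely illustrative df values:

# illustrative df choices; the values are assumptions, not tuned
college.gam.df = gam(Outstate ~ Private + s(Room.Board, df = 2) + s(Personal, df = 2) +
                       s(PhD, df = 2) + s(Terminal, df = 2) + s(perc.alumni, df = 2) +
                       s(Expend, df = 5) + s(Grad.Rate, df = 2), data = college.train)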

(c) Evaluate the model obtained on the test set, and explain the results obtained.

Evaluating the model on the test set gives a test MSE of about 3.19 million; the test R² below shows that the GAM explains roughly 77% of the variance in out-of-state tuition on held-out data.

college.pred = predict(college.gam, newdata = college.test)
mean((college.pred - college.test$Outstate)^2)  # test MSE
## [1] 3187899
test.tss = sum((college.test$Outstate - mean(college.test$Outstate))^2)
test.rss = sum((college.pred - college.test$Outstate)^2)
1 - test.rss/test.tss  # test R^2
## [1] 0.7690495
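
For context, a minimal sketch comparing against an ordinary linear model with the same eight predictors (college.lm and lm.rss are illustrative names); if the GAM's test R² is noticeably higher, the smooth terms are adding real value:

# baseline: ordinary least squares with the same eight predictors
college.lm = lm(Outstate ~ Private + Room.Board + Personal + PhD + Terminal +
                  perc.alumni + Expend + Grad.Rate, data = college.train)
lm.rss = sum((predict(college.lm, newdata = college.test) - college.test$Outstate)^2)
1 - lm.rss/test.tss  # test R^2 for the linear baseline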

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

Looking at the "Anova for Nonparametric Effects" in the summary of our GAM, there is strong evidence of a non-linear relationship only for Expend (p < 2e-16). Room.Board is marginal (p ≈ 0.058), while the remaining smooth terms (Personal, PhD, Terminal, perc.alumni, and Grad.Rate) show no significant evidence of non-linearity.

summary(college.gam)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(Personal) + 
##     s(PhD) + s(Terminal) + s(perc.alumni) + s(Expend) + s(Grad.Rate), 
##     data = college.train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7319.46 -1161.67    80.25  1289.53  7094.06 
## 
## (Dispersion Parameter for gaussian family taken to be 3526025)
## 
##     Null Deviance: 9260683704 on 542 degrees of freedom
## Residual Deviance: 1808848954 on 512.9995 degrees of freedom
## AIC: 9758.202 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 2429063410 2429063410 688.896 < 2.2e-16 ***
## s(Room.Board)    1 1738946395 1738946395 493.175 < 2.2e-16 ***
## s(Personal)      1   94642288   94642288  26.841 3.180e-07 ***
## s(PhD)           1  595571621  595571621 168.907 < 2.2e-16 ***
## s(Terminal)      1   40005452   40005452  11.346  0.000813 ***
## s(perc.alumni)   1  397261386  397261386 112.665 < 2.2e-16 ***
## s(Expend)        1  802746515  802746515 227.663 < 2.2e-16 ***
## s(Grad.Rate)     1   82649866   82649866  23.440 1.708e-06 ***
## Residuals      513 1808848954    3526025                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df Npar F  Pr(F)    
## (Intercept)                             
## Private                                 
## s(Room.Board)        3  2.513 0.0578 .  
## s(Personal)          3  1.783 0.1494    
## s(PhD)               3  1.480 0.2191    
## s(Terminal)          3  1.755 0.1548    
## s(perc.alumni)       3  1.132 0.3356    
## s(Expend)            3 32.538 <2e-16 ***
## s(Grad.Rate)         3  1.694 0.1673    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
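
A natural follow-up, sketched on the assumption that anova() on nested gam fits behaves as it does for GLMs: keep only Expend smoothed, leave the other terms linear, and test whether the extra smooth terms in the full model are jointly worthwhile (college.gam.lin is an illustrative name).

# reduced model: only Expend gets a smooth term
college.gam.lin = gam(Outstate ~ Private + Room.Board + Personal + PhD + Terminal +
                        perc.alumni + s(Expend) + Grad.Rate, data = college.train)
anova(college.gam.lin, college.gam, test = 'F')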