6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.
library(ISLR)
attach(Wage)
(a) Perform polynomial regression to predict wage using age.
# Polynomial Regression and Step Functions
fit <- lm(wage ~ poly(age, 4), data = Wage)
coef(summary(fit))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
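Note that poly() uses an orthogonal-polynomial basis by default, so the coefficients above are not the coefficients of age, age^2, and so on; the fitted values, however, are identical to those from the raw power basis. A minimal check (not part of the original output):
# Refit with the raw basis and confirm the fitted values match
fit.raw <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
max(abs(fitted(fit) - fitted(fit.raw)))  # effectively zero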
Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(ISLR)
library(boot)
set.seed(1)
cv.error <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.error[i] <- cv.glm(Wage, glm.fit)$delta[1]  # LOOCV error estimate
}
cv.error
## [1] 1676.235 1600.529 1595.960 1594.596 1594.879 1594.119 1594.145 1594.975
## [9] 1593.356 1594.232
The cross-validation output shows that the error drops sharply from degree 1 to degree 4; beyond that there is little further reduction in the estimated test error.
plot(1:10, cv.error, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.error)
points(deg.min, cv.error[deg.min], col = 'red', cex = 2, pch = 19)
The plot shows that the minimum test MSE occurs at degree 9, but the curve is essentially flat beyond degree 3 or 4, so the degree-9 minimum may simply reflect noise.
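A quick check of this (an addition to the original analysis): rerunning the loop with 10-fold CV instead of LOOCV gives slightly different numbers, but if the curve again stays flat past degree 3 or 4, that supports treating the degree-9 minimum as noise.
set.seed(1)
cv.error10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.error10[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]  # 10-fold CV
}
round(cv.error10, 1)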
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
fit.6 <- lm(wage ~ poly(age, 6), data = Wage)
fit.7 <- lm(wage ~ poly(age, 7), data = Wage)
fit.8 <- lm(wage ~ poly(age, 8), data = Wage)
fit.9 <- lm(wage ~ poly(age, 9), data = Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5,fit.6,fit.7,fit.8,fit.9)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
## Model 8: wage ~ poly(age, 8)
## Model 9: wage ~ poly(age, 9)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.8118 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.9038 0.001666 **
## 4 2995 4771604 1 6070 3.8156 0.050870 .
## 5 2994 4770322 1 1283 0.8062 0.369318
## 6 2993 4766389 1 3932 2.4718 0.116014
## 7 2992 4763834 1 2555 1.6062 0.205123
## 8 2991 4763707 1 127 0.0796 0.777829
## 9 2990 4756703 1 7004 4.4028 0.035963 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value comparing model 1 (linear) to model 2 (quadratic) is essentially zero (< 2.2e-16), so a linear fit is not sufficient. The p-value comparing model 2 to model 3 is 0.0017, so the quadratic fit is also insufficient. The p-value comparing the cubic to the degree-4 polynomial is about 0.05, while the degree-5 polynomial appears unnecessary (p = 0.37). Hence a cubic or degree-4 polynomial is a reasonable fit, and lower or higher orders are not justified (the isolated significant p-value for the degree-9 term is hard to act on given the non-significant intermediate terms). ANOVA therefore favors a much more parsimonious model than the degree-9 minimum found by cross-validation, whose advantage in CV error is negligible.
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit.4, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
par(mfrow = c(1, 1), mar = c(4.5, 4.5, 1, 1), oma = c(0, 0, 4, 0))
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Degree-4 Polynomial", outer = TRUE)
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "red", lty = 3)
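For comparison with the cross-validation result, the degree-9 fit can be overlaid on the same plot (a sketch added here; fit.9 is the model from the ANOVA comparison above). If the two curves are close over most of the age range, that further supports preferring the simpler degree-4 model.
preds9 <- predict(fit.9, newdata = list(age = age.grid))
lines(age.grid, preds9, lwd = 2, col = "darkgreen")
legend("topright", legend = c("Degree 4", "Degree 9"),
       col = c("blue", "darkgreen"), lwd = 2)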
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
library(ISLR)
library(boot)
set.seed(1)
cuts <- 10                           # maximum number of intervals to try
cv.error <- rep(NA, cuts)
for (i in 2:cuts) {                  # cut() requires at least 2 intervals
  Wage$agecut <- cut(age, i)
  glm.fit <- glm(wage ~ agecut, data = Wage)
  cv.error[i] <- cv.glm(Wage, glm.fit)$delta[1]
}
cv.error
## [1] NA 1734.064 1682.763 1635.894 1631.450 1623.291 1611.604 1601.006
## [9] 1610.213 1604.804
plot(2:cuts, cv.error[-1], xlab = "Cuts", ylab = "Test MSE", type = "l")
cut.min <- which.min(cv.error)
points(cut.min, cv.error[cut.min], col = "red", cex = 2, pch = 19)
The CV errors indicate that the optimal number of cuts is 8.
plot(wage ~ age, data = Wage, col = "darkgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
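To see where the eight intervals fall and how many observations each contains, a quick check (not part of the original output):
# Interval boundaries chosen by cut() and the bin counts
table(cut(Wage$age, 8))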
10. This question relates to the College data set.
library(ISLR)
attach(College)
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
# Splitting data into training (75%) and test (25%) sets
set.seed(1)
trainingind <- sample(nrow(College), 0.75 * nrow(College))
head(trainingind)
## [1] 679 129 509 471 299 270
train<-College[trainingind, ]
test<-College[-trainingind, ]
dim(College)
## [1] 777 18
dim(train)
## [1] 582 18
dim(test)
## [1] 195 18
# Performing forward stepwise selection
set.seed(1)
library(leaps)
regfit.fwd <- regsubsets(Outstate ~ ., data = train, nvmax = 17, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = train, nvmax = 17, method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " "*" " " " "
## 9 ( 1 ) "*" " " " " " " "*" " " " "
## 10 ( 1 ) "*" " " "*" " " "*" " " " "
## 11 ( 1 ) "*" "*" "*" " " "*" " " " "
## 12 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 13 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " "*" " " " "
## 6 ( 1 ) " " "*" " " " " "*" " " " "
## 7 ( 1 ) " " "*" " " "*" "*" " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " "
## 9 ( 1 ) " " "*" " " "*" "*" "*" " "
## 10 ( 1 ) " " "*" " " "*" "*" "*" " "
## 11 ( 1 ) " " "*" " " "*" "*" "*" " "
## 12 ( 1 ) " " "*" " " "*" "*" "*" " "
## 13 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 14 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 15 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" " " "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" " "
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
## 9 ( 1 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
reg.summary <- summary(regfit.fwd)
which.max(reg.summary$adjr2)
## [1] 13
coef(regfit.fwd,13)
## (Intercept) PrivateYes Apps Accept Top10perc
## -1830.2240567 2314.2340816 -0.3031429 0.7206354 30.0607613
## F.Undergrad Room.Board Personal PhD Terminal
## -0.1338858 0.9298037 -0.3683801 10.1747305 23.4453401
## S.F.Ratio perc.alumni Expend Grad.Rate
## -43.0149304 48.5868427 0.1727903 21.3217639
Based on adjusted R-squared, the best forward-stepwise model uses 13 predictors: PrivateYes, Apps, Accept, Top10perc, F.Undergrad, Room.Board, Personal, PhD, Terminal, S.F.Ratio, perc.alumni, Expend, and Grad.Rate, with coefficients as shown above. The plot below visualizes this choice.
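As a sanity check, adjusted R-squared can be plotted against model size, mirroring the CV plots earlier (a small addition; reg.summary comes from the code above):
plot(reg.summary$adjr2, xlab = "Number of predictors",
     ylab = "Adjusted R-squared", type = "l")
best <- which.max(reg.summary$adjr2)
points(best, reg.summary$adjr2[best], col = "red", cex = 2, pch = 19)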
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20.1
gam.mod <- gam(Outstate ~ Private + s(Apps, 5) + s(Accept, 5) + s(Top10perc, 5) +
                 s(F.Undergrad, 5) + s(Room.Board, 5) + s(Personal, 5) +
                 s(PhD, 5) + s(Terminal, 5) + s(S.F.Ratio, 5) +
                 s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5),
               data = train)
par(mfrow = c(3, 4))
plot(gam.mod, se = TRUE)
The fitted smooths show out-of-state tuition generally increasing with Room.Board, perc.alumni, and Grad.Rate, consistent with their positive coefficients in the forward-stepwise fit above, while several terms (most visibly Expend) bend away from a straight line; part (d) tests this non-linearity formally.
(c) Evaluate the model obtained on the test set, and explain the results obtained.
preds <- predict(gam.mod, newdata = test)
RSS <- sum((test$Outstate - preds)^2)
TSS <- sum((test$Outstate - mean(test$Outstate))^2)
Rsquared <- 1 - (RSS / TSS)
Rsquared
## [1] 0.7783117
The test R-squared is 0.778, meaning the GAM explains about 77.8% of the variability of out-of-state tuition on the test set.
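A complementary error measure is the test RMSE, which is on the dollar scale of Outstate (a small addition using the preds computed above):
mse <- mean((test$Outstate - preds)^2)
sqrt(mse)  # typical prediction error in dollars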
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.mod)
##
## Call: gam(formula = Outstate ~ Private + s(Apps, 5) + s(Accept, 5) +
## s(Top10perc, 5) + s(F.Undergrad, 5) + s(Room.Board, 5) +
## s(Personal, 5) + s(PhD, 5) + s(Terminal, 5) + s(S.F.Ratio,
## 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5),
## data = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5956.97 -1082.77 75.16 1113.16 6750.38
##
## (Dispersion Parameter for gaussian family taken to be 3100868)
##
## Null Deviance: 9798277171 on 581 degrees of freedom
## Residual Deviance: 1612449052 on 519.9993 degrees of freedom
## AIC: 10411.35
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2560016475 2560016475 825.5807 < 2.2e-16 ***
## s(Apps, 5) 1 1069142945 1069142945 344.7883 < 2.2e-16 ***
## s(Accept, 5) 1 166528289 166528289 53.7038 8.962e-13 ***
## s(Top10perc, 5) 1 1107015390 1107015390 357.0018 < 2.2e-16 ***
## s(F.Undergrad, 5) 1 322284617 322284617 103.9337 < 2.2e-16 ***
## s(Room.Board, 5) 1 571444919 571444919 184.2855 < 2.2e-16 ***
## s(Personal, 5) 1 50347895 50347895 16.2367 6.425e-05 ***
## s(PhD, 5) 1 57299250 57299250 18.4785 2.052e-05 ***
## s(Terminal, 5) 1 25174827 25174827 8.1186 0.0045549 **
## s(S.F.Ratio, 5) 1 84735247 84735247 27.3263 2.493e-07 ***
## s(perc.alumni, 5) 1 135824401 135824401 43.8021 9.060e-11 ***
## s(Expend, 5) 1 473699785 473699785 152.7636 < 2.2e-16 ***
## s(Grad.Rate, 5) 1 41377937 41377937 13.3440 0.0002855 ***
## Residuals 520 1612449052 3100868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Apps, 5) 4 1.9824 0.095880 .
## s(Accept, 5) 4 16.9375 4.583e-13 ***
## s(Top10perc, 5) 4 1.9224 0.105354
## s(F.Undergrad, 5) 4 4.5613 0.001253 **
## s(Room.Board, 5) 4 1.6018 0.172487
## s(Personal, 5) 4 2.9606 0.019460 *
## s(PhD, 5) 4 2.2780 0.059855 .
## s(Terminal, 5) 4 1.6920 0.150444
## s(S.F.Ratio, 5) 4 3.7650 0.004970 **
## s(perc.alumni, 5) 4 2.3135 0.056509 .
## s(Expend, 5) 4 21.1692 3.331e-16 ***
## s(Grad.Rate, 5) 4 0.9614 0.428241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The nonparametric ANOVA shows strong evidence of a non-linear relationship with Outstate for Accept, F.Undergrad, Personal, S.F.Ratio, and Expend (p < 0.05), and weaker evidence for Apps, PhD, and perc.alumni (p < 0.1).
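As a follow-up sketch (an extension beyond the original answer, with the variable split chosen from the table above), one could keep smooth terms only where the nonparametric ANOVA found strong evidence of non-linearity, use linear terms elsewhere, and test whether the simpler model suffices:
# Smooths only for Accept, F.Undergrad, Personal, S.F.Ratio, Expend;
# linear terms for the remaining predictors
gam.lin <- gam(Outstate ~ Private + Apps + s(Accept, 5) + Top10perc +
                 s(F.Undergrad, 5) + Room.Board + s(Personal, 5) + PhD +
                 Terminal + s(S.F.Ratio, 5) + perc.alumni + s(Expend, 5) +
                 Grad.Rate, data = train)
anova(gam.lin, gam.mod, test = "F")
A large p-value here would indicate that linear terms are adequate for the variables that showed no evidence of non-linearity.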