6. In this exercise, you will further analyze the Wage data set considered throughout this chapter. (a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
# Using the caret package to perform 10-fold cross-validation for polynomial degrees 1-10
library(ISLR)    # Wage data set
library(caret)   # train() / trainControl()
set.seed(123)
fit_ctrl <- trainControl(method = "cv", number = 10)
cv_mse <- rep(0, 10)
for (i in 1:10) {
  poly_model <- train(y = Wage$wage,
                      x = poly(Wage$age, i),
                      method = "lm",
                      metric = "RMSE",
                      trControl = fit_ctrl)
  cv_mse[i] <- poly_model$results$RMSE
}
plot(cv_mse, type = "b", xlab = "Degree", ylab = "CV RMSE", main = "Cross-validated RMSE for different degree polynomials")
points(x = 3,y=cv_mse[3],col="green",pch = 19,cex=1)
cv_mse
## [1] 40.89461 39.88213 39.77288 39.81674 39.83530 39.78988 39.88233 39.89711
## [9] 39.83430 39.85711
A) Cross-validation shows that the lowest RMSE is obtained with a degree-3 polynomial of age.
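The chosen degree can also be read off programmatically rather than from the plot; a quick check using the cv_mse vector computed above:
# Index of the smallest cross-validated RMSE (here, the chosen polynomial degree)
which.min(cv_mse)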
Now we perform an ANOVA on the sequence of nested models with increasing polynomial degree in age.
\(H_0\): the simpler (lower-degree) model is sufficient. \(H_a\): the more complex (higher-degree) model is needed.
fit.1 = lm(wage ~ age ,data=Wage)
fit.2 = lm(wage ~ poly(age ,2) ,data=Wage)
fit.3 = lm(wage ~ poly(age ,3) ,data=Wage)
fit.4 = lm(wage ~ poly(age ,4) ,data=Wage)
fit.5 = lm(wage ~ poly(age ,5) ,data=Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA shows that the cubic term is highly significant (p ≈ 0.0017), the quartic term is only borderline (p ≈ 0.051), and the quintic term is not significant. So a degree-3 (or at most degree-4) polynomial is sufficient, which agrees well with the cross-validation result, where the lowest RMSE occurred at degree 3.
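Because poly() produces orthogonal polynomials, the same p-values can be recovered from the coefficient t-statistics of the largest model (the squares of those t-statistics equal the ANOVA F-statistics); a quick cross-check:
# Coefficient table of the degree-5 fit; squaring the t-statistics reproduces the ANOVA F-values
coef(summary(fit.5))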
Below I plot the cubic fit chosen by cross-validation, along with the degree-4 fit for comparison.
ggplot(Wage, aes(x = age, y = wage)) +
  geom_point(alpha = 0.3, color = "darkgrey") +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3)) +
  labs(title = "Wage as a function of age - polynomial (d = 3) regression",
       subtitle = "Predicting 'wage' with a cubic polynomial of 'age'")
ggplot(Wage, aes(x = age, y = wage)) +
  geom_point(alpha = 0.3, color = "darkgrey") +
  geom_smooth(method = "lm", formula = y ~ poly(x, 4)) +
  labs(title = "Wage as a function of age - polynomial (d = 4) regression",
       subtitle = "Predicting 'wage' with a quartic polynomial of 'age'")
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
# 10-fold cross-validation for step functions with 2 to 20 cuts of age
set.seed(123)
cv_step <- rep(0, 19)
j <- 1
for (i in 2:20) {
  model_step <- train(y = Wage$wage,
                      x = data.frame(cut(Wage$age, i)),
                      method = "lm",
                      metric = "RMSE",
                      trControl = fit_ctrl)
  cv_step[j] <- model_step$results$RMSE   # cv_step[j] holds the RMSE for j + 1 cuts
  j <- j + 1
}
plot(2:20, cv_step, type = "b", xlab = "Number of cuts", ylab = "Step function CV RMSE")
points(x = 12, y = cv_step[11], col = "green", pch = 19, cex = 1)
A) Cross-validation shows that the lowest RMSE is obtained with 12 cuts of age.
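Since the loop starts at 2 cuts, the index of the minimum must be shifted by one to recover the chosen number of cuts; a quick check:
# cv_step[j] corresponds to j + 1 cuts, so add 1 to the index of the minimum
which.min(cv_step) + 1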
# Fit the 12-cut step function and plot it with +/- 2 SE bands over the age range
agelims  <- range(Wage$age)
age_grid <- seq(from = agelims[1], to = agelims[2])
step_fit   <- lm(wage ~ cut(age, 12), data = Wage)
step_preds <- predict(step_fit, newdata = list(age = age_grid), se = TRUE)
se.bands <- cbind(step_preds$fit + 2 * step_preds$se.fit,
                  step_preds$fit - 2 * step_preds$se.fit)
plot(Wage$age, Wage$wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Step function using 12 cuts")
lines(age_grid, step_preds$fit, lwd = 2, col = "#2c7fb8")
matlines(age_grid, se.bands, lwd = 1, col = "blue", lty = 3)
10. This question relates to the College data set. (a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
# Training and test data set prep (80/20 split)
library(caTools)   # sample.split()
library(leaps)     # regsubsets()
set.seed(1)
myCollege <- College
sample <- sample.split(myCollege$Outstate, SplitRatio = 0.80)
college_train <- subset(myCollege, sample == TRUE)
college_test  <- subset(myCollege, sample == FALSE)
# Forward stepwise selection on the training set using all 17 predictors
col.fwd <- regsubsets(Outstate ~ ., data = college_train, nvmax = 17, method = "forward")
col.summary = summary(col.fwd)
# Some Statistical metrics.
which.min(col.summary$cp) #lower is better.
## [1] 14
which.min(col.summary$bic) #lower is better.
## [1] 8
which.max(col.summary$adjr2) #higher the better.
## [1] 15
par(mfrow=c(1,3))
plot(col.summary$cp, type="b", xlab="Model#", ylab="Cp", main="Mallows' Cp")
points(x =14,y=col.summary$cp[14],col="green",pch = 19,cex=1)
text(12, 200, "low at Model#14",
cex = .8)
plot(col.summary$bic, type="b", xlab="Model#", ylab="BIC",main="BIC")
points(x =8,y=col.summary$bic[8],col="green",pch = 19,cex=1)
text(13, -700, "low at Model#8",
cex = .8)
plot(col.summary$adjr2, type="b", xlab="Model#", ylab="AdjR-square",main="Adj R Squared")
points(x =15,y=col.summary$adjr2[15],col="green",pch = 19,cex=1)
text(13, 0.7, "high at Model#15",
cex = .8)
par(mfrow=c(1,1))
coef(col.fwd,8)
## (Intercept) PrivateYes Accept Enroll Room.Board
## -2613.2528603 2827.3425089 0.4583878 -0.9646202 0.8690132
## PhD perc.alumni Expend Grad.Rate
## 30.1308505 60.6662833 0.2068679 22.3151954
A) Mallows' Cp and BIC show little improvement beyond the model with 8 variables, and adjusted R² likewise gains very little past 8 variables, so I select the 8-variable model (the simpler model is preferred).
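As an additional check (not part of the original write-up), the test set can be used to compute the validation MSE for every model size returned by forward selection; leaps provides no predict() method for regsubsets objects, so a small helper is needed. If the validation error curve flattens out around 8 predictors, that supports the same choice.
# Helper: predictions from a regsubsets fit for a given model size (id)
predict.regsubsets <- function(object, newdata, id, ...) {
  form  <- as.formula(object$call[[2]])   # formula used in the regsubsets() call
  mat   <- model.matrix(form, newdata)    # design matrix for the new data
  coefi <- coef(object, id = id)          # coefficients of the id-variable model
  mat[, names(coefi)] %*% coefi
}
# Validation-set MSE for each of the 17 model sizes
val_mse <- sapply(1:17, function(id) {
  preds <- predict(col.fwd, college_test, id = id)
  mean((college_test$Outstate - preds)^2)
})
plot(val_mse, type = "b", xlab = "Number of predictors", ylab = "Validation MSE")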
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)   # gam() with smoothing-spline terms s()
col_gam <- gam(Outstate ~ Private +
                 s(Accept, 4) +
                 s(Enroll, 3) +
                 s(Room.Board) +
                 s(PhD, 3) +
                 s(perc.alumni, 4) +
                 s(Expend, 4) +
                 s(Grad.Rate, 3), data = college_train)
par(mfrow = c(2, 3))
plot(col_gam, col = "blue", se = TRUE)
A) Several predictors show clearly non-linear behavior. Accept (number of applications accepted) has a non-linear, generally increasing effect on out-of-state tuition. PhD (percent of faculty with PhDs) is roughly flat between about 30% and 60% and then pushes tuition upward beyond 60%. Expend (instructional expenditure per student) rises steeply at first and then levels off, with a slight decreasing trend beyond a certain point. perc.alumni, by contrast, has an almost linear relationship with out-of-state tuition.
(c) Evaluate the model obtained on the test set, and explain the results obtained.
# GAM predictions using the test data set
gam_preds = predict(col_gam,newdata = college_test)
gam_test_mse = mean((college_test$Outstate - gam_preds)^2)
gam_rss <- sum((college_test$Outstate - gam_preds)^2) #residual sum of squares
tss_gam <- sum((college_test$Outstate - mean(college_test$Outstate))^2)#total sum of squares
gam_test_r2 <- 1-(gam_rss/tss_gam)
gam_test_r2
## [1] 0.7878314
#comparing against a plain linear model
col_lm <- lm(Outstate~Private+Accept+Enroll+Room.Board+PhD+perc.alumni+Expend+Grad.Rate, data=college_train)
lm_preds = predict(col_lm,newdata = college_test)
lm_test_mse = mean((college_test$Outstate - lm_preds)^2)
lm_rss <- sum((college_test$Outstate - lm_preds)^2) #residual sum of squares
lm_test_r2 <- 1-(lm_rss/tss_gam)
lm_test_r2
## [1] 0.7433491
A) Comparing the two metrics, test MSE and test R², the GAM performs better than the linear model, which also suggests that some predictors have a non-linear relationship with the response. The statistics are:
\(\text{GAM's MSE} = 4.0502164 \times 10^{6}\) is less than the \(\text{linear model's MSE} = 4.8993672 \times 10^{6}\).
\(\text{GAM's } R^2 = 0.7878314\) is greater than the \(\text{linear model's } R^2 = 0.7433491\).
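These numbers can also be tabulated side by side directly from the quantities computed above:
# Test-set comparison of the GAM and the plain linear model
data.frame(model    = c("GAM", "Linear"),
           test_MSE = c(gam_test_mse, lm_test_mse),
           test_R2  = c(gam_test_r2, lm_test_r2))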
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(col_gam)
##
## Call: gam(formula = Outstate ~ Private + s(Accept, 4) + s(Enroll, 3) +
## s(Room.Board) + s(PhD, 3) + s(perc.alumni, 4) + s(Expend,
## 4) + s(Grad.Rate, 3), data = college_train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6558.2 -1023.1 106.6 1160.9 7958.4
##
## (Dispersion Parameter for gaussian family taken to be 3160951)
##
## Null Deviance: 9578775148 on 620 degrees of freedom
## Residual Deviance: 1877603286 on 593.9995 degrees of freedom
## AIC: 11084.84
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2557796574 2557796574 809.186 < 2.2e-16 ***
## s(Accept, 4) 1 731826879 731826879 231.521 < 2.2e-16 ***
## s(Enroll, 3) 1 139797656 139797656 44.227 6.642e-11 ***
## s(Room.Board) 1 1179051601 1179051601 373.005 < 2.2e-16 ***
## s(PhD, 3) 1 482735849 482735849 152.719 < 2.2e-16 ***
## s(perc.alumni, 4) 1 487912630 487912630 154.356 < 2.2e-16 ***
## s(Expend, 4) 1 672178145 672178145 212.651 < 2.2e-16 ***
## s(Grad.Rate, 3) 1 64106292 64106292 20.281 8.052e-06 ***
## Residuals 594 1877603286 3160951
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Accept, 4) 3 9.923 2.164e-06 ***
## s(Enroll, 3) 2 5.690 0.003567 **
## s(Room.Board) 3 1.682 0.169714
## s(PhD, 3) 2 1.471 0.230513
## s(perc.alumni, 4) 3 0.571 0.634481
## s(Expend, 4) 3 37.030 < 2.2e-16 ***
## s(Grad.Rate, 3) 2 1.196 0.303181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A) To assess non-linearity we can run summary() on the fitted GAM. Looking at the "Anova for Nonparametric Effects" table with \(\alpha = 0.05\), only Accept, Enroll and Expend show significant evidence of a non-linear relationship with the response, i.e., the spline terms matter for these predictors.
The other smooth terms in the fit, Room.Board, PhD, perc.alumni and Grad.Rate, show no significant evidence of non-linearity.
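A nested-model F-test gives another view of the same question for an individual predictor; as a sketch (gam_expend_lin is a name introduced here, not part of the original analysis), we can compare a GAM in which Expend enters linearly against the fit with s(Expend, 4):
# Same model as col_gam, except that Expend enters as a linear term
gam_expend_lin <- gam(Outstate ~ Private + s(Accept, 4) + s(Enroll, 3) +
                        s(Room.Board) + s(PhD, 3) + s(perc.alumni, 4) +
                        Expend + s(Grad.Rate, 3), data = college_train)
# F-test: does the smooth term in Expend significantly improve the fit?
anova(gam_expend_lin, col_gam, test = "F")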
We can now check whether refitting the GAM in light of this, keeping splines only for Accept, Enroll and Expend and entering the remaining predictors linearly, changes the predictive performance:
col_gam_v2 <- gam(Outstate ~ Private +
                    s(Accept, 4) +
                    s(Enroll, 3) +
                    Room.Board +
                    PhD +
                    perc.alumni +
                    s(Expend, 4) +
                    Grad.Rate, data = college_train)
gam_preds2 = predict(col_gam_v2,newdata = college_test)
gam_test_mse2 = mean((college_test$Outstate - gam_preds2)^2)
gam_rss2 <- sum((college_test$Outstate - gam_preds2)^2) #residual sum of squares
gam2_test_r2 <- 1-(gam_rss2/tss_gam)
gam2_test_r2
## [1] 0.7839389
After refitting the GAM, the test R² is essentially unchanged (0.784 vs. 0.788), so replacing the non-significant smooth terms with linear terms costs almost nothing in predictive power.
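Since the two GAMs are nested, an F-test on the training fits provides a formal version of the same comparison (an extra check, not in the original write-up):
# Do the smooth terms dropped in col_gam_v2 contribute significantly?
anova(col_gam_v2, col_gam, test = "F")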