library(ISLR)
Let’s run 10-fold cross-validation in a for loop and save the resulting MSE estimates.
library(boot)
## Warning: package 'boot' was built under R version 3.6.2
# set seed
set.seed(1)
# 10-fold cross-validation
# Try polynomial degrees 1 through 7 to find the best degree
# Use delta[2], the adjusted cross-validation estimate; the adjustment
# compensates for the bias introduced by not using leave-one-out CV
cv.error = rep(0, 7)
for (d in 1:7) {
  poly.fit = glm(wage ~ poly(age, d), data = Wage)
  cv.error[d] = cv.glm(Wage, poly.fit, K = 10)$delta[2]
}
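As an aside, here is a minimal sketch of what delta contains (demo.fit is just an illustrative name): cv.glm() returns two values, the raw K-fold estimate and the bias-corrected one we store above.
# Illustration: delta[1] is the raw K-fold CV estimate, delta[2] the corrected one
demo.fit = glm(wage ~ poly(age, 2), data = Wage)
cv.glm(Wage, demo.fit, K = 10)$delta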
Let’s find the optimal degree d by locating the minimum cross-validation error.
mse = min(cv.error)
d = which.min(cv.error)
cat("The optimal degree d is ",d, " and the corresponding lowest 10-fold CV error is ", mse)
## The optimal degree d is 7 and the corresponding lowest 10-fold CV error is 1593.962
The optimal degree comes out as 7 when we search up to degree 7, but such a high degree risks overfitting, since the data set has only 3,000 observations.
Let’s plot the cross-validation errors.
plot(cv.error,xlab='Degree of Polynomial',ylab='Cross Validation Error',type='l')
points(d,cv.error[d],pch=20,cex=2,col='red')
points(5,cv.error[5],pch=20,cex=2,col='green')
title("Cross Validation Errors at different degrees of Polynomial")
The curve shows that the cross-validation error drops sharply up to degree 4, reaches a low value at degree 5, rises again at degree 6, and then drops back; this fluctuation at higher degrees suggests overfitting.
So we conclude that degree = 5 is the optimal degree.
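To see how flat the CV curve is near its minimum, a quick sketch: the percentage by which each degree’s error exceeds the best one (small values at degrees 3–5 would further support preferring a simpler model).
# Percent by which each degree's CV error exceeds the minimum
round(100 * (cv.error - mse) / mse, 2)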
Let’s fit models of different degrees separately and run an ANOVA to compare them.
poly.fit.1 = lm(wage~poly(age, 1), data = Wage)
poly.fit.2 = lm(wage~poly(age, 2), data = Wage)
poly.fit.3 = lm(wage~poly(age, 3), data = Wage)
poly.fit.4 = lm(wage~poly(age, 4), data = Wage)
poly.fit.5 = lm(wage~poly(age, 5), data = Wage)
anova(poly.fit.1, poly.fit.2, poly.fit.3, poly.fit.4, poly.fit.5)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The 10-fold CV approach shows degree 5 having the lowest CV error and therefore the best fit. However, the ANOVA shows that the degree-3 polynomial (at a 0.05 p-value cutoff) suffices. That is also the simpler model; as we increase the degree, the chance of overfitting the data grows.
However, let’s confirm which model is best by checking the coefficient summary of the degree-5 polynomial fit.
coef(summary(poly.fit.5))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287647 153.2780243 0.000000e+00
## poly(age, 5)1 447.06785 39.9160847 11.2001930 1.491111e-28
## poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
## poly(age, 5)3 125.52169 39.9160847 3.1446392 1.679213e-03
## poly(age, 5)4 -77.91118 39.9160847 -1.9518743 5.104623e-02
## poly(age, 5)5 -35.81289 39.9160847 -0.8972045 3.696820e-01
In fact, because poly() produces orthogonal polynomials, the p-values for the degree-3, degree-4, and degree-5 coefficients are identical to the corresponding ANOVA p-values, so this summary likewise supports accepting the simpler cubic polynomial (i.e. degree = 3).
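As a sanity check (this relies on poly() producing orthogonal polynomials), squaring the t-statistics from the summary should reproduce the F column of the ANOVA table:
# Squared t-statistics; compare with the F column of the ANOVA table above
coef(summary(poly.fit.5))[-1, "t value"]^2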
Let’s make a plot of the resulting polynomial fit, i.e. the polynomial with degree = 3.
agelimits=range(Wage$age)
age.grid=seq(from=agelimits[1],to=agelimits[2])
preds = predict(poly.fit.3, newdata = list(age = age.grid), se = TRUE)
# Approximate 95% confidence bands: fit +/- 2 standard errors
se.bands = cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
par(mfrow=c(1,1),mar=c(4.5,4.5,1,1),oma=c(0,0,2,0))
plot(Wage$age,Wage$wage,xlim=agelimits,cex =.5,col="darkgrey",xlab='Age',ylab='Wage')
lines(age.grid,preds$fit,lwd=2,col="blue")
matlines(age.grid,se.bands,lwd=1,col="darkgreen",lty=3)
title("Degree-3 Polynomial")
Let’s identify the optimal number of cuts for a step function by performing cross-validation.
# set the seed
set.seed(2)
# Initialize with NA because we iterate from 2 (assuming a minimum of 2 cuts);
# we also don't want a leading 0, or which.min() would return 1
cv.error.cut <- rep(NA, 10)
for (i in 2:10) {
  Wage$age.cut <- cut(Wage$age, i)
  fit <- glm(wage ~ age.cut, data = Wage)
  cv.error.cut[i] <- cv.glm(Wage, fit, K = 10)$delta[2]
}
d.cut <- which.min(cv.error.cut)
plot(2:10, cv.error.cut[-1], xlab = "Cuts", ylab = "MSE", type = "l")
points(d.cut, cv.error.cut[d.cut], col = 'red', cex = 2, pch = 20)
title("Cross Validation Errors at different cuts")
10-fold cross-validation identifies 8 as the optimal number of cuts.
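Purely for illustration, we can inspect the age intervals that cut() creates with 8 breaks and how many observations fall into each:
# The 8 age intervals and their observation counts
table(cut(Wage$age, d.cut))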
Let’s fit the step function with 8 cuts and plot it.
step.fit = glm(wage~cut(age, d.cut), data = Wage)
step.preds = predict(step.fit , newdata=list(age = age.grid), se = TRUE)
plot(Wage$age, Wage$wage, xlim = agelimits, xlab = 'Age', ylab = 'Wage', cex = 0.5, col = 'darkgrey')
title('Step-function with 8 cuts')
lines(age.grid, step.preds$fit, lwd = 3, col = 'blue')
Now we turn to the College data set, where the goal is to predict out-of-state tuition (Outstate).
nrow(College)
## [1] 777
# Total number of predictors: total columns minus the response variable (Outstate)
p=dim(College)[2]-1
p
## [1] 17
# Split the data into a training and test data set
# set the seed
set.seed(3)
# Create an 80%/20% train/test split
split = 0.80
train <- sample(1:nrow(College), split*nrow(College))
College.train <- College[ train,]
College.test <- College[-train,]
nrow(College.test)
## [1] 156
# load leaps library
library(leaps)
# Perform forward stepwise selection on training set
# Fit the model with all predictors on training dataset and out-of-state tuition as response variable
fit.fwdstep = regsubsets(Outstate~.,data=College.train,method = 'forward',nvmax = p)
# Get the summary
fit.summary<-summary(fit.fwdstep)
# plot for RSS with minimum RSS value highlighted
plot(fit.summary$rss ,xlab="Number of Variables ",ylab="RSS",type="l")
points(which.min(fit.summary$rss),fit.summary$rss[which.min(fit.summary$rss)], col="red",cex=2,pch=20)
points(6,fit.summary$rss[6], col="Blue",cex=2,pch=20)
points(11,fit.summary$rss[11], col="Green",cex=2,pch=20)
axis(side=1,at=seq(1,p,2),labels = seq(1,p,2))
title('Residual Sum of Squares(RSS)')
The RSS plot above shows that the training RSS decreases monotonically, although the improvement is considerably smaller from the 6th predictor onward (blue dot) and minimal beyond the 11th predictor (green dot). This suggests we may need a validation approach on the test data set to decide. First, however, let’s use R-squared, adjusted R-squared, Mallows’ Cp, and BIC on the training data to determine the best number of predictors.
Let’s take the predictor count at which each statistic first comes within 0.2 standard deviations of its optimum.
# Let's create plots of R-squared, adjusted R-squared, Mallows' Cp, and BIC to pick the best subset size
par(mfrow = c(2, 2))
# plot for Rsquared with maximum R2 value highlighted and within .2 standard deviation limits
sd.r2 = sd(fit.summary$rsq)
max.r2 = max(fit.summary$rsq)
p.r2 = which.max(fit.summary$rsq)
plot(fit.summary$rsq ,xlab="Number of Variables ", ylab="RSquared",type="l", ylim = c(min(fit.summary$rsq),1))
points(p.r2,fit.summary$rsq[p.r2], col="red",cex=2,pch=20)
points(11,fit.summary$rsq[11],col="green",cex=2,pch=20)
abline(h = max.r2 - 0.2 * sd.r2, col = "red", lty = 2)
abline(h = max.r2 + 0.2 * sd.r2, col = "red", lty = 2)
axis(side=1,at=seq(1,p,2),labels = seq(1,p,2))
title('RSquared(R2)')
# plot for Adjusted Rsquared with maximum adjR2 value highlighted and within .2 standard deviation
sd.adjr2 = sd(fit.summary$adjr2)
max.adjr2 = max(fit.summary$adjr2)
p.adjr2 = which.max(fit.summary$adjr2)
plot(fit.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l",ylim=c(min(fit.summary$adjr2),1))
points(p.adjr2,fit.summary$adjr2[p.adjr2], col="red",cex=2,pch=20)
abline(h = max.adjr2 - 0.2 * sd.adjr2, col = "red", lty = 2)
abline(h = max.adjr2 + 0.2 * sd.adjr2, col = "red", lty = 2)
axis(side=1,at=seq(1,p,2),labels = seq(1,p,2))
title('Adjusted RSquare(adjR2)')
# plot for cp with minimum cp value highlighted and within .2 standard deviation
sd.cp = sd(fit.summary$cp)
min.cp = min(fit.summary$cp)
p.cp = which.min(fit.summary$cp)
plot(fit.summary$cp ,xlab="Number of Variables ",ylab="Cp", type='l')
points(p.cp,fit.summary$cp[p.cp],col="red",cex=2,pch=20)
abline(h = min.cp - 0.2 * sd.cp, col = "red", lty = 2)
abline(h = min.cp + 0.2 * sd.cp, col = "red", lty = 2)
axis(side=1,at=seq(1,p,2),labels = seq(1,p,2))
title('Mallows Cp(Cp)')
# plot for BIC with minimum BIC value highlighted and within .2 standard deviation
sd.bic = sd(fit.summary$bic)
min.bic = min(fit.summary$bic)
p.bic = which.min(fit.summary$bic)
plot(fit.summary$bic ,xlab="Number of Variables ",ylab="BIC",type='l')
points(p.bic,fit.summary$bic[p.bic],col="red",cex=2,pch=20)
abline(h = min.bic - 0.2 * sd.bic, col = "red", lty = 2)
abline(h = min.bic + 0.2 * sd.bic, col = "red", lty = 2)
axis(side=1,at=seq(1,p,2),labels = seq(1,p,2))
title('BIC(BIC)')
The charts above indicate that once we add the 6th predictor, R-squared, adjusted R-squared, Mallows’ Cp, and BIC are all within 0.2 standard deviations of their optima, so 6 predictors could serve as the best subset. However, judged purely on adjusted R-squared, Cp, and BIC from the training data, the best subset size appears to lie between 11 and 14.
Without running the validation approach on the test data set, and bearing in mind that training-set error metrics improve almost monotonically with model size, we proceed with 11 predictors as a satisfactory model for now.
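For reference, the heuristic described above can be computed directly; a minimal sketch for adjusted R-squared (the same pattern applies to the other statistics):
# Smallest subset size whose adjusted R^2 is within 0.2 SD of the maximum
min(which(fit.summary$adjr2 >= max.adjr2 - 0.2 * sd.adjr2))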
Let’s find the 11 predictors and their coefficients.
coef(fit.fwdstep,11)
## (Intercept) PrivateYes Apps Accept Top10perc
## -2170.1566337 2280.5778509 -0.1980874 0.7020644 28.3906163
## F.Undergrad Room.Board Personal Terminal perc.alumni
## -0.2199194 0.8278636 -0.1724561 30.8253124 43.6722984
## Expend Grad.Rate
## 0.1951104 22.7274480
# Load the library gam to use the smoothing spline
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.6.2
## Loaded gam 1.16.1
# Fit a GAM on the training data with 11 predictors (largely those selected in 10.a)
# Private is a qualitative variable, so we leave it as is (no spline)
set.seed(4)
gam.fit <- gam(Outstate ~ Private + s(Accept, 4) + s(F.Undergrad, 4) +
                 s(Room.Board, 3) + s(Personal, 3) + s(PhD, 3) +
                 s(Terminal, 3) + s(S.F.Ratio, 3) + s(perc.alumni, 4) +
                 s(Expend, 4) + s(Grad.Rate, 4), data = College.train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
par(mfrow=c(3,4))
plot.Gam(gam.fit, se=TRUE, col='blue')
We ran the GAM with smoothing splines on every predictor except Private, which is qualitative. The plots indicate that several of the features are close to linear and may not need spline functions.
Personal, PhD, and Expend appear to be the ones that genuinely benefit from a spline.
The rest look more linear; in fact, the fitted functions for Terminal and Room.Board look essentially linear even though we smoothed them.
We could run an ANOVA to find a better fit by comparing models with linear terms versus spline terms (a sketch of this idea appears after the final summary below).
Also, cross-validating which variables are really required could improve interpretability and reduce overfitting, since the number of observations is limited.
# set the seed
set.seed(5)
# Write a predict method for regsubsets objects (the package provides none)
predict.regsubsets = function(object, newdata, id, ...) {
  # Recover the formula from the original regsubsets() call
  form = as.formula(object$call[[2]])
  # Build the design matrix for the new data
  mat = model.matrix(form, newdata)
  # Multiply the columns matching the id-th model's coefficients
  coefi = coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}
# Let's find the test-set MSE of the original forward stepwise model (10.a) with 11 predictors
fwd.preds<-predict.regsubsets(fit.fwdstep,newdata=College.test,id=11)
fwd.mse <- mean((College.test$Outstate - fwd.preds)^2)
fwd.mse
## [1] 3736806
# Let's find the test-set MSE of the GAM model with 11 predictors (10.b)
gam.preds<-predict(gam.fit,College.test)
gam.mse <- mean((College.test$Outstate - gam.preds)^2)
gam.mse
## [1] 3266643
Now, let’s look for the best model with a validation approach, evaluating models of every subset size on the test data set to identify the optimal number of predictors.
# Build the test-set design matrix and find the subset size with the lowest validation error
test.mat = model.matrix(Outstate ~ ., data = College.test)
val.errors = rep(NA, p)
for (i in 1:p) {
  coefi = coef(fit.fwdstep, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((College.test$Outstate - pred)^2)
}
which.min(val.errors)
## [1] 6
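A quick visual check of the validation errors, mirroring the earlier CV plots:
plot(val.errors, xlab = "Number of Predictors", ylab = "Test MSE", type = "b")
points(which.min(val.errors), min(val.errors), col = "red", cex = 2, pch = 20)
title("Validation-set MSE at different subset sizes")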
Let’s identify the 6 predictors.
# Identify the predictors and the coefficients by running the model on the entire College set
fit.6 <- regsubsets(Outstate ~ ., data = College, method = "forward")
coeffs <- coef(fit.6, id = 6)
coeffs
## (Intercept) PrivateYes Room.Board PhD perc.alumni
## -3553.2345268 2768.6347025 0.9679086 35.5283359 48.4221031
## Expend Grad.Rate
## 0.2210255 29.7119093
Refitting on the full data set changes the selected predictors: PhD enters, while Apps, Accept, Top10perc, F.Undergrad, Personal, and Terminal drop out. Let’s fit a GAM on this 6-predictor list and check linearity.
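A small sketch to compare the two coefficient sets programmatically:
# In the 11-variable model but not the 6-variable one, and vice versa
setdiff(names(coef(fit.fwdstep, 11)), names(coeffs))
setdiff(names(coeffs), names(coef(fit.fwdstep, 11)))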
# GAM with 6 predictors; leave Private as is (qualitative) and don't apply a spline to it
gam.fit.6 <- gam(Outstate ~ Private + s(Room.Board, 3) + s(PhD, 3) +
                   s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4),
                 data = College.train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
par(mfrow=c(2,3))
plot.Gam(gam.fit.6, se=TRUE, col='blue')
The chart above clearly indicates that a spline is needed for ‘Expend’ and ‘Grad.Rate’; all the other predictors look close to linear.
Let’s find the MSE on the test data set.
# Let's find the test-set MSE of the GAM model with 6 predictors (10.c)
gam.preds.6<-predict(gam.fit.6,College.test)
gam.mse.6 <- mean((College.test$Outstate - gam.preds.6)^2)
gam.mse.6
## [1] 3188957
# Get all the MSEs into a matrix and print it
MSE_all <- matrix(c(fwd.mse,gam.mse,gam.mse.6))
dimnames(MSE_all) = list(c("MSE-Fwd step - 10a","MSE-GAM-10b","MSE-GAM with 6 predictors -10c"))
MSE_all
## [,1]
## MSE-Fwd step - 10a 3736806
## MSE-GAM-10b 3266643
## MSE-GAM with 6 predictors -10c 3188957
The forward stepwise fit on the training set has monotonically decreasing RSS, so we used a validation approach, running models of every subset size on the test set, to identify the best subset of predictors. The GAM plots showed that linearity does not hold for several predictors, so we fit GAMs with spline functions on all predictors except Private, and that assumption held up: the GAM with 6 predictors achieved the lowest test-set MSE, fitting the test data best.
Let’s confirm the models’ explained variability by evaluating R-squared on the test set, using the miscTools library.
library(miscTools)
# Calculate R-squared for each model's predictions on the test data set
R2_fwd <- rSquared(College.test$Outstate, resid = as.matrix(College.test$Outstate - fwd.preds))
R2_gam <- rSquared(College.test$Outstate, resid = as.matrix(College.test$Outstate - gam.preds))
R2_gam.6 <- rSquared(College.test$Outstate, resid = as.matrix(College.test$Outstate - gam.preds.6))
# Get all the RSquared into a matrix and print it
R2_all <- matrix(c(R2_fwd,R2_gam,R2_gam.6))
dimnames(R2_all) = list(c("R2-Fwd step - 10a","R2-GAM-10b","R2-GAM with 6 predictors -10c"))
R2_all
## [,1]
## R2-Fwd step - 10a 0.7829694
## R2-GAM-10b 0.8102761
## R2-GAM with 6 predictors -10c 0.8147880
Finally, we conclude that the 6-predictor GAM performed best on the test data set, and it is clear that linearity does not hold for some of the predictors. About 81.5% of the variability in Outstate is explained by this GAM fit with 6 predictors.
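For reference, the same test-set R-squared can be computed by hand as 1 - RSS/TSS, which should reproduce the rSquared() value above:
# Manual test-set R^2 for the 6-predictor GAM
1 - sum((College.test$Outstate - gam.preds.6)^2) /
  sum((College.test$Outstate - mean(College.test$Outstate))^2)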
# Let's print the summary
summary(gam.fit.6)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 3) + s(PhD,
## 3) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4),
## data = College.train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7358.0067 -1129.6487 0.8298 1236.7461 7924.4294
##
## (Dispersion Parameter for gaussian family taken to be 3566999)
##
## Null Deviance: 9873303076 on 620 degrees of freedom
## Residual Deviance: 2143763985 on 600.9993 degrees of freedom
## AIC: 11153.17
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2605271147 2605271147 730.382 < 2.2e-16 ***
## s(Room.Board, 3) 1 1957860876 1957860876 548.882 < 2.2e-16 ***
## s(PhD, 3) 1 637208618 637208618 178.640 < 2.2e-16 ***
## s(perc.alumni, 4) 1 398355862 398355862 111.678 < 2.2e-16 ***
## s(Expend, 4) 1 783088579 783088579 219.537 < 2.2e-16 ***
## s(Grad.Rate, 4) 1 140623891 140623891 39.424 6.541e-10 ***
## Residuals 601 2143763985 3566999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, 3) 2 2.632 0.07279 .
## s(PhD, 3) 2 1.568 0.20938
## s(perc.alumni, 4) 3 1.416 0.23702
## s(Expend, 4) 3 33.973 < 2e-16 ***
## s(Grad.Rate, 4) 3 2.179 0.08942 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA for nonparametric effects shows strong evidence of a non-linear relationship between ‘Outstate’ and ‘Expend’, which is also evident in the GAM plot. ‘Room.Board’ and ‘Grad.Rate’ are the next closest (at a 0.1 p-value cutoff).
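As a follow-up sketch implementing the ANOVA comparison suggested earlier (gam.lin.expend is an illustrative name), we could refit with Expend entering linearly and test it against the spline version; a small p-value would support keeping the spline:
# Same 6-predictor GAM, but with a linear term for Expend
gam.lin.expend <- gam(Outstate ~ Private + s(Room.Board, 3) + s(PhD, 3) +
                        s(perc.alumni, 4) + Expend + s(Grad.Rate, 4),
                      data = College.train)
anova(gam.lin.expend, gam.fit.6, test = "F")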