library(ISLR)
data(College)
set.seed(12345)
train <- sample(1:dim(College)[1], dim(College)[1] / 2)
test <- -train
College.train <- College[train, ]
College.test <- College[test, ]
linear_model <- lm(Apps ~ ., data = College.train)
predicted_lm <- predict(linear_model, College.test)
mean((predicted_lm - College.test$Apps)^2) # Test MSE
## [1] 1235794
require(glmnet)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
training_matrix <- model.matrix(Apps ~ ., data = College.train) # keeps an intercept column, which appears as the zeroed second "(Intercept)" row in the lasso coefficients below
testing_matrix <- model.matrix(Apps ~ ., data = College.test)
grid <- 10 ^ seq(4, -2, length = 100)
fit.ridge <- glmnet(training_matrix, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge <- cv.glmnet(training_matrix, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
best_lambda_ridge <- cv.ridge$lambda.min
best_lambda_ridge
## [1] 14.17474
predicted_ridge <- predict(fit.ridge, s = best_lambda_ridge, newx = testing_matrix)
mean((predicted_ridge - College.test$Apps)^2)
## [1] 1278412
fit.lasso <- glmnet(training_matrix, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso <- cv.glmnet(training_matrix, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
best_lambda_lasso <- cv.lasso$lambda.min
best_lambda_lasso
## [1] 14.17474
predicted_lasso <- predict(fit.lasso, s = best_lambda_lasso, newx = testing_matrix)
mean((predicted_lasso - College.test$Apps)^2)
## [1] 1285065
predict(fit.lasso, s = best_lambda_lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -919.97020748
## (Intercept) .
## PrivateYes -329.91743670
## Accept 1.31238291
## Enroll .
## Top10perc 39.11815643
## Top25perc -11.04449345
## F.Undergrad 0.01474903
## P.Undergrad 0.01674164
## Outstate -0.05725272
## Room.Board 0.12456711
## Books .
## Personal .
## PhD -8.17377950
## Terminal -1.13565234
## S.F.Ratio 28.57739169
## perc.alumni -1.78706262
## Expend 0.10784655
## Grad.Rate 4.95213901
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
fit.pcr <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pcr, val.type = "MSEP")
predicted_pcr <- predict(fit.pcr, College.test, ncomp = 10) # using 10 components (see validation plot)
mean((predicted_pcr - College.test$Apps)^2)
## [1] 2709596
fit.pls <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pls, val.type = "MSEP")
predicted_pls <- predict(fit.pls, College.test, ncomp = 10) # using 10 components (see validation plot)
mean((predicted_pls - College.test$Apps)^2)
## [1] 1259294
Linear Model: The tradeoff between bias and variance is easily seen by decomposing the mean squared error (MSE). A smaller test MSE means better prediction of new values, and the linear model's test MSE is smaller than most of the other models' (except for PLS). This alone does not mean the linear model is the best model.
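For reference, the standard decomposition (general notation, not tied to this particular fit): for a new observation $y_0 = f(x_0) + \varepsilon$,

$$E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Bias}\big[\hat{f}(x_0)\big]^2 + \mathrm{Var}\big[\hat{f}(x_0)\big] + \mathrm{Var}(\varepsilon),$$

so a small test MSE can reflect low bias, low variance, or both.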
The results for lasso, ridge, and least squares are about the same. Lasso shrinks the coefficients of "Enroll", "Books", and "Personal" exactly to zero and shrinks the remaining coefficients toward zero.
However, when we calculate the test R^2 values, as shown below, we see that PCR has the smallest test R^2.
From this we can conclude that all models except PCR predict the number of college applications with fairly good accuracy.
test.avg <- mean(College.test[, "Apps"])
1 - mean((College.test[, "Apps"] - predicted_lm)^2) / mean((College.test[, "Apps"] - test.avg)^2)
## [1] 0.9280444
1 - mean((College.test[, "Apps"] - predicted_ridge)^2) / mean((College.test[, "Apps"] - test.avg)^2)
## [1] 0.9255629
1 - mean((College.test[, "Apps"] - predicted_lasso)^2) / mean((College.test[, "Apps"] - test.avg)^2)
## [1] 0.9251756
1 - mean((College.test[, "Apps"] - predicted_pcr)^2) / mean((College.test[, "Apps"] - test.avg)^2)
## [1] 0.8422305
1 - mean((College.test[, "Apps"] - predicted_pls)^2) / mean((College.test[, "Apps"] - test.avg)^2)
## [1] 0.9266761
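To see all five models side by side, the same computation can be wrapped in a small helper (defined here for convenience; r2 is not a package function):

r2 <- function(pred) {
  1 - mean((College.test$Apps - pred)^2) / mean((College.test$Apps - test.avg)^2)
}
c(lm = r2(predicted_lm), ridge = r2(predicted_ridge), lasso = r2(predicted_lasso),
  pcr = r2(predicted_pcr), pls = r2(predicted_pls))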
require(leaps)
## Loading required package: leaps
college <- College # same data as data(College) above; originally read from a local College.csv
set.seed(1234)
rows_for_testing <- sort(sample(1:nrow(college), 100))
test <- college[rows_for_testing, ]   # the testing set
train <- college[-rows_for_testing, ] # the training set
fwd_college <- regsubsets(Outstate ~ ., data = train, method = "forward")
fwd_college
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = train, method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
reg.summary <- summary(fwd_college)
reg.summary$bic
## [1] -384.2843 -595.2161 -724.4758 -801.8278 -847.3563 -870.3018 -872.1499
## [8] -870.1107
The BIC is minimized at the seven-variable model, but the six-variable model comes within two BIC units of it, so we keep its six predictors:
vars <- c("Private", "Room.Board", "Terminal", "perc.alumni", "Expend", "Grad.Rate")
vars
## [1] "Private" "Room.Board" "Terminal" "perc.alumni" "Expend"
## [6] "Grad.Rate"
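For reference, the minimizing size and the selected coefficients can be read directly off the fit (a quick check using the standard leaps accessors):

which.min(reg.summary$bic) # model size at which BIC is minimized
coef(fwd_college, id = 6)  # coefficients of the six-variable model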
As the degrees of freedom increase, the fitted curve becomes more "jagged" and "wiggly". The short sketch below illustrates this effect on a single predictor; the GAM that follows it was chosen after experimenting with different splines and degrees of freedom.
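A minimal illustration of that effect (a sketch only, using base R's smooth.spline on a single predictor rather than the gam package's smoothers):

fit_df2 <- smooth.spline(train$Expend, train$Outstate, df = 2)   # smooth, low-df fit
fit_df10 <- smooth.spline(train$Expend, train$Outstate, df = 10) # wigglier, high-df fit
plot(train$Expend, train$Outstate, col = "grey", xlab = "Expend", ylab = "Outstate")
lines(fit_df2, col = "blue", lwd = 2)
lines(fit_df10, col = "red", lwd = 2)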
require(gam)
## Loading required package: gam
## Loading required package: splines
## Loaded gam 1.14
gam_outstate_train <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(Terminal, df = 3) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2), data = train)
plot.gam(gam_outstate_train, col = "purple")
We evaluate the model on the test set by comparing the predicted values with the observed test values. Plotting one against the other, we expect the points to scatter around the 45-degree line (where predicted equals observed).
Looking at the graph, the points do in fact fall along the red 45-degree line, suggesting that the predictions represent the test set well and that the model is decent.
test$y_hat <- predict(gam_outstate_train, newdata = test)
plot(test$Outstate, test$y_hat) # points should lie along the 45-degree line
abline(a = 0, b = 1, col = "red")
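Beyond the visual check, the fit can be summarized numerically (a small addition using the predictions just computed; the test R^2 here is relative to a mean-only model):

gam_mse <- mean((test$Outstate - test$y_hat)^2) # test MSE for the GAM
gam_mse
1 - gam_mse / mean((test$Outstate - mean(test$Outstate))^2) # test R^2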
Looking at the plots below and at the summary of the GAM fitted to out-of-state tuition, the non-parametric ANOVA test shows strong evidence of a non-linear relationship between out-of-state tuition and Expend. The plot of Expend against the predicted values tells the same story: it is clearly not linear. Thus Expend is a variable with a non-linear relationship with the response (a formal comparison against a linear Expend term appears after the summary).
plot(test$Room.Board, test$y_hat)
plot(test$Expend, test$y_hat)
plot(test$Terminal, test$y_hat)
plot(test$perc.alumni, test$y_hat)
plot(test$Private, test$y_hat)
plot(test$Grad.Rate, test$y_hat)
summary(gam_outstate_train)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(Terminal,
## df = 3) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate,
## df = 2), data = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7276.9 -1097.8 76.9 1282.5 7783.6
##
## (Dispersion Parameter for gaussian family taken to be 3475052)
##
## Null Deviance: 10732227329 on 676 degrees of freedom
## Residual Deviance: 2297010383 on 661.0002 degrees of freedom
## AIC: 12135.43
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2877118432 2877118432 827.935 < 2.2e-16 ***
## s(Room.Board, df = 2) 1 1986187514 1986187514 571.556 < 2.2e-16 ***
## s(Terminal, df = 3) 1 730539699 730539699 210.224 < 2.2e-16 ***
## s(perc.alumni, df = 2) 1 427230802 427230802 122.942 < 2.2e-16 ***
## s(Expend, df = 5) 1 927305319 927305319 266.846 < 2.2e-16 ***
## s(Grad.Rate, df = 2) 1 159368765 159368765 45.861 2.808e-11 ***
## Residuals 661 2297010383 3475052
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, df = 2) 1 3.4080 0.06533 .
## s(Terminal, df = 3) 2 1.1568 0.31512
## s(perc.alumni, df = 2) 1 0.4664 0.49486
## s(Expend, df = 5) 4 31.1230 < 2e-16 ***
## s(Grad.Rate, df = 2) 1 1.6078 0.20525
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
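As a further check on the Expend nonlinearity (a sketch, not part of the original analysis): refit the GAM with Expend entered as a plain linear term, then compare the nested fits with an F test; a small p-value favors the smooth term.

gam_linear_expend <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(Terminal, df = 3) + s(perc.alumni, df = 2) + Expend + s(Grad.Rate, df = 2), data = train)
anova(gam_linear_expend, gam_outstate_train, test = "F")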