Chapter 6 Problem 4

  1. As we increase \(\lambda\) from 0, the training RSS will steadily increase: the penalty term grows with \(\lambda\) (see the ridge criterion after this list), so the coefficient estimates are shrunk away from their least squares values, and the fit to the training data can only get worse.
  1. As we increase \(\lambda\) from 0, the test RSS follows a U shape: it initially decreases, because restricting the \(\beta_j\) coefficients reduces variance faster than it adds bias, and then increases once the model becomes too inflexible. The two ends of the "U" correspond to overfitting and underfitting the data.
  1. As we increase \(\lambda\) from 0, the variance will steadily decrease: the growing penalty shrinks the coefficients toward 0, making the model less flexible and less sensitive to the particular training sample.
  1. As we increase \(\lambda\) from 0, the squared bias will steadily increase: bias and variance have an inverse relationship here, so as the shrinking coefficients drive the variance down, the bias grows.
  1. By definition, the irreducible error is independent of the model; therefore it is independent of the value of \(\lambda\) and remains constant.
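
All of these answers refer to the penalized criterion in this problem, where the penalty term grows with \(\lambda\):

\[
\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \;=\; \text{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2.
\]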

Chapter 6 Question 9

library(ISLR)
data(College)
set.seed(12345)
train <- sample(1:nrow(College), nrow(College) / 2)  # random half of the rows for training
test <- -train
College.train <- College[train, ]
College.test <- College[test, ]
linear_model <- lm(Apps ~ ., data = College.train)  # least squares with all predictors
predicted_lm <- predict(linear_model, College.test)
mean((predicted_lm - College.test$Apps)^2)  # test MSE
## [1] 1235794
require(glmnet)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
training_matrix <- model.matrix(Apps ~ ., data = College.train)  # design matrices for glmnet
testing_matrix <- model.matrix(Apps ~ ., data = College.test)

grid <- 10 ^ seq(4, -2, length = 100)  # lambda grid from 10^4 down to 10^-2
fit.ridge <- glmnet(training_matrix, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)  # alpha = 0: ridge
cv.ridge <- cv.glmnet(training_matrix, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
best_lambda_ridge <- cv.ridge$lambda.min
best_lambda_ridge
## [1] 14.17474
predicted_ridge <- predict(fit.ridge, s = best_lambda_ridge, newx = testing_matrix)
mean((predicted_ridge - College.test$Apps)^2)
## [1] 1278412
fit.lasso <- glmnet(training_matrix, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)  # alpha = 1: lasso
cv.lasso <- cv.glmnet(training_matrix, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
best_lambda_lasso <- cv.lasso$lambda.min
best_lambda_lasso
## [1] 14.17474
predicted_lasso <- predict(fit.lasso, s = best_lambda_lasso, newx = testing_matrix)
mean((predicted_lasso - College.test$Apps)^2)
## [1] 1285065
predict(fit.lasso, s = best_lambda_lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -919.97020748
## (Intercept)    .         
## PrivateYes  -329.91743670
## Accept         1.31238291
## Enroll         .         
## Top10perc     39.11815643
## Top25perc    -11.04449345
## F.Undergrad    0.01474903
## P.Undergrad    0.01674164
## Outstate      -0.05725272
## Room.Board     0.12456711
## Books          .         
## Personal       .         
## PhD           -8.17377950
## Terminal      -1.13565234
## S.F.Ratio     28.57739169
## perc.alumni   -1.78706262
## Expend         0.10784655
## Grad.Rate      4.95213901
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
fit.pcr <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")  # PCR on standardized predictors
validationplot(fit.pcr, val.type = "MSEP")

predicted_pcr <- predict(fit.pcr, College.test, ncomp = 10)  # predict with 10 components
mean((predicted_pcr - College.test$Apps)^2)
## [1] 2709596
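
The validation plot levels off around 10 components, which is why ncomp = 10 is used above. As a quick check (a sketch, not part of the run above), the cross-validated numbers behind the plot can also be printed directly:

MSEP(fit.pcr, estimate = "CV")  # CV estimate of MSEP for each number of components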
fit.pls <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")  # partial least squares
validationplot(fit.pls, val.type = "MSEP")

predicted_pls <- predict(fit.pls, College.test, ncomp = 10)  # again 10 components, from the validation plot
mean((predicted_pls - College.test$Apps)^2)
## [1] 1259294

Linear Model: the tradeoff between bias and variance is easily seen by decomposing the mean squared error (MSE), as shown below. A smaller test MSE means better prediction of new values, and the linear model's test MSE is smaller than most of the others (all except PLS). This does not mean that the linear model is the best model.
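
Concretely, for a test observation \((x_0, y_0)\), the expected test MSE decomposes into variance, squared bias, and irreducible error:

\[
E\big(y_0 - \hat{f}(x_0)\big)^2 = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\epsilon).
\]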

The results for lasso, ridge, and least squares are about the same. The lasso drives the coefficients of Enroll, Books, and Personal exactly to zero (the "." entries in the coefficient table above) and shrinks the coefficients of the other variables.
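
The exact zeros are a consequence of the lasso's \(\ell_1\) penalty, which, unlike the ridge \(\ell_2\) penalty, can shrink coefficients all the way to zero. The lasso minimizes

\[
\text{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|.
\]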

However, when we calculate the test \(R^2\) values, as shown below, we see that PCR has the smallest test \(R^2\).

From this we can conclude that all models except PCR predict the number of college applications with reasonably good accuracy.
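
Here each test \(R^2\) is computed against a baseline that always predicts the mean of the test responses:

\[
R^2_{\text{test}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y}_{\text{test}})^2}.
\]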

test.avg <- mean(College.test[, "Apps"])
1 - mean((College.test[, "Apps"] - predicted_lm)^2) / mean((College.test[, "Apps"] - test.avg)^2)
## [1] 0.9280444
1 - mean((College.test[, "Apps"] - predicted_ridge)^2) / mean((College.test[, "Apps"] - test.avg)^2)
## [1] 0.9255629
1 - mean((College.test[, "Apps"] - predicted_lasso)^2) / mean((College.test[, "Apps"] - test.avg)^2)
## [1] 0.9251756
1 - mean((College.test[, "Apps"] - data.frame(predicted_pcr))^2) / mean((College.test[, "Apps"] - test.avg)^2)
## [1] 0.8422305
1 - mean((College.test[, "Apps"] - data.frame(predicted_pls))^2) / mean((College.test[, "Apps"] - test.avg)^2)
## [1] 0.9266761

Chapter 7 Problem 10

Part A.

require(leaps)
## Loading required package: leaps
college <- read.csv("C:/Users/Kajal/Downloads/College.csv")  # local copy of the ISLR College data
set.seed(1234)
rows_for_testing <- sort(sample(1:nrow(college), 100))
test <- college[rows_for_testing, ]    # testing set
train <- college[-rows_for_testing, ]  # training set

fwd_college <- regsubsets(Outstate ~ ., data = train, method = "forward")
fwd_college
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = train, method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
reg.summary <- summary(fwd_college)
reg.summary$bic
## [1] -384.2843 -595.2161 -724.4758 -801.8278 -847.3563 -870.3018 -872.1499
## [8] -870.1107
# BIC bottoms out at seven predictors, but the improvement over six is negligible, so keep six
vars <- c("Private", "Room.Board", "Terminal", "perc.alumni", "Expend", "Grad.Rate")
vars
## [1] "Private"     "Room.Board"  "Terminal"    "perc.alumni" "Expend"     
## [6] "Grad.Rate"
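
These six variables can also be read directly off the regsubsets fit; a quick sketch using the leaps accessors:

which.min(reg.summary$bic)  # subset size with the lowest BIC (7, only marginally below size 6)
coef(fwd_college, 6)        # coefficients of the six-variable model used below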

Part B.

As the degrees of freedom increase, the fitted curves become more "jagged" and "wiggly". After experimenting with the different splines and degrees of freedom, the GAM I fit gave these results:

  • Private: there is no smooth to plot for Private because it is a factor (categorical) variable; it enters the model as a constant shift between private and public schools.
  • Room.Board: with two df, Room.Board has a clear, nearly linear positive relationship with out-of-state tuition.
  • Terminal: with three df, out-of-state tuition increases as Terminal increases, though the fit is not as straight a line as the one for Room.Board.
  • perc.alumni: with two df, similar to Room.Board, out-of-state tuition increases as the percentage of alumni who donate increases.
  • Expend: this graph is the most interesting. It rises quickly and then gradually levels off; it is almost quadratic, but less dramatic, with the right tail leveling off instead of decreasing. So as expenditures increase, out-of-state tuition increases up to a point and then levels off. With more than five degrees of freedom, the curve becomes extremely wiggly and does not support any conclusions.
  • Grad.Rate: with two df, similar to Room.Board and perc.alumni, out-of-state tuition increases as the graduation rate increases; another unsurprising, roughly linear fit.
require(gam)
## Loading required package: gam
## Loading required package: splines
## Loaded gam 1.14
gam_outstate_train <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(Terminal, df = 3) +
    s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2), data = train)

plot(gam_outstate_train, col = "purple")

Part C.

We evaluate the model on the test set by comparing the predicted values with the observed test values. We plot predicted against observed and expect the points to fall along the 45-degree line (where the predicted values match the test values).

Looking at the graph, the points do in fact fall along the red 45-degree line, suggesting that the predictions represent the test set well and the model is decent.

test$y_hat <- predict(gam_outstate_train, newdata = test)

plot(test$Outstate, test$y_hat)  # should fall along the 45-degree line
abline(a = 0, b = 1, col = "red")
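
Beyond the eyeball check, the GAM can be scored the same way as the Chapter 6 models (a quick sketch, not run above):

gam_mse <- mean((test$Outstate - test$y_hat)^2)              # test MSE
1 - gam_mse / mean((test$Outstate - mean(test$Outstate))^2)  # test R^2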

Part D.

Looking at the plots of each predictor and the summary of the GAM fitted to out-of-state tuition, the nonparametric ANOVA test shows strong evidence of a non-linear relationship between out-of-state tuition and Expend (p < 2e-16), and no significant non-linearity for the other smooth terms. The plots tell the same story: the plot of test Expend against the predicted values is clearly not linear, unlike those for the other predictors. Thus, Expend is the variable with a non-linear relationship with the response.

plot(test$Room.Board, test$y_hat)

plot(test$Expend, test$y_hat)

plot(test$Terminal, test$y_hat)

plot(test$perc.alumni, test$y_hat)

plot(test$Private, test$y_hat)

plot(test$Grad.Rate, test$y_hat)

summary(gam_outstate_train)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(Terminal, 
##     df = 3) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, 
##     df = 2), data = train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -7276.9 -1097.8    76.9  1282.5  7783.6 
## 
## (Dispersion Parameter for gaussian family taken to be 3475052)
## 
##     Null Deviance: 10732227329 on 676 degrees of freedom
## Residual Deviance: 2297010383 on 661.0002 degrees of freedom
## AIC: 12135.43 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private                  1 2877118432 2877118432 827.935 < 2.2e-16 ***
## s(Room.Board, df = 2)    1 1986187514 1986187514 571.556 < 2.2e-16 ***
## s(Terminal, df = 3)      1  730539699  730539699 210.224 < 2.2e-16 ***
## s(perc.alumni, df = 2)   1  427230802  427230802 122.942 < 2.2e-16 ***
## s(Expend, df = 5)        1  927305319  927305319 266.846 < 2.2e-16 ***
## s(Grad.Rate, df = 2)     1  159368765  159368765  45.861 2.808e-11 ***
## Residuals              661 2297010383    3475052                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df  Npar F   Pr(F)    
## (Intercept)                                       
## Private                                           
## s(Room.Board, df = 2)        1  3.4080 0.06533 .  
## s(Terminal, df = 3)          2  1.1568 0.31512    
## s(perc.alumni, df = 2)       1  0.4664 0.49486    
## s(Expend, df = 5)            4 31.1230 < 2e-16 ***
## s(Grad.Rate, df = 2)         1  1.6078 0.20525    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
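
The nonparametric ANOVA above singles out Expend. A direct model comparison makes the same point; this is a sketch (not run above, and gam_expend_linear is a name introduced here) comparing the fitted GAM against one that treats Expend linearly:

gam_expend_linear <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(Terminal, df = 3) +
    s(perc.alumni, df = 2) + Expend + s(Grad.Rate, df = 2), data = train)
anova(gam_expend_linear, gam_outstate_train, test = "F")  # a small p-value favors the spline in Expend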