R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

# Load Libraries
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.3
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(pls)
## Warning: package 'pls' was built under R version 4.3.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
library(leaps)
## Warning: package 'leaps' was built under R version 4.3.3
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
## 
##     Boston
# Question 2
# For parts (a) through (c), indicate which of i. through iv. is correct.
# Justify your answer.
# (a) The lasso, relative to least squares, is:
# i. More flexible and hence will give improved prediction accuracy
# when its increase in bias is less than its decrease in
# variance.
# ii. More flexible and hence will give improved prediction accuracy
# when its increase in variance is less than its decrease
# in bias.
# iii. Less flexible and hence will give improved prediction
# accuracy
# when its increase in bias is less than its decrease in
# variance.
# iv. Less flexible and hence will give improved prediction accuracy
# when its increase in variance is less than its decrease
# in bias.
# Question 2 part (a)
# The lasso, relative to least squares, is: (iii)
# Less flexible and hence will give improved prediction accuracy.
# The lasso is able to reduce model flexibility by shrinking 
# coefficient estimates and even sets some estimates to zero,
# which reduces variance. Even though variance is reduced this method may
# increase bias due to underfitting. Overall the lasso can improve prediction
# accuracy provided that the reduction in the variances offsets the increase 
# in bias. 
# Question 2 part (b) Repeat (a) for ridge regression relative to least squares.
# The ridge regression relative to least squares is: (iii)
# Less flexible and hence will give improved prediction accuracy.
# Ridge regression is similar to lasso but it instead shrinks coefficients 
# toward zero value but they key difference is that it doesn't set coefficients
# to exactly zero. Ridge regression is less flexible due to penalization of 
# large coeffiencts, which causes reduction in variance and increase in bias.
# Ridge regression can improve prediction accuracy provided that the reduction 
# in the variances offsets the increase in bias. 
# Question 2 part (c) Repeat (a) for non-linear methods relative to least 
# squares.
# For non-linear methods relative to least squares is : (ii)
# More flexible and hence will give improved prediction accuracy when its 
# increase in variance is less than its decrease in bias.
# Non-linear models compared to least squares are more flexible and they
# are also less biased as well. 
# Question 9 
# In this exercise, we will predict the number of applications received
# using the other variables in the College data set.
attach(College)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
# Question 9 (a) Split the data set into a training set and a test set.

x = model.matrix(Apps~.,College)[,-1]
y = College$Apps
set.seed(10)
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
College_train = College[train, ]
College_test = College[test, ]
y_test = y[test]
# Question 9 (b) Fit a linear model using least squares on the training set,
# and report the test error obtained.

college_ls_fit = lm(Apps~., data=College, subset=train)
summary(college_ls_fit)
## 
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5139.5  -473.3   -21.1   353.2  7402.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -629.36179  639.35741  -0.984 0.325579    
## PrivateYes  -647.56836  192.17056  -3.370 0.000832 ***
## Accept         1.68912    0.05038  33.530  < 2e-16 ***
## Enroll        -1.02383    0.27721  -3.693 0.000255 ***
## Top10perc     48.19124    8.10714   5.944 6.42e-09 ***
## Top25perc    -10.51538    6.44952  -1.630 0.103865    
## F.Undergrad    0.01992    0.05364   0.371 0.710574    
## P.Undergrad    0.04213    0.05348   0.788 0.431373    
## Outstate      -0.09489    0.02674  -3.549 0.000436 ***
## Room.Board     0.14549    0.07243   2.009 0.045277 *  
## Books          0.06660    0.31115   0.214 0.830623    
## Personal       0.05663    0.09453   0.599 0.549475    
## PhD          -10.11489    7.11588  -1.421 0.156027    
## Terminal      -2.29300    8.03546  -0.285 0.775528    
## S.F.Ratio     22.07117   18.70991   1.180 0.238897    
## perc.alumni    2.08121    6.00673   0.346 0.729179    
## Expend         0.07654    0.01672   4.577 6.45e-06 ***
## Grad.Rate      9.99706    4.49821   2.222 0.026857 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1092 on 370 degrees of freedom
## Multiple R-squared:  0.9395, Adjusted R-squared:  0.9367 
## F-statistic:   338 on 17 and 370 DF,  p-value: < 2.2e-16
pred_app = predict(college_ls_fit, College_test)
lmtest_error = mean((College_test$Apps-pred_app)^2)
lmtest_error
## [1] 1020100
# The test error (MSE) is 1020100
# Question 9 (c) Fit a ridge regression model on the training set, 
# with lambda chosen by cross-validation. Report the test error obtained.


grid = 10^seq(10,-2,length=100)
ridge_mod = glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge_mod)
##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric
cv_college_out = cv.glmnet(x[train,],y[train] ,alpha=0)
best_lambda = cv_college_out$lambda.min
best_lambda
## [1] 411.3927
ridge_pred=predict(ridge_mod, s=best_lambda ,newx=x[test,])
ridge_error = mean((ridge_pred - y_test)^2)
print(ridge_error)
## [1] 985020.1
# The test error (MSE) for the ridge model is 985020.1
# Question 9 (d) Fit a lasso model on the training set, with lambda
# chosen by cross validation. Report the test error obtained, along with
# the number of non-zero coefficient estimates. 
lasso_mod = glmnet(x[train,], y[train], alpha=1, lambda=grid)
summary(lasso_mod)
##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric
cv_out = cv.glmnet(x[train,],y[train],alpha=1)
best_lam2 = cv_out$lambda.min
best_lam2
## [1] 24.66235
lasso_pred = predict(lasso_mod,s=best_lam2, newx=x[test,])
lasso_error = mean((lasso_pred - y_test)^2)
print(lasso_error)
## [1] 1008145
# Number of non-zero coefficeint estimates
out = glmnet(x,y,alpha=1,lambda = grid)
lasso_coef = predict(out,type="coefficients",s=best_lam2)[1:18,]
lasso_coef[lasso_coef!=0]
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -6.324960e+02 -4.087012e+02  1.436837e+00 -1.410240e-01  3.143012e+01 
##     Top25perc   P.Undergrad      Outstate    Room.Board      Personal 
## -8.606536e-01  1.480293e-02 -5.342495e-02  1.205819e-01  4.379135e-05 
##           PhD      Terminal     S.F.Ratio   perc.alumni        Expend 
## -5.121245e+00 -3.371192e+00  2.717231e+00 -1.039648e+00  6.838161e-02 
##     Grad.Rate 
##  4.700317e+00
# The test error (MSE) for the lasso model is 1008145
# Question 9 (e) Fit a PCR model on the training set, with M chosen by
# cross validation. Report the test error obtained, along with the value of M
# selected by cross-validation.

pcr_college = pcr(Apps~., data=College_train , scale=TRUE, validation="CV")
summary(pcr_college)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4347     4345     2371     2391     2104     1949     1898
## adjCV         4347     4345     2368     2396     2085     1939     1891
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1899     1880     1864      1861      1870      1873      1891
## adjCV     1893     1862     1857      1853      1862      1865      1885
##        14 comps  15 comps  16 comps  17 comps
## CV         1903      1727      1295      1260
## adjCV      1975      1669      1283      1249
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     32.6794    56.94    64.38    70.61    76.27    80.97    84.48    87.54
## Apps   0.9148    71.17    71.36    79.85    81.49    82.73    82.79    83.70
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.50     92.89     94.96     96.81     97.97     98.73     99.39
## Apps    83.86     84.08     84.11     84.11     84.16     84.28     93.08
##       16 comps  17 comps
## X        99.86    100.00
## Apps     93.71     93.95
validationplot(pcr_college, val.type="MSEP")

pcr_pred = predict(pcr_college,x[test,], ncomp = 10)
pcr_error = mean((pcr_pred - y_test)^2)
print(pcr_error)
## [1] 1422699
# The test error (MSE) is 1008145 with a M value of 10
# chosen. M value of 10 was chosen due to the low CV and high variance. 
# Question 9 (f) Fit a PLS model on the training set, with M chosen by
# cross validation. Report the test error obtained, along with the value of M
# selected by cross-validation.
pls_college = plsr(Apps~., data = College_train,scale=TRUE, validation="CV")
validationplot(pls_college, val.type="MSEP")

summary(pls_college)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4347     2178     1872     1734     1615     1453     1359
## adjCV         4347     2171     1867     1726     1586     1427     1341
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1347     1340     1329      1317      1310      1305      1305
## adjCV     1330     1324     1314      1302      1296      1291      1291
##        14 comps  15 comps  16 comps  17 comps
## CV         1305      1307      1307      1307
## adjCV      1291      1292      1293      1293
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       24.27    38.72    62.64    65.26    69.01    73.96    78.86    82.18
## Apps    76.96    84.31    86.80    91.48    93.37    93.75    93.81    93.84
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       85.35     87.42     89.18     91.41     92.70     94.58     97.16
## Apps    93.88     93.91     93.93     93.94     93.95     93.95     93.95
##       16 comps  17 comps
## X        98.15    100.00
## Apps     93.95     93.95
pls_pred = predict(pls_college,x[test,], ncomp=12)
pls_error = mean((pls_pred - y_test)^2)
print(pls_error)
## [1] 1019435
# The test error (MSE) is 1019435 with a M value of 10
# chosen. M value of 12 was chosen due to the low CV.  
# (g) Comment on the results obtained. How accurately can we predict
# the number of college applications received? Is there much
# difference among the test errors resulting from these five approaches?

errors = c(lmtest_error, ridge_error, lasso_error, pcr_error, pls_error)
names(errors) <- c("lm MSE", "ridge MSE", "lasso MSE", "pcr MSE", "pls MSE")
print(sort(errors))
## ridge MSE lasso MSE   pls MSE    lm MSE   pcr MSE 
##  985020.1 1008144.7 1019435.4 1020099.8 1422698.9
sum_squares = sum((mean(College_test$Apps) - College_test$Apps)^2)
sum_residuals = sum((ridge_pred - College_test$Apps)^2)
ridge_rsquare = 1 - (sum_residuals)/(sum_squares)
print(ridge_rsquare)
## [1] 0.9107905
# The ridge model has the lowest MSE (test error) of the five approaches. 
# The test errors of lasso, pls, and linear are very similar to each other, 
# and then there is a stark contrast with the pcr being the highest test error
# compared to all the other models and thus it is the worst. 
# The ridge model will be selected as a result of lowest MSE. 
# Based on the r-square value of the ridge, it can predict 91.1% of the variance
# of 'apps' (college applications). 
# Need to detach dataset 
detach(College)
# Question 11 We will now try to predict per capita crime rate in the
# Boston data set.
attach(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Question 11 Part a 
# Try out some of the regression methods explored in this chapter,
# such as best subset selection, the lasso, ridge regression, and
# PCR. Present and discuss results for the approaches that you consider.

# Subset selection 
set.seed(7)
predict_regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}

k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict_regsubsets(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
mean_cv_errors = apply(cv.errors, 2, mean)
plot(mean_cv_errors, type = "b", xlab = "Number of variables", ylab = "CV error")

which.min(mean_cv_errors)
## [1] 13
# Subset MSE
mean_cv_errors[which.min(mean_cv_errors)]
## [1] 43.26558
# Lasso
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv_lasso = cv.glmnet(x, y, type.measure = "mse")
plot(cv_lasso)

coef(cv_lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                    s1
## (Intercept) 1.7799525
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.1920089
## tax         .        
## ptratio     .        
## black       .        
## lstat       .        
## medv        .
# Lasso MSE
sqrt(cv_lasso$cvm[cv_lasso$lambda == cv_lasso$lambda.1se])
## [1] 7.765238
# Ridge 
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv_ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv_ridge)

coef(cv_ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  1.523899542
## zn          -0.002949852
## indus        0.029276741
## chas        -0.166526007
## nox          1.874769665
## rm          -0.142852604
## age          0.006207995
## dis         -0.094547258
## rad          0.045932737
## tax          0.002086668
## ptratio      0.071258052
## black       -0.002605281
## lstat        0.035745604
## medv        -0.023480540
# Ridge MSE
sqrt(cv_ridge$cvm[cv_ridge$lambda == cv_ridge$lambda.1se])
## [1] 7.659135
# PCR
pcr_crime = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr_crime)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.215    7.218    6.784    6.781    6.785    6.802
## adjCV         8.61    7.212    7.215    6.778    6.774    6.780    6.795
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.799    6.667    6.707     6.699     6.710     6.675     6.600
## adjCV    6.791    6.658    6.698     6.689     6.698     6.661     6.586
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
# Propose a model (or set of models) that seem to perform well on
# this data set, and justify your answer. Make sure that you are
# evaluating model performance using validation set error, cross validation,
# or some other reasonable alternative, as opposed to
# using training error.

# See above 
# Looking at the MSE, the ridge had the lowest cross 
# validation and thus it is the optimal model to go on. 
# Part c
# Does your chosen model involve all of the features in the data
# set? Why or why not?
# Ridge model doesn't have specific variable selection process and includes all 
# predictors ant thus all features where used.