1. Random Splitting (Holdout Sample)

1.1 Using the same Salaries{car} data set, load and attach the data set (i.e., attach(Salaries)) into your work environment.

library(carData)
data("Salaries")
attach(Salaries) 

1.2 Enter set.seed(15) so that you get the same results if you run your cross validation commands multiple times. Then create an index vector called “train” which you can use to split the data set into 80% train and 20% test subsets.

set.seed(15)
train = sample(nrow(Salaries),0.80*nrow(Salaries))
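As a quick sanity check (a sketch, not required by the assignment), the split sizes can be verified directly; sample() truncates 0.80*397 to 317 training rows:

length(train)                    # 317 training rows
nrow(Salaries) - length(train)   # 80 test rows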

1.3 Fit a linear model to predict salary using all remaining variables as predictors, using the train data subset. Store your resulting model in an object named fit.train and display the summary results.

fit.train=lm(salary ~ ., data = Salaries, subset = train)
summary(fit.train)
## 
## Call:
## lm(formula = salary ~ ., data = Salaries, subset = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64470 -13263  -1590  10311  99825 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    66727.2     5192.2  12.851  < 2e-16 ***
## rankAssocProf  14638.7     4614.7   3.172  0.00166 ** 
## rankProf       46355.4     4708.1   9.846  < 2e-16 ***
## disciplineB    16170.5     2616.6   6.180 2.02e-09 ***
## yrs.since.phd    448.7      272.1   1.649  0.10016    
## yrs.service     -468.7      239.1  -1.960  0.05087 .  
## sexMale         3225.5     4450.4   0.725  0.46914    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22380 on 310 degrees of freedom
## Multiple R-squared:  0.456,  Adjusted R-squared:  0.4455 
## F-statistic: 43.31 on 6 and 310 DF,  p-value: < 2.2e-16

1.4 Using the fit.train model, compute the MSE for the full data set and for the train and test subsets. Store the results in objects named full.mse, train.mse and test.mse, respectively. Then, use the c() function to display these three results with their respective labels “Full MSE”, “Train MSE” and “Test MSE”.

full.mse <- mean((salary-predict(fit.train,Salaries))^2)
train.mse <- mean((salary-predict(fit.train,Salaries))[train]^2)
test.mse <- mean((salary-predict(fit.train,Salaries))[-train]^2)
mse.all = c("Full MSE"=full.mse, "Train MSE"=train.mse, "Test MSE"=test.mse)
mse.all
##  Full MSE Train MSE  Test MSE 
## 500994115 489631119 546019986

1.5 Analyze the difference between these MSE values and briefly comment on your conclusions. Is this what you expected? Why or why not?

Yes, this is expected. The training MSE is almost always lower than the test MSE because the model's coefficients were estimated on the very data used to compute the training error, so the model fits that data best. Here the test MSE (≈546M) exceeds the train MSE (≈490M), as anticipated; when a model is overfit, this gap tends to widen further.

2. K-Fold Cross-Validation (KFCV) and Leave One Out Cross Validation (LOOCV)

2.1 Using the Salaries{car} data set, fit a GLM model to predict salary using all predictors. Display the summary results. Store the results in an object named glm.fit. Tip: when you use the glm() function you need to specify the family and the link function. However, if you don’t specify a family, the gaussian family (i.e., normal distribution) and the “identity” link (i.e., no transformation of the response variable) will be used as defaults. So just use the glm() function exactly how you use the lm() function and the result will be an OLS model.

glm.fit = glm(salary ~ ., data=Salaries)
summary(glm.fit)
## 
## Call:
## glm(formula = salary ~ ., data = Salaries)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -65248  -13211   -1775   10384   99592  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    65955.2     4588.6  14.374  < 2e-16 ***
## rankAssocProf  12907.6     4145.3   3.114  0.00198 ** 
## rankProf       45066.0     4237.5  10.635  < 2e-16 ***
## disciplineB    14417.6     2342.9   6.154 1.88e-09 ***
## yrs.since.phd    535.1      241.0   2.220  0.02698 *  
## yrs.service     -489.5      211.9  -2.310  0.02143 *  
## sexMale         4783.5     3858.7   1.240  0.21584    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 507990599)
## 
##     Null deviance: 3.6330e+11  on 396  degrees of freedom
## Residual deviance: 1.9812e+11  on 390  degrees of freedom
## AIC: 9093.8
## 
## Number of Fisher Scoring iterations: 2
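As a sanity check (a sketch with a hypothetical object name, not part of the assignment), the glm() defaults described in the tip can be made explicit and compared against lm():

# Gaussian family with identity link reproduces the OLS fit
glm.check <- glm(salary ~ ., data = Salaries, family = gaussian(link = "identity"))
all.equal(coef(glm.check), coef(lm(salary ~ ., data = Salaries)))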

2.2 Using the cv.glm(){boot} function and the glm.fit object above, compute and display the LOOCV MSE (Leave One Out) for this model (stored in the first element of the “delta” attribute).

library(boot)
cv.loo <- cv.glm(Salaries,glm.fit)
cv.loo$delta[1]
## [1] 516540699
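Note that cv.glm() returns two delta values; displaying both (a quick sketch) shows the raw cross-validation estimate followed by its bias-corrected version, which are nearly identical under LOOCV:

cv.loo$delta   # raw CV estimate, then bias-corrected estimate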

2.4 Using the same cv.glm(){boot} function and glm.fit model object, compute and display the 10-Fold cross validation MSE for this model.

cv.10K <- cv.glm(Salaries,glm.fit,K=10)
cv.10K$delta[1]
## [1] 517705511

2.5 Compare the differences between the 10FCV result above and this LOOCV result and provide a brief concluding comment. Is there a meaning to the difference between these 2 MSE values? Please explain why or why not.

Yes, there is a small difference, and it has a meaning. 10FCV splits the data into 10 roughly equal folds and fits 10 models, each trained on 90% of the data, so it is far less computationally intensive; LOOCV fits n models, leaving out one observation at a time. Because the two procedures train on different subsets (and 10FCV's folds are randomly assigned), their MSE estimates differ slightly (517.7M vs. 516.5M). Both are close, and either is a reasonable estimate of the test error.
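To make the mechanics concrete, here is a minimal manual 10-fold CV sketch (it assumes Salaries is loaded; fold assignments are random, so the result will differ slightly from cv.glm()):

set.seed(15)
folds <- sample(rep(1:10, length.out = nrow(Salaries)))   # random fold labels
fold.mse <- sapply(1:10, function(k) {
  fit <- lm(salary ~ ., data = Salaries[folds != k, ])    # train on 9 folds
  mean((Salaries$salary[folds == k] -
          predict(fit, Salaries[folds == k, ]))^2)        # test on fold k
})
mean(fold.mse)   # the 10-fold CV MSE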

Model Variable Selection

3. Collinearity Analysis

3.1 Fit a full model to predict applications using all remaining variables as predictors and name it lm.fit.all.

library(ISLR)
library(perturb)
lm.fit.all <-lm(Apps~Accept+Enroll+Top10perc+Top25perc+F.Undergrad+P.Undergrad+Outstate+ Room.Board+Books+Personal+PhD+Terminal+S.F.Ratio+perc.alumni+Expend+Grad.Rate, College)
colldiag(lm.fit.all, scale = FALSE, center = FALSE, add.intercept = TRUE)
## Condition
## Index    Variance Decomposition Proportions
##                intercept Accept Enroll Top10perc Top25perc F.Undergrad
## 1        1.000 0.000     0.000  0.000  0.000     0.000     0.000      
## 2        2.942 0.000     0.004  0.000  0.000     0.000     0.030      
## 3        5.776 0.000     0.000  0.000  0.000     0.000     0.000      
## 4       11.899 0.000     0.076  0.000  0.000     0.000     0.000      
## 5       16.068 0.000     0.008  0.000  0.000     0.000     0.007      
## 6       16.777 0.000     0.528  0.000  0.000     0.000     0.255      
## 7       24.945 0.000     0.056  0.000  0.000     0.000     0.040      
## 8       81.936 0.000     0.216  0.928  0.000     0.000     0.620      
## 9       96.126 0.000     0.020  0.033  0.000     0.000     0.009      
## 10     666.209 0.000     0.021  0.009  0.005     0.027     0.011      
## 11    1028.079 0.000     0.001  0.000  0.049     0.085     0.000      
## 12    1225.757 0.000     0.000  0.005  0.011     0.018     0.014      
## 13    1844.757 0.000     0.029  0.007  0.002     0.002     0.002      
## 14    2600.748 0.000     0.000  0.004  0.264     0.349     0.002      
## 15    3068.396 0.000     0.005  0.002  0.500     0.434     0.000      
## 16    4639.750 0.000     0.010  0.003  0.094     0.047     0.001      
## 17  172742.443 1.000     0.026  0.009  0.074     0.039     0.008      
##    P.Undergrad Outstate Room.Board Books Personal PhD   Terminal S.F.Ratio
## 1  0.000       0.007    0.000      0.000 0.000    0.000 0.000    0.000    
## 2  0.001       0.009    0.000      0.000 0.000    0.000 0.000    0.000    
## 3  0.000       0.193    0.005      0.000 0.000    0.000 0.000    0.000    
## 4  0.377       0.068    0.047      0.000 0.017    0.000 0.000    0.000    
## 5  0.464       0.299    0.249      0.000 0.031    0.000 0.000    0.000    
## 6  0.079       0.126    0.068      0.000 0.000    0.000 0.000    0.000    
## 7  0.007       0.042    0.216      0.000 0.623    0.000 0.000    0.000    
## 8  0.011       0.003    0.000      0.014 0.013    0.000 0.000    0.000    
## 9  0.000       0.001    0.130      0.787 0.136    0.000 0.000    0.000    
## 10 0.009       0.145    0.101      0.059 0.027    0.029 0.021    0.000    
## 11 0.031       0.005    0.062      0.004 0.025    0.044 0.043    0.000    
## 12 0.016       0.028    0.023      0.002 0.000    0.019 0.009    0.000    
## 13 0.000       0.046    0.025      0.001 0.008    0.004 0.001    0.000    
## 14 0.001       0.002    0.021      0.047 0.007    0.508 0.286    0.004    
## 15 0.003       0.004    0.000      0.003 0.005    0.387 0.564    0.003    
## 16 0.000       0.021    0.012      0.031 0.015    0.009 0.025    0.630    
## 17 0.000       0.002    0.040      0.052 0.094    0.000 0.052    0.362    
##    perc.alumni Expend Grad.Rate
## 1  0.000       0.014  0.000    
## 2  0.000       0.006  0.000    
## 3  0.000       0.627  0.000    
## 4  0.000       0.002  0.000    
## 5  0.000       0.000  0.000    
## 6  0.000       0.003  0.000    
## 7  0.000       0.000  0.000    
## 8  0.000       0.002  0.000    
## 9  0.000       0.000  0.000    
## 10 0.003       0.012  0.037    
## 11 0.001       0.050  0.002    
## 12 0.018       0.038  0.687    
## 13 0.954       0.002  0.126    
## 14 0.011       0.064  0.010    
## 15 0.003       0.074  0.011    
## 16 0.004       0.064  0.051    
## 17 0.006       0.040  0.076

3.2 Does the CI provide evidence of severe multicollinearity with the model? Why or why not?

Yes. The condition indices provide evidence of severe multicollinearity: many of them are well above the common threshold of 30 (and the stricter 50), with several in the hundreds or thousands, indicating strong near-linear dependencies among the predictors.
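As a numeric companion (a sketch that assumes the colldiag object stores its condition indices in a $condindx component), one can count how many indices exceed the threshold:

cd <- colldiag(lm.fit.all, scale = FALSE, center = FALSE, add.intercept = TRUE)
sum(cd$condindx > 30)   # number of condition indices above 30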

3.3 Run the same colldiag() diagnostic, but first using scale=FALSE, center=TRUE, add.intercept=FALSE and then again using scale=TRUE, center=TRUE, add.intercept=FALSE. How do your results change. Please explain why these results changed, if they did?

The condition indices change as the centering and scaling options change because the predictors are evaluated in different forms. Centering (center=TRUE) removes the collinearity that involves the intercept, which eliminates the enormous intercept-driven index (≈172,742). Scaling (scale=TRUE) additionally puts all predictors on a common standard-deviation scale, so the indices are no longer inflated by differences in the variables' units and magnitudes; the largest index drops to a modest 13.7.

colldiag(lm.fit.all, scale = FALSE, center = TRUE, add.intercept = FALSE)
## Condition
## Index    Variance Decomposition Proportions
##              Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1      1.000 0.000  0.000  0.000     0.000     0.001       0.000      
## 2      1.117 0.005  0.000  0.000     0.000     0.031       0.001      
## 3      2.495 0.004  0.000  0.000     0.000     0.002       0.000      
## 4      4.887 0.055  0.000  0.000     0.000     0.000       0.741      
## 5      6.211 0.536  0.000  0.000     0.000     0.260       0.097      
## 6      7.853 0.084  0.000  0.000     0.000     0.043       0.059      
## 7     10.142 0.006  0.000  0.000     0.000     0.006       0.032      
## 8     30.689 0.251  0.974  0.000     0.000     0.618       0.009      
## 9     38.650 0.000  0.000  0.000     0.000     0.000       0.000      
## 10   289.187 0.005  0.002  0.022     0.063     0.015       0.017      
## 11   414.511 0.002  0.004  0.013     0.027     0.006       0.035      
## 12   489.918 0.004  0.001  0.022     0.059     0.005       0.004      
## 13   683.191 0.030  0.007  0.002     0.001     0.002       0.000      
## 14  1018.983 0.004  0.006  0.192     0.217     0.004       0.000      
## 15  1155.067 0.014  0.006  0.748     0.633     0.000       0.004      
## 16  2113.142 0.000  0.000  0.000     0.000     0.006       0.000      
##    Outstate Room.Board Books Personal PhD   Terminal S.F.Ratio perc.alumni
## 1  0.038    0.000      0.000 0.000    0.000 0.000    0.000     0.000      
## 2  0.000    0.000      0.000 0.000    0.000 0.000    0.000     0.000      
## 3  0.443    0.002      0.000 0.000    0.000 0.000    0.000     0.000      
## 4  0.024    0.004      0.000 0.002    0.000 0.000    0.000     0.000      
## 5  0.107    0.054      0.000 0.001    0.000 0.000    0.000     0.000      
## 6  0.151    0.852      0.000 0.002    0.000 0.000    0.000     0.000      
## 7  0.013    0.009      0.000 0.933    0.000 0.000    0.000     0.000      
## 8  0.001    0.016      0.000 0.000    0.000 0.000    0.000     0.000      
## 9  0.001    0.016      0.969 0.033    0.000 0.000    0.000     0.000      
## 10 0.112    0.000      0.001 0.004    0.024 0.013    0.000     0.004      
## 11 0.005    0.005      0.003 0.002    0.110 0.079    0.000     0.004      
## 12 0.040    0.012      0.006 0.009    0.010 0.006    0.000     0.023      
## 13 0.044    0.024      0.000 0.007    0.005 0.001    0.000     0.949      
## 14 0.000    0.005      0.014 0.001    0.596 0.588    0.000     0.012      
## 15 0.002    0.001      0.007 0.000    0.246 0.313    0.000     0.001      
## 16 0.020    0.000      0.000 0.006    0.009 0.000    1.000     0.008      
##    Expend Grad.Rate
## 1  0.150  0.000    
## 2  0.021  0.000    
## 3  0.479  0.000    
## 4  0.001  0.000    
## 5  0.003  0.000    
## 6  0.003  0.000    
## 7  0.007  0.000    
## 8  0.004  0.000    
## 9  0.003  0.000    
## 10 0.071  0.026    
## 11 0.000  0.112    
## 12 0.020  0.719    
## 13 0.002  0.136    
## 14 0.024  0.006    
## 15 0.082  0.000    
## 16 0.128  0.000
colldiag(lm.fit.all, scale = TRUE, center = TRUE, add.intercept = FALSE)
## Condition
## Index    Variance Decomposition Proportions
##            Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1    1.000 0.000  0.000  0.004     0.004     0.000       0.000      
## 2    1.161 0.007  0.003  0.000     0.000     0.003       0.018      
## 3    2.117 0.001  0.000  0.000     0.000     0.000       0.007      
## 4    2.370 0.000  0.000  0.023     0.033     0.000       0.061      
## 5    2.389 0.018  0.003  0.001     0.003     0.002       0.002      
## 6    2.488 0.000  0.000  0.000     0.000     0.000       0.026      
## 7    2.946 0.000  0.000  0.006     0.004     0.000       0.001      
## 8    3.008 0.001  0.000  0.008     0.008     0.000       0.256      
## 9    3.163 0.013  0.003  0.028     0.049     0.001       0.435      
## 10   3.609 0.001  0.000  0.001     0.000     0.000       0.090      
## 11   4.096 0.001  0.000  0.000     0.043     0.001       0.014      
## 12   4.879 0.014  0.000  0.001     0.007     0.001       0.012      
## 13   6.027 0.002  0.000  0.010     0.024     0.001       0.000      
## 14   7.118 0.606  0.038  0.124     0.180     0.116       0.027      
## 15   8.045 0.201  0.004  0.780     0.628     0.054       0.029      
## 16  13.732 0.135  0.947  0.013     0.016     0.820       0.024      
##    Outstate Room.Board Books Personal PhD   Terminal S.F.Ratio perc.alumni
## 1  0.006    0.007      0.000 0.001    0.004 0.004    0.006     0.007      
## 2  0.002    0.000      0.002 0.010    0.002 0.002    0.005     0.004      
## 3  0.001    0.011      0.356 0.149    0.006 0.003    0.043     0.010      
## 4  0.012    0.154      0.018 0.029    0.010 0.018    0.003     0.028      
## 5  0.007    0.037      0.005 0.079    0.063 0.059    0.006     0.000      
## 6  0.000    0.013      0.434 0.089    0.004 0.010    0.155     0.002      
## 7  0.005    0.042      0.044 0.562    0.000 0.000    0.046     0.024      
## 8  0.000    0.050      0.068 0.019    0.000 0.001    0.005     0.515      
## 9  0.000    0.057      0.022 0.009    0.014 0.026    0.087     0.024      
## 10 0.025    0.116      0.015 0.033    0.010 0.005    0.287     0.229      
## 11 0.020    0.201      0.003 0.001    0.001 0.003    0.346     0.032      
## 12 0.821    0.290      0.003 0.006    0.000 0.000    0.000     0.080      
## 13 0.000    0.008      0.029 0.002    0.843 0.789    0.007     0.008      
## 14 0.084    0.006      0.000 0.008    0.002 0.003    0.002     0.024      
## 15 0.017    0.000      0.001 0.002    0.038 0.074    0.000     0.004      
## 16 0.001    0.006      0.000 0.001    0.000 0.002    0.002     0.009      
##    Expend Grad.Rate
## 1  0.008  0.009    
## 2  0.000  0.001    
## 3  0.016  0.017    
## 4  0.002  0.013    
## 5  0.001  0.061    
## 6  0.036  0.024    
## 7  0.027  0.269    
## 8  0.004  0.001    
## 9  0.003  0.001    
## 10 0.016  0.468    
## 11 0.531  0.084    
## 12 0.169  0.038    
## 13 0.015  0.008    
## 14 0.056  0.005    
## 15 0.119  0.002    
## 16 0.000  0.000

3.4 Display the lm.fit.all model summary results and the variance inflation factors (VIF’s) for the predictors in the model.

summary(lm.fit.all)
## 
## Call:
## lm(formula = Apps ~ Accept + Enroll + Top10perc + Top25perc + 
##     F.Undergrad + P.Undergrad + Outstate + Room.Board + Books + 
##     Personal + PhD + Terminal + S.F.Ratio + perc.alumni + Expend + 
##     Grad.Rate, data = College)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5060.6  -429.2   -18.0   309.7  7661.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.948e+02  3.916e+02  -2.285  0.02259 *  
## Accept       1.591e+00  4.103e-02  38.784  < 2e-16 ***
## Enroll      -8.903e-01  1.874e-01  -4.751 2.42e-06 ***
## Top10perc    4.968e+01  5.621e+00   8.838  < 2e-16 ***
## Top25perc   -1.438e+01  4.514e+00  -3.185  0.00151 ** 
## F.Undergrad  7.303e-02  3.267e-02   2.235  0.02570 *  
## P.Undergrad  4.990e-02  3.236e-02   1.542  0.12346    
## Outstate    -1.093e-01  1.804e-02  -6.058 2.17e-09 ***
## Room.Board   1.351e-01  4.846e-02   2.788  0.00544 ** 
## Books       -9.177e-03  2.401e-01  -0.038  0.96952    
## Personal     3.147e-02  6.357e-02   0.495  0.62068    
## PhD         -6.789e+00  4.644e+00  -1.462  0.14417    
## Terminal    -1.371e+00  5.105e+00  -0.269  0.78834    
## S.F.Ratio    2.305e+01  1.293e+01   1.783  0.07506 .  
## perc.alumni -1.170e+00  4.117e+00  -0.284  0.77640    
## Expend       8.196e-02  1.239e-02   6.614 7.06e-11 ***
## Grad.Rate    8.041e+00  2.967e+00   2.710  0.00687 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1049 on 760 degrees of freedom
## Multiple R-squared:  0.928,  Adjusted R-squared:  0.9265 
## F-statistic: 612.1 on 16 and 760 DF,  p-value: < 2.2e-16
library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:boot':
## 
##     logit
vif(lm.fit.all)
##      Accept      Enroll   Top10perc   Top25perc F.Undergrad P.Undergrad 
##    7.126046   21.360983    6.928113    5.630782   17.697719    1.709601 
##    Outstate  Room.Board       Books    Personal         PhD    Terminal 
##    3.711710    1.990349    1.107372    1.305168    4.051287    3.979967 
##   S.F.Ratio perc.alumni      Expend   Grad.Rate 
##    1.845581    1.833717    2.950354    1.829794

3.5 Briefly answer: based on your VIF results, is multicollinearity a problem? Why or why not? If so, which variables pose the main problem?

Yes, for some variables multicollinearity is a problem. Enroll (VIF ≈ 21.4) and F.Undergrad (VIF ≈ 17.7) are well above the common cutoff of 10, and Accept (≈ 7.1), Top10perc (≈ 6.9) and Top25perc (≈ 5.6) exceed 5. Enroll and F.Undergrad pose the main problem.
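A short sketch to flag the offending predictors directly:

v <- vif(lm.fit.all)
sort(v[v > 5], decreasing = TRUE)   # VIFs above the cautionary cutoff of 5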

3.6 Fit a reduced model to predict Apps on Enroll and Top10perc only. Name it lm.fit.reduced. Display the CI (using scale=TRUE, center=TRUE, add.intercept=FALSE), model summary results and the VIF’s.

lm.fit.reduced <-lm(Apps~Enroll+Top10perc, data = College)
colldiag(lm.fit.reduced, scale = TRUE, center = TRUE, add.intercept = FALSE)
## Condition
## Index    Variance Decomposition Proportions
##          Enroll Top10perc
## 1  1.000 0.409  0.409    
## 2  1.201 0.591  0.591
summary(lm.fit.reduced)
## 
## Call:
## lm(formula = Apps ~ Enroll + Top10perc, data = College)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9468   -666   -115    409  32087 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -795.11465  134.15208  -5.927 4.64e-09 ***
## Enroll         3.38249    0.07572  44.671  < 2e-16 ***
## Top10perc     42.03776    3.98841  10.540  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1927 on 774 degrees of freedom
## Multiple R-squared:  0.7526, Adjusted R-squared:  0.752 
## F-statistic:  1177 on 2 and 774 DF,  p-value: < 2.2e-16
vif(lm.fit.reduced)
##    Enroll Top10perc 
##  1.033984  1.033984

3.7 Is there a multicollinearity issue in the model above? Why or why not?

No. Multicollinearity is not an issue in the reduced model: the condition indices are near 1 and both VIF’s are about 1.03, so Enroll and Top10perc are nearly uncorrelated.
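For a two-predictor model the VIF has a closed form, VIF = 1/(1 - r^2), so a quick sketch can reproduce the value above from the pairwise correlation:

r <- cor(College$Enroll, College$Top10perc)
1 / (1 - r^2)   # matches the reported VIF of about 1.034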

4. Variable Selection: Subset Comparison

4.1 Fit a large model with all variables that make sense from a business standpoint: Enroll, Top10perc, Outstate, Room.Board, PhD, S.F.Ratio, Expend and Grad.Rate. Name this model lm.fit.large. Display the model summary results.

lm.fit.large<-lm(Apps~Enroll+Top10perc+Outstate+Room.Board+PhD+S.F.Ratio+Expend+Grad.Rate, College)
summary(lm.fit.large)
## 
## Call:
## lm(formula = Apps ~ Enroll + Top10perc + Outstate + Room.Board + 
##     PhD + S.F.Ratio + Expend + Grad.Rate, data = College)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8652.4  -617.6  -114.7   474.0 31465.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.922e+03  5.586e+02  -7.023 4.80e-12 ***
## Enroll       3.416e+00  8.293e-02  41.187  < 2e-16 ***
## Top10perc    1.613e+01  5.689e+00   2.834  0.00471 ** 
## Outstate    -3.656e-02  2.940e-02  -1.243  0.21408    
## Room.Board   4.400e-01  8.062e-02   5.458 6.49e-08 ***
## PhD         -4.122e+00  5.146e+00  -0.801  0.42336    
## S.F.Ratio    4.704e+01  2.229e+01   2.110  0.03519 *  
## Expend       9.719e-02  2.104e-02   4.619 4.52e-06 ***
## Grad.Rate    1.493e+01  4.875e+00   3.064  0.00226 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1827 on 768 degrees of freedom
## Multiple R-squared:  0.7796, Adjusted R-squared:  0.7773 
## F-statistic: 339.5 on 8 and 768 DF,  p-value: < 2.2e-16

4.2 Then, compute the VIF’s for this large model and then conduct an ANOVA F test to evaluate if the larger model has more predictive power than the lm.fit.reduced model above. Provide your brief conclusions about what these two tests are telling you and pick a model based on this analysis.

vif(lm.fit.large)
##     Enroll  Top10perc   Outstate Room.Board        PhD  S.F.Ratio 
##   1.380973   2.342835   3.254419   1.818213   1.642117   1.811435 
##     Expend  Grad.Rate 
##   2.808017   1.631037
anova(lm.fit.reduced, lm.fit.large)
## Analysis of Variance Table
## 
## Model 1: Apps ~ Enroll + Top10perc
## Model 2: Apps ~ Enroll + Top10perc + Outstate + Room.Board + PhD + S.F.Ratio + 
##     Expend + Grad.Rate
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1    774 2875431853                                  
## 2    768 2562315222  6 313116630 15.642 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All VIF’s in the large model are below 5, so multicollinearity is not a concern here. The ANOVA partial F test (F = 15.64, p < 2.2e-16) shows that the six additional predictors significantly improve the fit, so the larger model has more predictive power. Based on these two results, I would pick lm.fit.large.

4.3 Best Subset Selection. Fit the same lm.fit.large model above, but this time use the regsubsets(){leaps} function. Store the regsubsets() results in an object named fit.large.sum (note that it is the summary() of this object, not an lm() object, that we will inspect). Display summary(fit.large.sum) to see all 8 models evaluated by regsubsets().

One nice thing about the regsubsets() function is that it provides various fit statistics for all the models tried. In this case, the default is 8 models (from 1 to 8 predictors), so the fit.large.sum$rss and fit.large.sum$adjr2 attributes contain 2 vectors with 8 elements each, containing the RSS and Adjusted R-Squared for each of the 8 models.

Display these RSS and AdjR2 values as a table by binding the 2 vectors with the cbind() function and naming the two columns “RSS” and “AdjR2” respectively.

library(leaps)
fit.large.sum <- regsubsets(Apps~Enroll+Top10perc+Outstate+Room.Board+PhD+S.F.Ratio+Expend+Grad.Rate, data =  College)
summary(fit.large.sum)
## Subset selection object
## Call: regsubsets.formula(Apps ~ Enroll + Top10perc + Outstate + Room.Board + 
##     PhD + S.F.Ratio + Expend + Grad.Rate, data = College)
## 8 Variables  (and intercept)
##            Forced in Forced out
## Enroll         FALSE      FALSE
## Top10perc      FALSE      FALSE
## Outstate       FALSE      FALSE
## Room.Board     FALSE      FALSE
## PhD            FALSE      FALSE
## S.F.Ratio      FALSE      FALSE
## Expend         FALSE      FALSE
## Grad.Rate      FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          Enroll Top10perc Outstate Room.Board PhD S.F.Ratio Expend
## 1  ( 1 ) "*"    " "       " "      " "        " " " "       " "   
## 2  ( 1 ) "*"    " "       " "      " "        " " " "       "*"   
## 3  ( 1 ) "*"    " "       " "      "*"        " " " "       "*"   
## 4  ( 1 ) "*"    " "       " "      "*"        " " " "       "*"   
## 5  ( 1 ) "*"    "*"       " "      "*"        " " " "       "*"   
## 6  ( 1 ) "*"    "*"       " "      "*"        " " "*"       "*"   
## 7  ( 1 ) "*"    "*"       "*"      "*"        " " "*"       "*"   
## 8  ( 1 ) "*"    "*"       "*"      "*"        "*" "*"       "*"   
##          Grad.Rate
## 1  ( 1 ) " "      
## 2  ( 1 ) " "      
## 3  ( 1 ) " "      
## 4  ( 1 ) "*"      
## 5  ( 1 ) "*"      
## 6  ( 1 ) "*"      
## 7  ( 1 ) "*"      
## 8  ( 1 ) "*"
ADJR2 = summary(fit.large.sum)$adjr2
RSS = summary(fit.large.sum)$rss
cbind(RSS, ADJR2)
##             RSS     ADJR2
## [1,] 3288139052 0.7167426
## [2,] 2796443926 0.7587885
## [3,] 2655057929 0.7706877
## [4,] 2608042300 0.7744566
## [5,] 2587933194 0.7759053
## [6,] 2570557773 0.7771208
## [7,] 2564455998 0.7773607
## [8,] 2562315222 0.7772569

4.4 Plot these RSS and AdjR2 side by side. Tip: (1) start with par(mfrow=c(1,2)) to split the display into 1 row and 2 columns; (2) then use the plot() functions with appropriate labels and use the attribute type=“l” to get a line; (3) then reset the display to a single plot with par(mfrow=c(1,1)). Based on your plot, which is the best model? Fit an lm() model with the predictors in your selected best model and display the summary() results.

par(mfrow=c(1,2))
plot(RSS, xlab="Number of Variables", ylab="RSS", type="l")
plot(ADJR2, xlab="Number of Variables", ylab="Adjusted R2", type="l")
par(mfrow=c(1,1))
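The elbow can also be located numerically rather than by eye (a quick sketch using the vectors built above):

which.max(ADJR2)   # the model size with the highest Adjusted R-squared
diff(ADJR2)        # incremental gains; they become negligible after 4 predictors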

The Adjusted R2 curve flattens after four variables, so the 4-predictor model (Enroll, Room.Board, Expend, Grad.Rate) offers the best trade-off between fit and parsimony.

fit.apps <- lm(Apps~Enroll+Room.Board+Expend+Grad.Rate, College) 
summary(fit.apps)
## 
## Call:
## lm(formula = Apps ~ Enroll + Room.Board + Expend + Grad.Rate, 
##     data = College)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8151.2  -626.3  -118.1   443.3 31524.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.326e+03  3.232e+02 -10.291  < 2e-16 ***
## Enroll       3.519e+00  7.144e-02  49.253  < 2e-16 ***
## Room.Board   3.742e-01  7.281e-02   5.139  3.5e-07 ***
## Expend       9.192e-02  1.507e-02   6.098  1.7e-09 ***
## Grad.Rate    1.626e+01  4.358e+00   3.731 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1838 on 772 degrees of freedom
## Multiple R-squared:  0.7756, Adjusted R-squared:  0.7745 
## F-statistic: 667.1 on 4 and 772 DF,  p-value: < 2.2e-16

4.5 Let’s try a couple of Stepwise approaches to variable selection using the step(){stats} function. For both approaches, do a stepwise to select the optimal model between lm.fit.reduced and lm.fit.large (tip: the scope=list() functions should have the same scope for both models, from the lower bound model of lm.fit.reduced to the upper bound model of lm.fit.large). Also, in both cases, use direction=“both” (for stepwise) and test=“F” to get p-values for the predictors.

Name the first model lm.step.forward and use the lm.fit.reduced model as the starting base. Name the second model lm.step.back, use the lm.fit.large model as the starting base (Tip: the first approach will start with the reduced model and proceed forward towards the large model, but in a stepwise fashion. The second approach will start with the large model and proceed backwards towards the reduced model, but in a stepwise fashion).

After you model both stepwise approaches, display the summary() for both models.

lm.step.forward <- step(lm.fit.reduced, scope=list(lower= lm.fit.reduced, upper=lm.fit.large), direction="both", test="F")
## Start:  AIC=11757.37
## Apps ~ Enroll + Top10perc
## 
##              Df Sum of Sq        RSS   AIC F value    Pr(>F)    
## + Room.Board  1 218577239 2656854614 11698 63.5941 5.497e-15 ***
## + Expend      1 135660161 2739771692 11722 38.2752 9.955e-10 ***
## + Outstate    1  89091480 2786340372 11735 24.7162 8.186e-07 ***
## + Grad.Rate   1  76875307 2798556546 11738 21.2340 4.754e-06 ***
## + S.F.Ratio   1   8369163 2867062690 11757  2.2564    0.1335    
## <none>                    2875431853 11757                      
## + PhD         1   6774665 2868657188 11758  1.8255    0.1771    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=11697.94
## Apps ~ Enroll + Top10perc + Room.Board
## 
##              Df Sum of Sq        RSS   AIC F value    Pr(>F)    
## + Expend      1  45538227 2611316387 11686 13.4628 0.0002601 ***
## + Grad.Rate   1  22200694 2634653920 11693  6.5052 0.0109479 *  
## <none>                    2656854614 11698                      
## + Outstate    1   1265831 2655588783 11700  0.3680 0.5442821    
## + S.F.Ratio   1    298859 2656555755 11700  0.0868 0.7683012    
## + PhD         1    273803 2656580811 11700  0.0796 0.7779608    
## - Room.Board  1 218577239 2875431853 11757 63.5941 5.497e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=11686.51
## Apps ~ Enroll + Top10perc + Room.Board + Expend
## 
##              Df Sum of Sq        RSS   AIC F value    Pr(>F)    
## + Grad.Rate   1  23383194 2587933194 11682  6.9663 0.0084735 ** 
## + S.F.Ratio   1  15066751 2596249636 11684  4.4743 0.0347280 *  
## <none>                    2611316387 11686                      
## + Outstate    1   2092747 2609223640 11688  0.6184 0.4318893    
## + PhD         1   1247624 2610068763 11688  0.3685 0.5439792    
## - Expend      1  45538227 2656854614 11698 13.4628 0.0002601 ***
## - Room.Board  1 128455305 2739771692 11722 37.9761 1.153e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=11681.52
## Apps ~ Enroll + Top10perc + Room.Board + Expend + Grad.Rate
## 
##              Df Sum of Sq        RSS   AIC F value    Pr(>F)    
## + S.F.Ratio   1  17375420 2570557773 11678  5.2047 0.0227976 *  
## + Outstate    1   9759076 2578174118 11681  2.9147 0.0881813 .  
## <none>                    2587933194 11682                      
## + PhD         1   1750276 2586182918 11683  0.5211 0.4705834    
## - Grad.Rate   1  23383194 2611316387 11686  6.9663 0.0084735 ** 
## - Expend      1  46720727 2634653920 11693 13.9191 0.0002049 ***
## - Room.Board  1  91424463 2679357657 11706 27.2373 2.316e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=11678.28
## Apps ~ Enroll + Top10perc + Room.Board + Expend + Grad.Rate + 
##     S.F.Ratio
## 
##              Df Sum of Sq        RSS   AIC F value    Pr(>F)    
## <none>                    2570557773 11678                      
## + Outstate    1   6101776 2564455998 11678  1.8297  0.176557    
## + PhD         1   3083978 2567473795 11679  0.9237  0.336807    
## - S.F.Ratio   1  17375420 2587933194 11682  5.2047  0.022798 *  
## - Grad.Rate   1  25691863 2596249636 11684  7.6959  0.005669 ** 
## - Expend      1  63525871 2634083644 11695 19.0289 1.463e-05 ***
## - Room.Board  1  95376286 2665934059 11705 28.5696 1.192e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.step.forward)
## 
## Call:
## lm(formula = Apps ~ Enroll + Top10perc + Room.Board + Expend + 
##     Grad.Rate + S.F.Ratio, data = College)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8487.7  -651.9  -112.5   477.6 31538.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.122e+03  5.256e+02  -7.843 1.47e-14 ***
## Enroll       3.422e+00  7.709e-02  44.389  < 2e-16 ***
## Top10perc    1.372e+01  5.408e+00   2.536  0.01141 *  
## Room.Board   3.875e-01  7.250e-02   5.345 1.19e-07 ***
## Expend       8.740e-02  2.003e-02   4.362 1.46e-05 ***
## Grad.Rate    1.286e+01  4.637e+00   2.774  0.00567 ** 
## S.F.Ratio    4.981e+01  2.183e+01   2.281  0.02280 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1827 on 770 degrees of freedom
## Multiple R-squared:  0.7788, Adjusted R-squared:  0.7771 
## F-statistic:   452 on 6 and 770 DF,  p-value: < 2.2e-16
lm.step.back <- step(lm.fit.large, scope=list(lower= lm.fit.reduced, upper=lm.fit.large), direction="both", test="F")
## Start:  AIC=11679.79
## Apps ~ Enroll + Top10perc + Outstate + Room.Board + PhD + S.F.Ratio + 
##     Expend + Grad.Rate
## 
##              Df Sum of Sq        RSS   AIC F value    Pr(>F)    
## - PhD         1   2140775 2564455998 11678  0.6417  0.423361    
## - Outstate    1   5158573 2567473795 11679  1.5462  0.214081    
## <none>                    2562315222 11680                      
## - S.F.Ratio   1  14851935 2577167157 11682  4.4516  0.035192 *  
## - Grad.Rate   1  31313260 2593628482 11687  9.3855  0.002264 ** 
## - Expend      1  71183114 2633498336 11699 21.3356 4.519e-06 ***
## - Room.Board  1  99401404 2661716627 11707 29.7935 6.490e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=11678.44
## Apps ~ Enroll + Top10perc + Outstate + Room.Board + S.F.Ratio + 
##     Expend + Grad.Rate
## 
##              Df Sum of Sq        RSS   AIC F value    Pr(>F)    
## - Outstate    1   6101776 2570557773 11678  1.8297  0.176557    
## <none>                    2564455998 11678                      
## + PhD         1   2140775 2562315222 11680  0.6417  0.423361    
## - S.F.Ratio   1  13718120 2578174118 11681  4.1136  0.042883 *  
## - Grad.Rate   1  31183379 2595639376 11686  9.3509  0.002306 ** 
## - Expend      1  69627645 2634083643 11697 20.8791 5.697e-06 ***
## - Room.Board  1  97587525 2662043522 11706 29.2634 8.443e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=11678.28
## Apps ~ Enroll + Top10perc + Room.Board + S.F.Ratio + Expend + 
##     Grad.Rate
## 
##              Df Sum of Sq        RSS   AIC F value    Pr(>F)    
## <none>                    2570557773 11678                      
## + Outstate    1   6101776 2564455998 11678  1.8297  0.176557    
## + PhD         1   3083978 2567473795 11679  0.9237  0.336807    
## - S.F.Ratio   1  17375420 2587933194 11682  5.2047  0.022798 *  
## - Grad.Rate   1  25691863 2596249636 11684  7.6959  0.005669 ** 
## - Expend      1  63525871 2634083644 11695 19.0289 1.463e-05 ***
## - Room.Board  1  95376286 2665934059 11705 28.5696 1.192e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.step.back)
## 
## Call:
## lm(formula = Apps ~ Enroll + Top10perc + Room.Board + S.F.Ratio + 
##     Expend + Grad.Rate, data = College)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8487.7  -651.9  -112.5   477.6 31538.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.122e+03  5.256e+02  -7.843 1.47e-14 ***
## Enroll       3.422e+00  7.709e-02  44.389  < 2e-16 ***
## Top10perc    1.372e+01  5.408e+00   2.536  0.01141 *  
## Room.Board   3.875e-01  7.250e-02   5.345 1.19e-07 ***
## S.F.Ratio    4.981e+01  2.183e+01   2.281  0.02280 *  
## Expend       8.740e-02  2.003e-02   4.362 1.46e-05 ***
## Grad.Rate    1.286e+01  4.637e+00   2.774  0.00567 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1827 on 770 degrees of freedom
## Multiple R-squared:  0.7788, Adjusted R-squared:  0.7771 
## F-statistic:   452 on 6 and 770 DF,  p-value: < 2.2e-16

4.6 Compare the two stepwise results above. Is there any difference? Also, compare your stepwise model selection with the model selected above in 4.3 using regsubsets(). Are the models different? Which one would you pick? Is there an additional test to select the best of these models (no need to run the test, just answer the question)?

The two stepwise models are identical: both select Enroll, Top10perc, Room.Board, S.F.Ratio, Expend and Grad.Rate (Adjusted R2 = 0.7771). The model selected with regsubsets() keeps only four of those predictors and has a slightly lower Adjusted R2 (0.7745), so the models do differ. I would pick the stepwise model for its better adjusted fit, although the 4-variable model is more parsimonious. Since the smaller model is nested in the larger one, an ANOVA partial F test could be used to choose between them.
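Since the 4-variable model is nested in the 6-variable stepwise model, that test would look like this sketch (not run here, per the question):

anova(fit.apps, lm.step.forward)   # partial F test of the two extra predictors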

5. Dimension Reduction: Ridge Regression

5.1 Ridge Regression: in this section you will be using the glmnet{glmnet} and other related {glmnet} functions to model and fit a Ridge regression. First, create an x predictor matrix with all the predictors you used in lm.fit.large above, and a y vector with the Apps response variable. Then fit a Ridge regression on x and y. Name the resulting object ridge.mod and display it.

[For you only]: The ridge.mod display will be quite long because it will generate Ridge regressions with 100 different values of lambda. Think but don’t answer yet: look at how the deviance varies as the value of lambda reduces. The first regression with a very high lambda should be close to an OLS regression with no variables, just the intercept. This is called the “null” model. The last regression with a very small lambda should be close to an OLS regression with all predictors. This is called the “full” model. The trick here is to find the optimal lambda that minimizes the deviance. Can you spot the lambda that gives you the smallest deviance (i.e., the largest %Dev)?

5.2 Using the cv.glmnet(){glmnet} function, compute the cross-validation statistics for the ridge.mod model above (tip: you need to use x and y per above, not ridge.mod). Store the results in an object named ridge.cv.

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
x=model.matrix(Apps~Enroll+Top10perc+Outstate+Room.Board+PhD+S.F.Ratio+Expend+Grad.Rate, College)
y=College$Apps
ridge.mod=glmnet(x,y,alpha=0)
ridge.mod
## 
## Call:  glmnet(x = x, y = y, alpha = 0) 
## 
##        Df      %Dev    Lambda
##   [1,]  8 2.247e-36 3275000.0
##   [2,]  8 2.873e-03 2984000.0
##   [3,]  8 3.152e-03 2719000.0
##   [4,]  8 3.458e-03 2478000.0
##   [5,]  8 3.793e-03 2258000.0
##   [6,]  8 4.161e-03 2057000.0
##   [7,]  8 4.565e-03 1874000.0
##   [8,]  8 5.007e-03 1708000.0
##   [9,]  8 5.492e-03 1556000.0
##  [10,]  8 6.023e-03 1418000.0
##  [11,]  8 6.606e-03 1292000.0
##  [12,]  8 7.244e-03 1177000.0
##  [13,]  8 7.943e-03 1073000.0
##  [14,]  8 8.709e-03  977200.0
##  [15,]  8 9.549e-03  890400.0
##  [16,]  8 1.047e-02  811300.0
##  [17,]  8 1.147e-02  739200.0
##  [18,]  8 1.257e-02  673600.0
##  [19,]  8 1.378e-02  613700.0
##  [20,]  8 1.510e-02  559200.0
##  [21,]  8 1.654e-02  509500.0
##  [22,]  8 1.812e-02  464300.0
##  [23,]  8 1.984e-02  423000.0
##  [24,]  8 2.172e-02  385400.0
##  [25,]  8 2.378e-02  351200.0
##  [26,]  8 2.602e-02  320000.0
##  [27,]  8 2.847e-02  291600.0
##  [28,]  8 3.114e-02  265700.0
##  [29,]  8 3.404e-02  242100.0
##  [30,]  8 3.721e-02  220600.0
##  [31,]  8 4.066e-02  201000.0
##  [32,]  8 4.441e-02  183100.0
##  [33,]  8 4.848e-02  166800.0
##  [34,]  8 5.290e-02  152000.0
##  [35,]  8 5.769e-02  138500.0
##  [36,]  8 6.288e-02  126200.0
##  [37,]  8 6.850e-02  115000.0
##  [38,]  8 7.457e-02  104800.0
##  [39,]  8 8.113e-02   95480.0
##  [40,]  8 8.819e-02   86990.0
##  [41,]  8 9.579e-02   79270.0
##  [42,]  8 1.040e-01   72220.0
##  [43,]  8 1.127e-01   65810.0
##  [44,]  8 1.221e-01   59960.0
##  [45,]  8 1.321e-01   54630.0
##  [46,]  8 1.428e-01   49780.0
##  [47,]  8 1.542e-01   45360.0
##  [48,]  8 1.662e-01   41330.0
##  [49,]  8 1.790e-01   37660.0
##  [50,]  8 1.924e-01   34310.0
##  [51,]  8 2.066e-01   31260.0
##  [52,]  8 2.215e-01   28490.0
##  [53,]  8 2.371e-01   25960.0
##  [54,]  8 2.533e-01   23650.0
##  [55,]  8 2.702e-01   21550.0
##  [56,]  8 2.876e-01   19630.0
##  [57,]  8 3.057e-01   17890.0
##  [58,]  8 3.242e-01   16300.0
##  [59,]  8 3.431e-01   14850.0
##  [60,]  8 3.624e-01   13530.0
##  [61,]  8 3.820e-01   12330.0
##  [62,]  8 4.017e-01   11240.0
##  [63,]  8 4.216e-01   10240.0
##  [64,]  8 4.415e-01    9328.0
##  [65,]  8 4.613e-01    8499.0
##  [66,]  8 4.808e-01    7744.0
##  [67,]  8 5.002e-01    7056.0
##  [68,]  8 5.191e-01    6429.0
##  [69,]  8 5.375e-01    5858.0
##  [70,]  8 5.555e-01    5338.0
##  [71,]  8 5.728e-01    4864.0
##  [72,]  8 5.894e-01    4432.0
##  [73,]  8 6.053e-01    4038.0
##  [74,]  8 6.204e-01    3679.0
##  [75,]  8 6.347e-01    3352.0
##  [76,]  8 6.482e-01    3055.0
##  [77,]  8 6.608e-01    2783.0
##  [78,]  8 6.726e-01    2536.0
##  [79,]  8 6.836e-01    2311.0
##  [80,]  8 6.937e-01    2105.0
##  [81,]  8 7.030e-01    1918.0
##  [82,]  8 7.115e-01    1748.0
##  [83,]  8 7.193e-01    1593.0
##  [84,]  8 7.263e-01    1451.0
##  [85,]  8 7.327e-01    1322.0
##  [86,]  8 7.385e-01    1205.0
##  [87,]  8 7.436e-01    1098.0
##  [88,]  8 7.482e-01    1000.0
##  [89,]  8 7.523e-01     911.4
##  [90,]  8 7.559e-01     830.4
##  [91,]  8 7.591e-01     756.6
##  [92,]  8 7.619e-01     689.4
##  [93,]  8 7.644e-01     628.2
##  [94,]  8 7.665e-01     572.4
##  [95,]  8 7.684e-01     521.5
##  [96,]  8 7.700e-01     475.2
##  [97,]  8 7.714e-01     433.0
##  [98,]  8 7.726e-01     394.5
##  [99,]  8 7.737e-01     359.5
## [100,]  8 7.746e-01     327.5
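The null-to-full pattern can be confirmed programmatically (a sketch using glmnet's dev.ratio component, which is the %Dev column above):

range(ridge.mod$lambda)          # smallest and largest lambda tried
ridge.mod$dev.ratio[c(1, 100)]   # ~0 at the largest lambda, ~0.775 at the smallest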

5.3 Plot the ridge.cv object

ridge.cv = cv.glmnet(x,y, alpha = 0)
plot(ridge.cv)

5.4 Find the best lambda and store it in an object named bestlam. Display the bestlam object.

bestlam = ridge.cv$lambda.min
bestlam
## [1] 359.4596
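cv.glmnet() also reports a more conservative choice, the largest lambda within one standard error of the minimum; displaying it (a quick sketch) is useful for comparison:

ridge.cv$lambda.1se   # one-standard-error alternative to lambda.min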

5.5 Find the coefficients for a Ridge regression using the best lambda you just found using the keywords type=“coefficients” and s=bestlam.

predict(ridge.cv, type = "coefficients", s = bestlam)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -3.953459e+03
## (Intercept)  .           
## Enroll       3.065774e+00
## Top10perc    1.826907e+01
## Outstate    -3.808982e-02
## Room.Board   3.840698e-01
## PhD          4.193003e+00
## S.F.Ratio    5.443065e+01
## Expend       8.933287e-02
## Grad.Rate    1.298359e+01

6. Dimension Reduction: Principal Components (PCR)

6.1 Principal Components Regression (PCR): we will use the pcr(){pls} function and the College{ISLR} data set for this exercise. First, set the seed to 1 (or any number you wish) to get replicable results. Then split the data into a training subset of 70% of the data. Tip - you can use this function: train=sample(1:nrow(College), 0.7*nrow(College)), which will provide a vector named train with the indices for the training records (can you see why?).

PCR reduces the dimension of the model by replacing the original predictors with a smaller number of principal components; the train/test split is a separate step that lets us estimate test error on the held-out 30% of the records.

set.seed(1)
train=sample(1:nrow(College), 0.7*nrow(College))

6.2 Fit a PCR model to predict collegue applications as a function of all remaining variables as predictors, using the training set (i.e., College[train,]). Store the results in an object named pcr.fit. Display the results summary for pcr.fit.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1)
pcr.fit=pcr(Apps~., data=College[train,],scale=TRUE,validation="CV")
summary(pcr.fit)
## Data:    X dimension: 317 17 
##  Y dimension: 317 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            2952     2972     1431     1355     1361     1263     1139
## adjCV         2952     2972     1429     1350     1359     1229     1130
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1108     1103     1069      1051      1051      1057      1078
## adjCV     1103     1097     1058      1043      1047      1053      1073
##        14 comps  15 comps  16 comps  17 comps
## CV         1071      1064     912.6     902.3
## adjCV      1066      1071     908.0     898.1
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X     32.3135    56.33    63.68    70.12    75.32    80.28    84.31
## Apps   0.5769    77.75    80.14    80.14    85.62    86.97    87.53
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       87.38    90.18     92.70     95.07     97.01     98.29     98.95
## Apps    87.73    88.64     88.71     88.76     88.84     88.84     89.01
##       15 comps  16 comps  17 comps
## X        99.42     99.85    100.00
## Apps     89.49     91.88     92.14
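The validation section above reports RMSEPs (root mean squared errors of prediction); a short sketch converts to the MSE scale for a given number of components using the {pls} MSEP() function:

MSEP(pcr.fit, estimate = "CV", ncomp = 5)   # CV MSE at 5 components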

6.3 Display a scree plot for the MSE of pcr.fit using the validationplot(){pls} function. How many components would you select? Naturally, the MSE goes down as we add more components to the model. The key is to find the minimal number of components that provide an acceptable MSE. In other words, where is the “elbow”?

The CV error drops sharply at 2 components and levels off at about 5, so the elbow is at 5 components; I would select 5.

validationplot(pcr.fit,val.type="RMSEP")

6.4 Regardless of your answer in 3 above, compute predicted values for the test data (tip: use College[-train,] for this) using 5 components.

6.5 Compute the MSE for the test data.

pcr.pred <- predict(pcr.fit, College[-train,], ncomp=5)   # 5-component predictions on the test rows
mse.test <- mean((College$Apps[-train] - pcr.pred)^2)
mse.test
## [1] 2573802

6.6 Now fit a model using the full data set and 5 components. Store the results in an object named pcr.fit.5. Display the summary results for pcr.fit.5, its coefficients and loadings.

pcr.fit.5=pcr(Apps~., data=College, scale=TRUE, ncomp=5) 
summary(pcr.fit.5)
## Data:    X dimension: 777 17 
##  Y dimension: 777 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps
## X      31.670    57.30    64.30    69.90    75.39
## Apps    2.316    73.06    73.07    82.08    84.08
coefficients(pcr.fit.5)
## , , 5 comps
## 
##                   Apps
## PrivateYes  -180.78762
## Accept      1151.27243
## Enroll      1080.76811
## Top10perc    216.38940
## Top25perc    227.00817
## F.Undergrad 1001.26909
## P.Undergrad  390.05707
## Outstate     157.61035
## Room.Board   305.73454
## Books         89.71071
## Personal    -114.07529
## PhD         -156.09983
## Terminal    -170.91013
## S.F.Ratio     62.28650
## perc.alumni  -26.59023
## Expend       181.07726
## Grad.Rate    474.54767
loadings(pcr.fit.5)
## 
## Loadings:
##             Comp 1 Comp 2 Comp 3 Comp 4 Comp 5
## PrivateYes  -0.203  0.320 -0.149 -0.207       
## Accept             -0.419        -0.362  0.112
## Enroll             -0.443        -0.250  0.175
## Top10perc   -0.345 -0.130         0.221  0.332
## Top25perc   -0.319 -0.161         0.252  0.344
## F.Undergrad        -0.448        -0.202  0.136
## P.Undergrad  0.117 -0.297 -0.151 -0.107 -0.303
## Outstate    -0.374               -0.203 -0.138
## Room.Board  -0.283        -0.181 -0.383 -0.413
## Books                     -0.663  0.111  0.131
## Personal     0.133 -0.174 -0.472  0.334       
## PhD         -0.247 -0.255  0.205  0.346 -0.355
## Terminal    -0.252 -0.243  0.143  0.319 -0.409
## S.F.Ratio    0.268 -0.125  0.292              
## perc.alumni -0.290         0.160         0.212
## Expend      -0.337        -0.217              
## Grad.Rate   -0.297         0.174 -0.240  0.250
## 
##                Comp 1 Comp 2 Comp 3 Comp 4 Comp 5
## SS loadings     1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.059  0.059  0.059  0.059  0.059
## Cumulative Var  0.059  0.118  0.176  0.235  0.294

6.7 Partial Least Squares (PLS): Fitting a PLS model is identical to fitting a PCR model, except that you need to use the plsr() function instead of the pcr() function. Copy the entire script portion for PCR above and paste it into another R code segment. Then replace pcr() with plsr() to fit a PLS model. You should also rename all the objects in your script (e.g., rename pcr.fit to pls.fit, etc.)

set.seed(1)
plsr.fit=plsr(Apps~., data=College[train,],scale=TRUE,validation="CV")
summary(plsr.fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4149     2023     1624     1574     1483     1352     1242
## adjCV         4149     2020     1614     1577     1466     1336     1231
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1239     1233     1230      1228      1225      1226      1225
## adjCV     1228     1223     1219      1217      1215      1216      1215
##        14 comps  15 comps  16 comps  17 comps
## CV         1225      1225      1225      1225
## adjCV      1215      1215      1215      1215
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       26.58    34.50    62.81    65.71    70.16    73.53    77.38
## Apps    77.24    86.74    87.74    91.19    92.69    93.39    93.43
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       80.12    83.16     86.62     89.35     90.74     92.26     93.88
## Apps    93.49    93.52     93.54     93.55     93.56     93.57     93.57
##       15 comps  16 comps  17 comps
## X        96.08     98.25    100.00
## Apps     93.57     93.57     93.57
validationplot(plsr.fit,val.type="RMSEP")

plsr.pred <- predict(plsr.fit, College[-train,], ncomp=5)   # 5-component predictions on the test rows
mse.test <- mean((College$Apps[-train] - plsr.pred)^2)
mse.test
## [1] 1252240
plsr.fit.5=plsr(Apps~., data=College, scale=TRUE, ncomp=5) 
summary(plsr.fit.5)
## Data:    X dimension: 777 17 
##  Y dimension: 777 1
## Fit method: kernelpls
## Number of components considered: 5
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps
## X       25.76    40.33    62.59    64.97    66.87
## Apps    78.01    85.14    87.67    90.73    92.63
coefficients(plsr.fit.5)
## , , 5 comps
## 
##                   Apps
## PrivateYes  -189.85988
## Accept      3732.88397
## Enroll      -100.37277
## Top10perc    601.62345
## Top25perc   -128.78222
## F.Undergrad -311.36808
## P.Undergrad   85.34312
## Outstate    -305.43943
## Room.Board   195.77152
## Books         29.32930
## Personal      49.15994
## PhD          -69.26201
## Terminal     -99.68608
## S.F.Ratio    153.60303
## perc.alumni   32.52532
## Expend       516.49285
## Grad.Rate    105.61206
loadings(plsr.fit.5)
## 
## Loadings:
##             Comp 1 Comp 2 Comp 3 Comp 4 Comp 5
## PrivateYes  -0.257 -0.161  0.326 -0.268  0.197
## Accept       0.432  0.331         0.352  0.236
## Enroll       0.441  0.300        -0.140 -0.189
## Top10perc    0.210 -0.452  0.282              
## Top25perc    0.232 -0.449  0.233 -0.112       
## F.Undergrad  0.436  0.278        -0.255 -0.218
## P.Undergrad  0.257        -0.250 -0.835  0.773
## Outstate           -0.381  0.412 -0.187       
## Room.Board         -0.229  0.319 -0.233  0.151
## Books                                    0.142
## Personal     0.128        -0.245  0.244       
## PhD          0.292 -0.585         0.332       
## Terminal     0.282 -0.594         0.281       
## S.F.Ratio           0.282 -0.316         0.170
## perc.alumni        -0.400  0.286 -0.389  0.396
## Expend       0.144 -0.362  0.333  0.257 -0.146
## Grad.Rate          -0.192  0.359 -0.542  0.192
## 
##                Comp 1 Comp 2 Comp 3 Comp 4 Comp 5
## SS loadings     1.026  2.019  1.075  1.844  1.090
## Proportion Var  0.060  0.119  0.063  0.108  0.064
## Cumulative Var  0.060  0.179  0.242  0.351  0.415