1.1 Using the same Salaries{car} data set, load and attach the data set (i.e., attach(Salaries)) into your work environment.
library(carData)
data("Salaries")
attach(Salaries)
1.2 Enter set.seed(15) so that you get the same results if you run your cross validation commands multiple times. Then create an index vector called “train” which you can use to split the data set into 80% train and 20% test subsets.
set.seed(15)
train = sample(nrow(Salaries),0.80*nrow(Salaries))
1.3 Fit a linear model to predict salary using all remaining variables as predictors, using the train data subset. Store your resulting model in an object named fit.train and display the summary results.
fit.train=lm(salary ~ ., data = Salaries, subset = train)
summary(fit.train)
##
## Call:
## lm(formula = salary ~ ., data = Salaries, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64470 -13263 -1590 10311 99825
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66727.2 5192.2 12.851 < 2e-16 ***
## rankAssocProf 14638.7 4614.7 3.172 0.00166 **
## rankProf 46355.4 4708.1 9.846 < 2e-16 ***
## disciplineB 16170.5 2616.6 6.180 2.02e-09 ***
## yrs.since.phd 448.7 272.1 1.649 0.10016
## yrs.service -468.7 239.1 -1.960 0.05087 .
## sexMale 3225.5 4450.4 0.725 0.46914
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22380 on 310 degrees of freedom
## Multiple R-squared: 0.456, Adjusted R-squared: 0.4455
## F-statistic: 43.31 on 6 and 310 DF, p-value: < 2.2e-16
1.4 Using the fit.train model, compute the MSE for the full data set and for the train and test subsets. Store the results in objects named full.mse, train.mse and test.mse, respectively. Then, use the c() function to display these three results with their respective labels “Full MSE”, “Train MSE” and “Test MSE”.
full.mse <- mean((salary-predict(fit.train,Salaries))^2)
train.mse <- mean((salary-predict(fit.train,Salaries))[train]^2)
test.mse <- mean((salary-predict(fit.train,Salaries))[-train]^2)
mse.all = c("Full MSE"=full.mse, "Train MSE"=train.mse, "Test MSE"=test.mse)
mse.all
## Full MSE Train MSE Test MSE
## 500994115 489631119 546019986
1.5 Analyze the difference between these MSE values and briefly comment on your conclusions. Is this what you expected? Why or why not?
Yes, this is expected. The training MSE is almost always lower than the test MSE because it is computed on the same observations used to fit the model, so the coefficients are tuned to those points. The test MSE is computed on data the model never saw, giving a more honest estimate of prediction error, and the full-data MSE falls in between because it mixes the two subsets. When a model overfits, the gap between training and test error tends to grow.
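An optional aside (not required by the assignment): repeating the 80/20 split with a few different seeds shows that the test MSE is itself a random quantity that varies from split to split.
test.mse.by.seed <- sapply(1:5, function(s) {
  set.seed(s)
  tr <- sample(nrow(Salaries), 0.80 * nrow(Salaries))     # a fresh 80% training index
  f  <- lm(salary ~ ., data = Salaries, subset = tr)
  mean((Salaries$salary - predict(f, Salaries))[-tr]^2)   # test MSE for this split
})
test.mse.by.seed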
2.1 Using the Salaries{car} data set, fit a GLM model to predict salary using all predictors. Display the summary results. Store the results in an object named glm.fit. Tip: when you use the glm() function you need to specify the family and the link function. However, if you don’t specify a family, the gaussian family (i.e., normal distribution) and the “identity” link (i.e., no transformation of the response variable) will be used as defaults. So just use the glm() function exactly how you use the lm() function and the result will be an OLS model.
glm.fit = glm(salary ~ ., data=Salaries)
summary(glm.fit)
##
## Call:
## glm(formula = salary ~ ., data = Salaries)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -65248 -13211 -1775 10384 99592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65955.2 4588.6 14.374 < 2e-16 ***
## rankAssocProf 12907.6 4145.3 3.114 0.00198 **
## rankProf 45066.0 4237.5 10.635 < 2e-16 ***
## disciplineB 14417.6 2342.9 6.154 1.88e-09 ***
## yrs.since.phd 535.1 241.0 2.220 0.02698 *
## yrs.service -489.5 211.9 -2.310 0.02143 *
## sexMale 4783.5 3858.7 1.240 0.21584
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 507990599)
##
## Null deviance: 3.6330e+11 on 396 degrees of freedom
## Residual deviance: 1.9812e+11 on 390 degrees of freedom
## AIC: 9093.8
##
## Number of Fisher Scoring iterations: 2
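A quick optional check of the tip in 2.1: with the default gaussian family and identity link, glm() reproduces the ordinary lm() coefficients exactly.
all.equal(coef(glm.fit), coef(lm(salary ~ ., data = Salaries)))   # should be TRUE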
2.2 Using the cv.glm(){boot} function and the glm.fit object above, compute and display the LOOCV (Leave One Out) MSE for this model (stored in the first element of the “delta” attribute).
library(boot)
cv.loo <- cv.glm(Salaries,glm.fit)
cv.loo$delta[1]
## [1] 516540699
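For reference, delta contains two numbers: the raw cross-validation estimate of prediction error and a bias-adjusted version; for LOOCV the two are essentially identical.
cv.loo$delta   # raw and bias-adjusted CV estimates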
2.4 Using the same cv.glm(){boot} function and glm.fit model object, compute and display the 10-Fold cross validation MSE for this model.
cv.10K <- cv.glm(Salaries,glm.fit,K=10)
cv.10K$delta[1]
## [1] 517705511
2.5 Compare the differences between the 10FCV result above and this LOOCV result and provide a brief concluding comment. Is there a meaning to the difference between these 2 MSE values? Please explain why or why not.
There is a small difference, but it is not very meaningful. LOOCV fits n models, leaving out one observation at a time, and is fully deterministic; 10-fold CV splits the data into 10 equal folds and fits 10 models, each trained on 90% of the data, which is far less computationally intensive but depends on the random fold assignment. Because the two procedures use slightly different training-set sizes and different splits, their MSE estimates differ a little; here they are very close (about 516.5 million vs. 517.7 million), so both point to essentially the same expected prediction error.
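A small optional sketch of why the 10-fold figure moves around while LOOCV does not: rerunning cv.glm() with K = 10 a few times gives slightly different estimates each time.
# LOOCV is deterministic; the 10-fold estimate depends on the random folds.
set.seed(15)
replicate(3, cv.glm(Salaries, glm.fit, K = 10)$delta[1])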
3.1 Fit a full model to predict applications using all remaining variables as predictors and name it lm.fit.all.
library(ISLR)
library(perturb)
lm.fit.all <- lm(Apps ~ Accept + Enroll + Top10perc + Top25perc + F.Undergrad + P.Undergrad + Outstate + Room.Board + Books + Personal + PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Grad.Rate, data = College)
colldiag(lm.fit.all, scale = FALSE, center = FALSE, add.intercept = TRUE)
## Condition
## Index Variance Decomposition Proportions
## intercept Accept Enroll Top10perc Top25perc F.Undergrad
## 1 1.000 0.000 0.000 0.000 0.000 0.000 0.000
## 2 2.942 0.000 0.004 0.000 0.000 0.000 0.030
## 3 5.776 0.000 0.000 0.000 0.000 0.000 0.000
## 4 11.899 0.000 0.076 0.000 0.000 0.000 0.000
## 5 16.068 0.000 0.008 0.000 0.000 0.000 0.007
## 6 16.777 0.000 0.528 0.000 0.000 0.000 0.255
## 7 24.945 0.000 0.056 0.000 0.000 0.000 0.040
## 8 81.936 0.000 0.216 0.928 0.000 0.000 0.620
## 9 96.126 0.000 0.020 0.033 0.000 0.000 0.009
## 10 666.209 0.000 0.021 0.009 0.005 0.027 0.011
## 11 1028.079 0.000 0.001 0.000 0.049 0.085 0.000
## 12 1225.757 0.000 0.000 0.005 0.011 0.018 0.014
## 13 1844.757 0.000 0.029 0.007 0.002 0.002 0.002
## 14 2600.748 0.000 0.000 0.004 0.264 0.349 0.002
## 15 3068.396 0.000 0.005 0.002 0.500 0.434 0.000
## 16 4639.750 0.000 0.010 0.003 0.094 0.047 0.001
## 17 172742.443 1.000 0.026 0.009 0.074 0.039 0.008
## P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 0.000 0.007 0.000 0.000 0.000 0.000 0.000 0.000
## 2 0.001 0.009 0.000 0.000 0.000 0.000 0.000 0.000
## 3 0.000 0.193 0.005 0.000 0.000 0.000 0.000 0.000
## 4 0.377 0.068 0.047 0.000 0.017 0.000 0.000 0.000
## 5 0.464 0.299 0.249 0.000 0.031 0.000 0.000 0.000
## 6 0.079 0.126 0.068 0.000 0.000 0.000 0.000 0.000
## 7 0.007 0.042 0.216 0.000 0.623 0.000 0.000 0.000
## 8 0.011 0.003 0.000 0.014 0.013 0.000 0.000 0.000
## 9 0.000 0.001 0.130 0.787 0.136 0.000 0.000 0.000
## 10 0.009 0.145 0.101 0.059 0.027 0.029 0.021 0.000
## 11 0.031 0.005 0.062 0.004 0.025 0.044 0.043 0.000
## 12 0.016 0.028 0.023 0.002 0.000 0.019 0.009 0.000
## 13 0.000 0.046 0.025 0.001 0.008 0.004 0.001 0.000
## 14 0.001 0.002 0.021 0.047 0.007 0.508 0.286 0.004
## 15 0.003 0.004 0.000 0.003 0.005 0.387 0.564 0.003
## 16 0.000 0.021 0.012 0.031 0.015 0.009 0.025 0.630
## 17 0.000 0.002 0.040 0.052 0.094 0.000 0.052 0.362
## perc.alumni Expend Grad.Rate
## 1 0.000 0.014 0.000
## 2 0.000 0.006 0.000
## 3 0.000 0.627 0.000
## 4 0.000 0.002 0.000
## 5 0.000 0.000 0.000
## 6 0.000 0.003 0.000
## 7 0.000 0.000 0.000
## 8 0.000 0.002 0.000
## 9 0.000 0.000 0.000
## 10 0.003 0.012 0.037
## 11 0.001 0.050 0.002
## 12 0.018 0.038 0.687
## 13 0.954 0.002 0.126
## 14 0.011 0.064 0.010
## 15 0.003 0.074 0.011
## 16 0.004 0.064 0.051
## 17 0.006 0.040 0.076
3.2 Does the CI provide evidence of severe multicollinearity with the model? Why or why not?
Yes. The condition indices show severe multicollinearity: many of them far exceed the usual cutoff of 30, several are in the hundreds or thousands, and the largest is about 172,742, which indicates strong near-linear dependencies among the predictors (and the intercept) on their raw scales.
3.3 Run the same colldiag() diagnostic, but first using scale=FALSE, center=TRUE, add.intercept=FALSE and then again using scale=TRUE, center=TRUE, add.intercept=FALSE. How do your results change? Please explain why these results changed, if they did.
The condition indices change substantially. Centering the variables and dropping the intercept removes the near-dependency involving the constant term, so the largest index falls from about 172,742 to about 2,113. Scaling matters even more: once the predictors are also put on a common scale, differences in their measurement units no longer inflate the spread of the eigenvalues, and the largest condition index drops to about 13.7, below the usual trouble threshold of 30 (see the quick kappa() cross-check after the output below).
colldiag(lm.fit.all, scale = FALSE, center = TRUE, add.intercept = FALSE)
## Condition
## Index Variance Decomposition Proportions
## Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1 1.000 0.000 0.000 0.000 0.000 0.001 0.000
## 2 1.117 0.005 0.000 0.000 0.000 0.031 0.001
## 3 2.495 0.004 0.000 0.000 0.000 0.002 0.000
## 4 4.887 0.055 0.000 0.000 0.000 0.000 0.741
## 5 6.211 0.536 0.000 0.000 0.000 0.260 0.097
## 6 7.853 0.084 0.000 0.000 0.000 0.043 0.059
## 7 10.142 0.006 0.000 0.000 0.000 0.006 0.032
## 8 30.689 0.251 0.974 0.000 0.000 0.618 0.009
## 9 38.650 0.000 0.000 0.000 0.000 0.000 0.000
## 10 289.187 0.005 0.002 0.022 0.063 0.015 0.017
## 11 414.511 0.002 0.004 0.013 0.027 0.006 0.035
## 12 489.918 0.004 0.001 0.022 0.059 0.005 0.004
## 13 683.191 0.030 0.007 0.002 0.001 0.002 0.000
## 14 1018.983 0.004 0.006 0.192 0.217 0.004 0.000
## 15 1155.067 0.014 0.006 0.748 0.633 0.000 0.004
## 16 2113.142 0.000 0.000 0.000 0.000 0.006 0.000
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1 0.038 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 2 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 3 0.443 0.002 0.000 0.000 0.000 0.000 0.000 0.000
## 4 0.024 0.004 0.000 0.002 0.000 0.000 0.000 0.000
## 5 0.107 0.054 0.000 0.001 0.000 0.000 0.000 0.000
## 6 0.151 0.852 0.000 0.002 0.000 0.000 0.000 0.000
## 7 0.013 0.009 0.000 0.933 0.000 0.000 0.000 0.000
## 8 0.001 0.016 0.000 0.000 0.000 0.000 0.000 0.000
## 9 0.001 0.016 0.969 0.033 0.000 0.000 0.000 0.000
## 10 0.112 0.000 0.001 0.004 0.024 0.013 0.000 0.004
## 11 0.005 0.005 0.003 0.002 0.110 0.079 0.000 0.004
## 12 0.040 0.012 0.006 0.009 0.010 0.006 0.000 0.023
## 13 0.044 0.024 0.000 0.007 0.005 0.001 0.000 0.949
## 14 0.000 0.005 0.014 0.001 0.596 0.588 0.000 0.012
## 15 0.002 0.001 0.007 0.000 0.246 0.313 0.000 0.001
## 16 0.020 0.000 0.000 0.006 0.009 0.000 1.000 0.008
## Expend Grad.Rate
## 1 0.150 0.000
## 2 0.021 0.000
## 3 0.479 0.000
## 4 0.001 0.000
## 5 0.003 0.000
## 6 0.003 0.000
## 7 0.007 0.000
## 8 0.004 0.000
## 9 0.003 0.000
## 10 0.071 0.026
## 11 0.000 0.112
## 12 0.020 0.719
## 13 0.002 0.136
## 14 0.024 0.006
## 15 0.082 0.000
## 16 0.128 0.000
colldiag(lm.fit.all, scale = TRUE, center = TRUE, add.intercept = FALSE)
## Condition
## Index Variance Decomposition Proportions
## Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1 1.000 0.000 0.000 0.004 0.004 0.000 0.000
## 2 1.161 0.007 0.003 0.000 0.000 0.003 0.018
## 3 2.117 0.001 0.000 0.000 0.000 0.000 0.007
## 4 2.370 0.000 0.000 0.023 0.033 0.000 0.061
## 5 2.389 0.018 0.003 0.001 0.003 0.002 0.002
## 6 2.488 0.000 0.000 0.000 0.000 0.000 0.026
## 7 2.946 0.000 0.000 0.006 0.004 0.000 0.001
## 8 3.008 0.001 0.000 0.008 0.008 0.000 0.256
## 9 3.163 0.013 0.003 0.028 0.049 0.001 0.435
## 10 3.609 0.001 0.000 0.001 0.000 0.000 0.090
## 11 4.096 0.001 0.000 0.000 0.043 0.001 0.014
## 12 4.879 0.014 0.000 0.001 0.007 0.001 0.012
## 13 6.027 0.002 0.000 0.010 0.024 0.001 0.000
## 14 7.118 0.606 0.038 0.124 0.180 0.116 0.027
## 15 8.045 0.201 0.004 0.780 0.628 0.054 0.029
## 16 13.732 0.135 0.947 0.013 0.016 0.820 0.024
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1 0.006 0.007 0.000 0.001 0.004 0.004 0.006 0.007
## 2 0.002 0.000 0.002 0.010 0.002 0.002 0.005 0.004
## 3 0.001 0.011 0.356 0.149 0.006 0.003 0.043 0.010
## 4 0.012 0.154 0.018 0.029 0.010 0.018 0.003 0.028
## 5 0.007 0.037 0.005 0.079 0.063 0.059 0.006 0.000
## 6 0.000 0.013 0.434 0.089 0.004 0.010 0.155 0.002
## 7 0.005 0.042 0.044 0.562 0.000 0.000 0.046 0.024
## 8 0.000 0.050 0.068 0.019 0.000 0.001 0.005 0.515
## 9 0.000 0.057 0.022 0.009 0.014 0.026 0.087 0.024
## 10 0.025 0.116 0.015 0.033 0.010 0.005 0.287 0.229
## 11 0.020 0.201 0.003 0.001 0.001 0.003 0.346 0.032
## 12 0.821 0.290 0.003 0.006 0.000 0.000 0.000 0.080
## 13 0.000 0.008 0.029 0.002 0.843 0.789 0.007 0.008
## 14 0.084 0.006 0.000 0.008 0.002 0.003 0.002 0.024
## 15 0.017 0.000 0.001 0.002 0.038 0.074 0.000 0.004
## 16 0.001 0.006 0.000 0.001 0.000 0.002 0.002 0.009
## Expend Grad.Rate
## 1 0.008 0.009
## 2 0.000 0.001
## 3 0.016 0.017
## 4 0.002 0.013
## 5 0.001 0.061
## 6 0.036 0.024
## 7 0.027 0.269
## 8 0.004 0.001
## 9 0.003 0.001
## 10 0.016 0.468
## 11 0.531 0.084
## 12 0.169 0.038
## 13 0.015 0.008
## 14 0.056 0.005
## 15 0.119 0.002
## 16 0.000 0.000
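Optional cross-check (an aside, not part of the assignment) of the scaling effect using base R's kappa(), which computes the condition number of a matrix directly:
X <- model.matrix(lm.fit.all)[, -1]   # predictor matrix without the intercept column
kappa(X, exact = TRUE)                # raw scales: a very large condition number
kappa(scale(X), exact = TRUE)         # centered and scaled: far smaller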
3.4 Display the lm.fit.all model summary results and the variance inflation factors (VIFs) for the predictors in the model.
summary(lm.fit.all)
##
## Call:
## lm(formula = Apps ~ Accept + Enroll + Top10perc + Top25perc +
## F.Undergrad + P.Undergrad + Outstate + Room.Board + Books +
## Personal + PhD + Terminal + S.F.Ratio + perc.alumni + Expend +
## Grad.Rate, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5060.6 -429.2 -18.0 309.7 7661.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.948e+02 3.916e+02 -2.285 0.02259 *
## Accept 1.591e+00 4.103e-02 38.784 < 2e-16 ***
## Enroll -8.903e-01 1.874e-01 -4.751 2.42e-06 ***
## Top10perc 4.968e+01 5.621e+00 8.838 < 2e-16 ***
## Top25perc -1.438e+01 4.514e+00 -3.185 0.00151 **
## F.Undergrad 7.303e-02 3.267e-02 2.235 0.02570 *
## P.Undergrad 4.990e-02 3.236e-02 1.542 0.12346
## Outstate -1.093e-01 1.804e-02 -6.058 2.17e-09 ***
## Room.Board 1.351e-01 4.846e-02 2.788 0.00544 **
## Books -9.177e-03 2.401e-01 -0.038 0.96952
## Personal 3.147e-02 6.357e-02 0.495 0.62068
## PhD -6.789e+00 4.644e+00 -1.462 0.14417
## Terminal -1.371e+00 5.105e+00 -0.269 0.78834
## S.F.Ratio 2.305e+01 1.293e+01 1.783 0.07506 .
## perc.alumni -1.170e+00 4.117e+00 -0.284 0.77640
## Expend 8.196e-02 1.239e-02 6.614 7.06e-11 ***
## Grad.Rate 8.041e+00 2.967e+00 2.710 0.00687 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1049 on 760 degrees of freedom
## Multiple R-squared: 0.928, Adjusted R-squared: 0.9265
## F-statistic: 612.1 on 16 and 760 DF, p-value: < 2.2e-16
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
vif(lm.fit.all)
## Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 7.126046 21.360983 6.928113 5.630782 17.697719 1.709601
## Outstate Room.Board Books Personal PhD Terminal
## 3.711710 1.990349 1.107372 1.305168 4.051287 3.979967
## S.F.Ratio perc.alumni Expend Grad.Rate
## 1.845581 1.833717 2.950354 1.829794
3.5 Briefly answer: based on your VIF results, is multicollinearity a problem? Why or why not? If so, which variables pose the main problem?
Yes, multicollinearity is a problem for some variables. Enroll (VIF ≈ 21.4) and F.Undergrad (VIF ≈ 17.7) are well above the common cutoff of 10, and Accept (≈ 7.1), Top10perc (≈ 6.9) and Top25perc (≈ 5.6) exceed the stricter cutoff of 5. Enroll and F.Undergrad pose the main problem, which is plausible since full-time undergraduate counts track total enrollment closely.
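As an optional sketch of what a VIF measures, one can regress a single predictor on the remaining predictors in the model and compute 1/(1 - R^2); this should reproduce the vif() value for Enroll (about 21.4).
r2.enroll <- summary(lm(Enroll ~ Accept + Top10perc + Top25perc + F.Undergrad +
                          P.Undergrad + Outstate + Room.Board + Books + Personal +
                          PhD + Terminal + S.F.Ratio + perc.alumni + Expend +
                          Grad.Rate, data = College))$r.squared
1 / (1 - r2.enroll)   # VIF for Enroll, computed by hand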
3.6 Fit a reduced model to predict Apps on Enroll and Top10perc only. Name it lm.fit.reduced. Display the CI (using scale=TRUE, center=TRUE, add.intercept=FALSE), model summary results and the VIF’s.
lm.fit.reduced <-lm(Apps~Enroll+Top10perc, data = College)
colldiag(lm.fit.reduced, scale = TRUE, center = TRUE, add.intercept = FALSE)
## Condition
## Index Variance Decomposition Proportions
## Enroll Top10perc
## 1 1.000 0.409 0.409
## 2 1.201 0.591 0.591
summary(lm.fit.reduced)
##
## Call:
## lm(formula = Apps ~ Enroll + Top10perc, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9468 -666 -115 409 32087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -795.11465 134.15208 -5.927 4.64e-09 ***
## Enroll 3.38249 0.07572 44.671 < 2e-16 ***
## Top10perc 42.03776 3.98841 10.540 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1927 on 774 degrees of freedom
## Multiple R-squared: 0.7526, Adjusted R-squared: 0.752
## F-statistic: 1177 on 2 and 774 DF, p-value: < 2.2e-16
vif(lm.fit.reduced)
## Enroll Top10perc
## 1.033984 1.033984
3.7 Is there a multicollinearity issue in the model above? Why or why not?
No. Both VIFs are about 1.03 (essentially 1) and the largest condition index is only about 1.2, so Enroll and Top10perc are nearly uncorrelated and multicollinearity is not an issue in the reduced model.
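A quick optional check: with only two predictors, VIF = 1/(1 - r^2), so a VIF of about 1.03 implies a correlation of roughly sqrt(1 - 1/1.034), i.e. about 0.18, between the two predictors.
cor(College$Enroll, College$Top10perc)   # should be roughly 0.18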
4.1 Fit a large model with all variables that make sense from a business standpoint: Enroll, Top10perc, Outstate, Room.Board, PhD, S.F.Ratio, Expend and Grad.Rate. Name this model lm.fit.large. Display the model summary results.
lm.fit.large<-lm(Apps~Enroll+Top10perc+Outstate+Room.Board+PhD+S.F.Ratio+Expend+Grad.Rate, College)
summary(lm.fit.large)
##
## Call:
## lm(formula = Apps ~ Enroll + Top10perc + Outstate + Room.Board +
## PhD + S.F.Ratio + Expend + Grad.Rate, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8652.4 -617.6 -114.7 474.0 31465.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.922e+03 5.586e+02 -7.023 4.80e-12 ***
## Enroll 3.416e+00 8.293e-02 41.187 < 2e-16 ***
## Top10perc 1.613e+01 5.689e+00 2.834 0.00471 **
## Outstate -3.656e-02 2.940e-02 -1.243 0.21408
## Room.Board 4.400e-01 8.062e-02 5.458 6.49e-08 ***
## PhD -4.122e+00 5.146e+00 -0.801 0.42336
## S.F.Ratio 4.704e+01 2.229e+01 2.110 0.03519 *
## Expend 9.719e-02 2.104e-02 4.619 4.52e-06 ***
## Grad.Rate 1.493e+01 4.875e+00 3.064 0.00226 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1827 on 768 degrees of freedom
## Multiple R-squared: 0.7796, Adjusted R-squared: 0.7773
## F-statistic: 339.5 on 8 and 768 DF, p-value: < 2.2e-16
4.2 Then, compute the VIF’s for this large model and then conduct an ANOVA F test to evaluate if the larger model has more predictive power than the lm.fit.reduced model above. Provide your brief conclusions about what these two tests are telling you and pick a model based on this analysis.
vif(lm.fit.large)
## Enroll Top10perc Outstate Room.Board PhD S.F.Ratio
## 1.380973 2.342835 3.254419 1.818213 1.642117 1.811435
## Expend Grad.Rate
## 2.808017 1.631037
anova(lm.fit.reduced, lm.fit.large)
## Analysis of Variance Table
##
## Model 1: Apps ~ Enroll + Top10perc
## Model 2: Apps ~ Enroll + Top10perc + Outstate + Room.Board + PhD + S.F.Ratio +
## Expend + Grad.Rate
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 774 2875431853
## 2 768 2562315222 6 313116630 15.642 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The VIFs for lm.fit.large are all below about 3.3, so multicollinearity is not a concern in this larger model. The partial F test (F = 15.64 on 6 and 768 df, p < 2.2e-16) shows that the six additional predictors significantly reduce the residual sum of squares, so the larger model has more predictive power than lm.fit.reduced. Based on these two results, lm.fit.large is the model to pick.
4.3 Best Subset Selection. Fit the same lm.fit.large model above, but this time use the regsubsets(){leaps} function. Store the result in an object named fit.large.sum and display summary(fit.large.sum) to see all 8 models evaluated by regsubsets().
One nice thing about the regsubsets() function is that it provides various fit statistics for all the models tried. In this case the default is 8 models (from 1 to 8 predictors), so summary(fit.large.sum)$rss and summary(fit.large.sum)$adjr2 are two vectors with 8 elements each, containing the RSS and adjusted R-squared for each of the 8 models.
Display these RSS and AdjR2 values as a table by binding the 2 vectors with the cbind() function and naming the two columns “RSS” and “AdjR2” respectively.
library(leaps)
fit.large.sum <- regsubsets(Apps~Enroll+Top10perc+Outstate+Room.Board+PhD+S.F.Ratio+Expend+Grad.Rate, data = College)
summary(fit.large.sum)
## Subset selection object
## Call: regsubsets.formula(Apps ~ Enroll + Top10perc + Outstate + Room.Board +
## PhD + S.F.Ratio + Expend + Grad.Rate, data = College)
## 8 Variables (and intercept)
## Forced in Forced out
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## PhD FALSE FALSE
## S.F.Ratio FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## Enroll Top10perc Outstate Room.Board PhD S.F.Ratio Expend
## 1 ( 1 ) "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " "*"
## 3 ( 1 ) "*" " " " " "*" " " " " "*"
## 4 ( 1 ) "*" " " " " "*" " " " " "*"
## 5 ( 1 ) "*" "*" " " "*" " " " " "*"
## 6 ( 1 ) "*" "*" " " "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## Grad.Rate
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) "*"
## 5 ( 1 ) "*"
## 6 ( 1 ) "*"
## 7 ( 1 ) "*"
## 8 ( 1 ) "*"
ADJR2 = summary(fit.large.sum)$adjr2
RSS = summary(fit.large.sum)$rss
cbind(RSS, ADJR2)
## RSS ADJR2
## [1,] 3288139052 0.7167426
## [2,] 2796443926 0.7587885
## [3,] 2655057929 0.7706877
## [4,] 2608042300 0.7744566
## [5,] 2587933194 0.7759053
## [6,] 2570557773 0.7771208
## [7,] 2564455998 0.7773607
## [8,] 2562315222 0.7772569
4.4 Plot these RSS and AdjR2 side by side. Tip: (1) start with par(mfrow=c(1,2)) to split the display into 1 row and 2 columns; (2) then use the plot() functions with appropriate labels and use the attribute type=“l” to get a line; (3) then reset the display to a single plot with par(mfrow=c(1,1)). Based on your plot, which is the best model? Fit an lm() model with the predictors in your selected best model and display the summary() results.
par(mfrow=c(1,2))
plot(RSS, xlab="Number of Variables", ylab="RSS", type="l")
plot(ADJR2, xlab="Number of Variables", ylab="Adjusted R-squared", type="l")
par(mfrow=c(1,1))
Both curves flatten after about 4 predictors, so the 4-variable model (Enroll, Room.Board, Expend and Grad.Rate) is selected and refit below.
fit.apps <- lm(Apps~Enroll+Room.Board+Expend+Grad.Rate, College)
summary(fit.apps)
##
## Call:
## lm(formula = Apps ~ Enroll + Room.Board + Expend + Grad.Rate,
## data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8151.2 -626.3 -118.1 443.3 31524.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.326e+03 3.232e+02 -10.291 < 2e-16 ***
## Enroll 3.519e+00 7.144e-02 49.253 < 2e-16 ***
## Room.Board 3.742e-01 7.281e-02 5.139 3.5e-07 ***
## Expend 9.192e-02 1.507e-02 6.098 1.7e-09 ***
## Grad.Rate 1.626e+01 4.358e+00 3.731 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1838 on 772 degrees of freedom
## Multiple R-squared: 0.7756, Adjusted R-squared: 0.7745
## F-statistic: 667.1 on 4 and 772 DF, p-value: < 2.2e-16
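An optional programmatic check of the plot-based choice (adjusted R-squared technically peaks at 7 predictors, but the gain beyond 4 is tiny):
which.max(ADJR2)        # size of the model with the highest adjusted R-squared
coef(fit.large.sum, 4)  # coefficients of the best 4-variable subset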
4.5 Let’s try a couple of Stepwise approaches to variable selection using the step(){stats} function. For both approaches, do a stepwise to select the optimal model between lm.fit.reduced and lm.fit.large (tip: the scope=list() functions should have the same scope for both models, from the lower bound model of lm.fit.reduced to the upper bound model of lm.fit.large). Also, in both cases, use direction=“both” (for stepwise) and test=“F” to get p-values for the predictors.
Name the first model lm.step.forward and use the lm.fit.reduced model as the starting base. Name the second model lm.step.back, use the lm.fit.large model as the starting base (Tip: the first approach will start with the reduced model and proceed forward towards the large model, but in a stepwise fashion. The second approach will start with the large model and proceed backwards towards the reduced model, but in a stepwise fashion).
After you model both stepwise approaches, display the summary() for both models.
lm.step.forward <- step(lm.fit.reduced, scope=list(lower= lm.fit.reduced, upper=lm.fit.large), direction="both", test="F")
## Start: AIC=11757.37
## Apps ~ Enroll + Top10perc
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## + Room.Board 1 218577239 2656854614 11698 63.5941 5.497e-15 ***
## + Expend 1 135660161 2739771692 11722 38.2752 9.955e-10 ***
## + Outstate 1 89091480 2786340372 11735 24.7162 8.186e-07 ***
## + Grad.Rate 1 76875307 2798556546 11738 21.2340 4.754e-06 ***
## + S.F.Ratio 1 8369163 2867062690 11757 2.2564 0.1335
## <none> 2875431853 11757
## + PhD 1 6774665 2868657188 11758 1.8255 0.1771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=11697.94
## Apps ~ Enroll + Top10perc + Room.Board
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## + Expend 1 45538227 2611316387 11686 13.4628 0.0002601 ***
## + Grad.Rate 1 22200694 2634653920 11693 6.5052 0.0109479 *
## <none> 2656854614 11698
## + Outstate 1 1265831 2655588783 11700 0.3680 0.5442821
## + S.F.Ratio 1 298859 2656555755 11700 0.0868 0.7683012
## + PhD 1 273803 2656580811 11700 0.0796 0.7779608
## - Room.Board 1 218577239 2875431853 11757 63.5941 5.497e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=11686.51
## Apps ~ Enroll + Top10perc + Room.Board + Expend
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## + Grad.Rate 1 23383194 2587933194 11682 6.9663 0.0084735 **
## + S.F.Ratio 1 15066751 2596249636 11684 4.4743 0.0347280 *
## <none> 2611316387 11686
## + Outstate 1 2092747 2609223640 11688 0.6184 0.4318893
## + PhD 1 1247624 2610068763 11688 0.3685 0.5439792
## - Expend 1 45538227 2656854614 11698 13.4628 0.0002601 ***
## - Room.Board 1 128455305 2739771692 11722 37.9761 1.153e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=11681.52
## Apps ~ Enroll + Top10perc + Room.Board + Expend + Grad.Rate
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## + S.F.Ratio 1 17375420 2570557773 11678 5.2047 0.0227976 *
## + Outstate 1 9759076 2578174118 11681 2.9147 0.0881813 .
## <none> 2587933194 11682
## + PhD 1 1750276 2586182918 11683 0.5211 0.4705834
## - Grad.Rate 1 23383194 2611316387 11686 6.9663 0.0084735 **
## - Expend 1 46720727 2634653920 11693 13.9191 0.0002049 ***
## - Room.Board 1 91424463 2679357657 11706 27.2373 2.316e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=11678.28
## Apps ~ Enroll + Top10perc + Room.Board + Expend + Grad.Rate +
## S.F.Ratio
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2570557773 11678
## + Outstate 1 6101776 2564455998 11678 1.8297 0.176557
## + PhD 1 3083978 2567473795 11679 0.9237 0.336807
## - S.F.Ratio 1 17375420 2587933194 11682 5.2047 0.022798 *
## - Grad.Rate 1 25691863 2596249636 11684 7.6959 0.005669 **
## - Expend 1 63525871 2634083644 11695 19.0289 1.463e-05 ***
## - Room.Board 1 95376286 2665934059 11705 28.5696 1.192e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.step.forward)
##
## Call:
## lm(formula = Apps ~ Enroll + Top10perc + Room.Board + Expend +
## Grad.Rate + S.F.Ratio, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8487.7 -651.9 -112.5 477.6 31538.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.122e+03 5.256e+02 -7.843 1.47e-14 ***
## Enroll 3.422e+00 7.709e-02 44.389 < 2e-16 ***
## Top10perc 1.372e+01 5.408e+00 2.536 0.01141 *
## Room.Board 3.875e-01 7.250e-02 5.345 1.19e-07 ***
## Expend 8.740e-02 2.003e-02 4.362 1.46e-05 ***
## Grad.Rate 1.286e+01 4.637e+00 2.774 0.00567 **
## S.F.Ratio 4.981e+01 2.183e+01 2.281 0.02280 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1827 on 770 degrees of freedom
## Multiple R-squared: 0.7788, Adjusted R-squared: 0.7771
## F-statistic: 452 on 6 and 770 DF, p-value: < 2.2e-16
lm.step.back <- step(lm.fit.large, scope=list(lower= lm.fit.reduced, upper=lm.fit.large), direction="both", test="F")
## Start: AIC=11679.79
## Apps ~ Enroll + Top10perc + Outstate + Room.Board + PhD + S.F.Ratio +
## Expend + Grad.Rate
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - PhD 1 2140775 2564455998 11678 0.6417 0.423361
## - Outstate 1 5158573 2567473795 11679 1.5462 0.214081
## <none> 2562315222 11680
## - S.F.Ratio 1 14851935 2577167157 11682 4.4516 0.035192 *
## - Grad.Rate 1 31313260 2593628482 11687 9.3855 0.002264 **
## - Expend 1 71183114 2633498336 11699 21.3356 4.519e-06 ***
## - Room.Board 1 99401404 2661716627 11707 29.7935 6.490e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=11678.44
## Apps ~ Enroll + Top10perc + Outstate + Room.Board + S.F.Ratio +
## Expend + Grad.Rate
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - Outstate 1 6101776 2570557773 11678 1.8297 0.176557
## <none> 2564455998 11678
## + PhD 1 2140775 2562315222 11680 0.6417 0.423361
## - S.F.Ratio 1 13718120 2578174118 11681 4.1136 0.042883 *
## - Grad.Rate 1 31183379 2595639376 11686 9.3509 0.002306 **
## - Expend 1 69627645 2634083643 11697 20.8791 5.697e-06 ***
## - Room.Board 1 97587525 2662043522 11706 29.2634 8.443e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=11678.28
## Apps ~ Enroll + Top10perc + Room.Board + S.F.Ratio + Expend +
## Grad.Rate
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2570557773 11678
## + Outstate 1 6101776 2564455998 11678 1.8297 0.176557
## + PhD 1 3083978 2567473795 11679 0.9237 0.336807
## - S.F.Ratio 1 17375420 2587933194 11682 5.2047 0.022798 *
## - Grad.Rate 1 25691863 2596249636 11684 7.6959 0.005669 **
## - Expend 1 63525871 2634083644 11695 19.0289 1.463e-05 ***
## - Room.Board 1 95376286 2665934059 11705 28.5696 1.192e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.step.back)
##
## Call:
## lm(formula = Apps ~ Enroll + Top10perc + Room.Board + S.F.Ratio +
## Expend + Grad.Rate, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8487.7 -651.9 -112.5 477.6 31538.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.122e+03 5.256e+02 -7.843 1.47e-14 ***
## Enroll 3.422e+00 7.709e-02 44.389 < 2e-16 ***
## Top10perc 1.372e+01 5.408e+00 2.536 0.01141 *
## Room.Board 3.875e-01 7.250e-02 5.345 1.19e-07 ***
## S.F.Ratio 4.981e+01 2.183e+01 2.281 0.02280 *
## Expend 8.740e-02 2.003e-02 4.362 1.46e-05 ***
## Grad.Rate 1.286e+01 4.637e+00 2.774 0.00567 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1827 on 770 degrees of freedom
## Multiple R-squared: 0.7788, Adjusted R-squared: 0.7771
## F-statistic: 452 on 6 and 770 DF, p-value: < 2.2e-16
4.6 Compare the two stepwise results above. Is there any difference? Also, compare your stepwise model selection with the model selected above in 4.3 using regsubsets(). Are the models different? Which one would you pick? Is there an additional test to select the best of these models (no need to run the test, just answer the question)?
The two stepwise models are identical: both end at Apps ~ Enroll + Top10perc + Room.Board + S.F.Ratio + Expend + Grad.Rate with an adjusted R-squared of 0.7771. The 4-variable model chosen from regsubsets() above drops Top10perc and S.F.Ratio but has a very similar adjusted R-squared (0.7745), so either would be a reasonable pick; the smaller model is attractive for its simplicity. Because the 4-variable model is nested in the 6-variable stepwise model, a partial F test with anova() (or a comparison of cross-validated test MSEs) could be used to choose between them.
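A sketch of that additional test (the 4-variable fit.apps model from 4.4 is nested in the 6-variable stepwise model, so a partial F test applies):
anova(fit.apps, lm.step.forward)   # tests whether Top10perc and S.F.Ratio add predictive power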
5.1 Ridge Regression: in this section you will be using the glmnet{glmnet} and other related {glmnet} functions to model and fit a Ridge regression. First, create an x predictor matrix with all the predictors you used in lm.fit.large above, and a y vector with the Apps response variable. Then fit a Ridge regression on x and y. Name the resulting object ridge.mod and display it.
[For you only]: The ridge.mod display will be quite long because it will generate Ridge regressions with 100 different values of lambda. Think but don’t answer yet: look at how the deviance varies as the value of lambda reduces. The first regression with a very high lambda should be close to an OLS regression with no variables, just the intercept. This is called the “null” model. The last regression with a very small lambda should be close to an OLS regression with all predictors. This is called the “full” model. The trick here is to find the optimal lambda that minimizes the deviance. Can you spot the lambda that gives you the smallest deviance (i.e., the largest %Dev)?
5.2 Using the cv.glmnet(){glmnet} function, compute the cross-validation statistics for the ridge.mod model above (tip: you need to use x and y per above, not ridge.mod). Store the results in an object named ridge.cv.
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
x=model.matrix(Apps~Enroll+Top10perc+Outstate+Room.Board+PhD+S.F.Ratio+Expend+Grad.Rate, College)
y=College$Apps
ridge.mod=glmnet(x,y,alpha=0)
ridge.mod
##
## Call: glmnet(x = x, y = y, alpha = 0)
##
## Df %Dev Lambda
## [1,] 8 2.247e-36 3275000.0
## [2,] 8 2.873e-03 2984000.0
## [3,] 8 3.152e-03 2719000.0
## [4,] 8 3.458e-03 2478000.0
## [5,] 8 3.793e-03 2258000.0
## [6,] 8 4.161e-03 2057000.0
## [7,] 8 4.565e-03 1874000.0
## [8,] 8 5.007e-03 1708000.0
## [9,] 8 5.492e-03 1556000.0
## [10,] 8 6.023e-03 1418000.0
## [11,] 8 6.606e-03 1292000.0
## [12,] 8 7.244e-03 1177000.0
## [13,] 8 7.943e-03 1073000.0
## [14,] 8 8.709e-03 977200.0
## [15,] 8 9.549e-03 890400.0
## [16,] 8 1.047e-02 811300.0
## [17,] 8 1.147e-02 739200.0
## [18,] 8 1.257e-02 673600.0
## [19,] 8 1.378e-02 613700.0
## [20,] 8 1.510e-02 559200.0
## [21,] 8 1.654e-02 509500.0
## [22,] 8 1.812e-02 464300.0
## [23,] 8 1.984e-02 423000.0
## [24,] 8 2.172e-02 385400.0
## [25,] 8 2.378e-02 351200.0
## [26,] 8 2.602e-02 320000.0
## [27,] 8 2.847e-02 291600.0
## [28,] 8 3.114e-02 265700.0
## [29,] 8 3.404e-02 242100.0
## [30,] 8 3.721e-02 220600.0
## [31,] 8 4.066e-02 201000.0
## [32,] 8 4.441e-02 183100.0
## [33,] 8 4.848e-02 166800.0
## [34,] 8 5.290e-02 152000.0
## [35,] 8 5.769e-02 138500.0
## [36,] 8 6.288e-02 126200.0
## [37,] 8 6.850e-02 115000.0
## [38,] 8 7.457e-02 104800.0
## [39,] 8 8.113e-02 95480.0
## [40,] 8 8.819e-02 86990.0
## [41,] 8 9.579e-02 79270.0
## [42,] 8 1.040e-01 72220.0
## [43,] 8 1.127e-01 65810.0
## [44,] 8 1.221e-01 59960.0
## [45,] 8 1.321e-01 54630.0
## [46,] 8 1.428e-01 49780.0
## [47,] 8 1.542e-01 45360.0
## [48,] 8 1.662e-01 41330.0
## [49,] 8 1.790e-01 37660.0
## [50,] 8 1.924e-01 34310.0
## [51,] 8 2.066e-01 31260.0
## [52,] 8 2.215e-01 28490.0
## [53,] 8 2.371e-01 25960.0
## [54,] 8 2.533e-01 23650.0
## [55,] 8 2.702e-01 21550.0
## [56,] 8 2.876e-01 19630.0
## [57,] 8 3.057e-01 17890.0
## [58,] 8 3.242e-01 16300.0
## [59,] 8 3.431e-01 14850.0
## [60,] 8 3.624e-01 13530.0
## [61,] 8 3.820e-01 12330.0
## [62,] 8 4.017e-01 11240.0
## [63,] 8 4.216e-01 10240.0
## [64,] 8 4.415e-01 9328.0
## [65,] 8 4.613e-01 8499.0
## [66,] 8 4.808e-01 7744.0
## [67,] 8 5.002e-01 7056.0
## [68,] 8 5.191e-01 6429.0
## [69,] 8 5.375e-01 5858.0
## [70,] 8 5.555e-01 5338.0
## [71,] 8 5.728e-01 4864.0
## [72,] 8 5.894e-01 4432.0
## [73,] 8 6.053e-01 4038.0
## [74,] 8 6.204e-01 3679.0
## [75,] 8 6.347e-01 3352.0
## [76,] 8 6.482e-01 3055.0
## [77,] 8 6.608e-01 2783.0
## [78,] 8 6.726e-01 2536.0
## [79,] 8 6.836e-01 2311.0
## [80,] 8 6.937e-01 2105.0
## [81,] 8 7.030e-01 1918.0
## [82,] 8 7.115e-01 1748.0
## [83,] 8 7.193e-01 1593.0
## [84,] 8 7.263e-01 1451.0
## [85,] 8 7.327e-01 1322.0
## [86,] 8 7.385e-01 1205.0
## [87,] 8 7.436e-01 1098.0
## [88,] 8 7.482e-01 1000.0
## [89,] 8 7.523e-01 911.4
## [90,] 8 7.559e-01 830.4
## [91,] 8 7.591e-01 756.6
## [92,] 8 7.619e-01 689.4
## [93,] 8 7.644e-01 628.2
## [94,] 8 7.665e-01 572.4
## [95,] 8 7.684e-01 521.5
## [96,] 8 7.700e-01 475.2
## [97,] 8 7.714e-01 433.0
## [98,] 8 7.726e-01 394.5
## [99,] 8 7.737e-01 359.5
## [100,] 8 7.746e-01 327.5
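An optional visual companion to the lambda discussion in 5.1: plotting the coefficient paths shows all coefficients shrunk toward zero at very large lambda (the “null” end) and approaching the OLS values as lambda becomes small (the “full” end).
plot(ridge.mod, xvar = "lambda", label = TRUE)   # ridge coefficient paths versus log(lambda)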
5.3 Plot the ridge.cv object
ridge.cv = cv.glmnet(x,y, alpha = 0)
plot(ridge.cv)
5.4 Find the best lambda and store it in an object named bestlam. Display the bestlam object.
bestlam = ridge.cv$lambda.min
bestlam
## [1] 359.4596
5.5 Find the coefficients for a Ridge regression using the best lambda you just found using the keywords type=“coefficients” and s=bestlam.
coef(ridge.cv, s = bestlam)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -3.953459e+03
## (Intercept) .
## Enroll 3.065774e+00
## Top10perc 1.826907e+01
## Outstate -3.808982e-02
## Room.Board 3.840698e-01
## PhD 4.193003e+00
## S.F.Ratio 5.443065e+01
## Expend 8.933287e-02
## Grad.Rate 1.298359e+01
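An optional side-by-side sketch of the shrinkage, assuming the x matrix built above (which keeps the model.matrix() intercept column, hence the second, empty intercept row in the output): compare the ridge coefficients at bestlam with the OLS coefficients of lm.fit.large. Ridge shrinks the overall fit, although individual coefficients of correlated predictors can move in either direction.
ridge.coefs <- as.matrix(coef(ridge.cv, s = bestlam))[-2, 1]    # drop the duplicate intercept row
round(cbind(OLS = coef(lm.fit.large), Ridge = ridge.coefs), 4)  # OLS vs. ridge estimates at bestlam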
6.1 Principal Components Regression (PCR): we will use the pcr(){pls} function and the College{ISLR} data set for this exercise. First, set the seed to 1 (or any number you wish) to get replicable results. Then Split the data into a training subset of 70% of the data. Tip - you can use this function: train=sample(1:nrow(College), 0.7*nrow(College)), which will provide a vector named train with the indices for the training records (can you see why?).
The sample() call draws 70% of the row numbers of College at random (without replacement), so train is simply a vector of row indices: College[train, ] returns the training records and College[-train, ] returns the held-out test records. Dimension reduction is handled later by PCR itself, which replaces the original predictors with a smaller number of principal components.
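A minimal illustration of the tip, using a temporary index vector (idx is a hypothetical name used here only so the train object created with the 6.3 code below is left untouched):
set.seed(1)
idx <- sample(1:nrow(College), 0.7 * nrow(College))       # same call as in the tip
c(n.train = length(idx), n.test = nrow(College) - length(idx))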
6.2 Fit a PCR model to predict college applications as a function of all remaining variables as predictors, using the training set (i.e., College[train,]). Store the results in an object named pcr.fit. Display the results summary for pcr.fit.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
pcr.fit=pcr(Apps~., data=College[train,],scale=TRUE,validation="CV")
summary(pcr.fit)
## Data: X dimension: 317 17
## Y dimension: 317 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 2952 2972 1431 1355 1361 1263 1139
## adjCV 2952 2972 1429 1350 1359 1229 1130
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1108 1103 1069 1051 1051 1057 1078
## adjCV 1103 1097 1058 1043 1047 1053 1073
## 14 comps 15 comps 16 comps 17 comps
## CV 1071 1064 912.6 902.3
## adjCV 1066 1071 908.0 898.1
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 32.3135 56.33 63.68 70.12 75.32 80.28 84.31
## Apps 0.5769 77.75 80.14 80.14 85.62 86.97 87.53
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 87.38 90.18 92.70 95.07 97.01 98.29 98.95
## Apps 87.73 88.64 88.71 88.76 88.84 88.84 89.01
## 15 comps 16 comps 17 comps
## X 99.42 99.85 100.00
## Apps 89.49 91.88 92.14
6.3 Display a scree plot for the MSE of pcr.fit using the validationplot(){pls} function. How many components would you select? Naturally, the MSE goes down as we add more components to the model. The key is to find the minimal number of components that provide an acceptable MSE. In other words, where is the “elbow”? The RMSEP drops sharply over the first few components and flattens after about 5, so the elbow is at roughly 5 components and 5 is a reasonable choice.
set.seed(1)
train=sample(1:nrow(College), 0.7*nrow(College))
set.seed(1)
pcr.fit=pcr(Apps~., data=College[train,],scale=TRUE,validation="CV")
validationplot(pcr.fit,val.type="RMSEP")
6.4 Regardless of your answer in 3 above, compute predicted values for the test data (tip: use College[-train,] for this) using 5 components.
6.5 Compute the MSE for the test data.
library(ISLR)
x=College$Apps
mse.test <- mean((x-predict(pcr.fit,College))[-train]^2)
mse.test
## [1] 2573802
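A more explicit optional sketch of 6.4/6.5, restricting the prediction to the 5 components requested and to the held-out rows only (assuming pcr.fit and train as refit in 6.3 above); the value may differ from the one shown above:
pred5 <- predict(pcr.fit, newdata = College[-train, ], ncomp = 5)   # 5-component predictions for the test rows
mean((College$Apps[-train] - pred5)^2)                              # test MSE based on 5 components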
6.6 Now fit a model using the full data set and 5 components. Store the results in an object named pcr.fit.5. Display the summary results for pcr.fit.5, its coefficients and loadings.
pcr.fit.5=pcr(Apps~., data=College, scale=TRUE, ncomp=5)
summary(pcr.fit.5)
## Data: X dimension: 777 17
## Y dimension: 777 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 31.670 57.30 64.30 69.90 75.39
## Apps 2.316 73.06 73.07 82.08 84.08
coefficients(pcr.fit.5)
## , , 5 comps
##
## Apps
## PrivateYes -180.78762
## Accept 1151.27243
## Enroll 1080.76811
## Top10perc 216.38940
## Top25perc 227.00817
## F.Undergrad 1001.26909
## P.Undergrad 390.05707
## Outstate 157.61035
## Room.Board 305.73454
## Books 89.71071
## Personal -114.07529
## PhD -156.09983
## Terminal -170.91013
## S.F.Ratio 62.28650
## perc.alumni -26.59023
## Expend 181.07726
## Grad.Rate 474.54767
loadings(pcr.fit.5)
##
## Loadings:
## Comp 1 Comp 2 Comp 3 Comp 4 Comp 5
## PrivateYes -0.203 0.320 -0.149 -0.207
## Accept -0.419 -0.362 0.112
## Enroll -0.443 -0.250 0.175
## Top10perc -0.345 -0.130 0.221 0.332
## Top25perc -0.319 -0.161 0.252 0.344
## F.Undergrad -0.448 -0.202 0.136
## P.Undergrad 0.117 -0.297 -0.151 -0.107 -0.303
## Outstate -0.374 -0.203 -0.138
## Room.Board -0.283 -0.181 -0.383 -0.413
## Books -0.663 0.111 0.131
## Personal 0.133 -0.174 -0.472 0.334
## PhD -0.247 -0.255 0.205 0.346 -0.355
## Terminal -0.252 -0.243 0.143 0.319 -0.409
## S.F.Ratio 0.268 -0.125 0.292
## perc.alumni -0.290 0.160 0.212
## Expend -0.337 -0.217
## Grad.Rate -0.297 0.174 -0.240 0.250
##
## Comp 1 Comp 2 Comp 3 Comp 4 Comp 5
## SS loadings 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.059 0.059 0.059 0.059 0.059
## Cumulative Var 0.059 0.118 0.176 0.235 0.294
6.7 Partial Least Squares (PLS): Fitting a PLS model is identical to fitting a PCR model, except that you need to use the plsr() function instead of the pcr() function. Copy the entire PCR script portion above into another R code segment, then replace pcr() with plsr() to fit a PLS model. You should also rename all the objects in your script (e.g., rename pcr.fit to pls.fit, etc.)
set.seed(1)
plsr.fit=plsr(Apps~., data=College[train,],scale=TRUE,validation="CV")
summary(plsr.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4149 2023 1624 1574 1483 1352 1242
## adjCV 4149 2020 1614 1577 1466 1336 1231
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1239 1233 1230 1228 1225 1226 1225
## adjCV 1228 1223 1219 1217 1215 1216 1215
## 14 comps 15 comps 16 comps 17 comps
## CV 1225 1225 1225 1225
## adjCV 1215 1215 1215 1215
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 26.58 34.50 62.81 65.71 70.16 73.53 77.38
## Apps 77.24 86.74 87.74 91.19 92.69 93.39 93.43
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 80.12 83.16 86.62 89.35 90.74 92.26 93.88
## Apps 93.49 93.52 93.54 93.55 93.56 93.57 93.57
## 15 comps 16 comps 17 comps
## X 96.08 98.25 100.00
## Apps 93.57 93.57 93.57
validationplot(plsr.fit,val.type="RMSEP")
mse.test <- mean((x-predict(plsr.fit,College))[-train]^2)
mse.test
## [1] 1252240
plsr.fit.5=plsr(Apps~., data=College, scale=TRUE, ncomp=5)
summary(plsr.fit.5)
## Data: X dimension: 777 17
## Y dimension: 777 1
## Fit method: kernelpls
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 25.76 40.33 62.59 64.97 66.87
## Apps 78.01 85.14 87.67 90.73 92.63
coefficients(plsr.fit.5)
## , , 5 comps
##
## Apps
## PrivateYes -189.85988
## Accept 3732.88397
## Enroll -100.37277
## Top10perc 601.62345
## Top25perc -128.78222
## F.Undergrad -311.36808
## P.Undergrad 85.34312
## Outstate -305.43943
## Room.Board 195.77152
## Books 29.32930
## Personal 49.15994
## PhD -69.26201
## Terminal -99.68608
## S.F.Ratio 153.60303
## perc.alumni 32.52532
## Expend 516.49285
## Grad.Rate 105.61206
loadings(plsr.fit.5)
##
## Loadings:
## Comp 1 Comp 2 Comp 3 Comp 4 Comp 5
## PrivateYes -0.257 -0.161 0.326 -0.268 0.197
## Accept 0.432 0.331 0.352 0.236
## Enroll 0.441 0.300 -0.140 -0.189
## Top10perc 0.210 -0.452 0.282
## Top25perc 0.232 -0.449 0.233 -0.112
## F.Undergrad 0.436 0.278 -0.255 -0.218
## P.Undergrad 0.257 -0.250 -0.835 0.773
## Outstate -0.381 0.412 -0.187
## Room.Board -0.229 0.319 -0.233 0.151
## Books 0.142
## Personal 0.128 -0.245 0.244
## PhD 0.292 -0.585 0.332
## Terminal 0.282 -0.594 0.281
## S.F.Ratio 0.282 -0.316 0.170
## perc.alumni -0.400 0.286 -0.389 0.396
## Expend 0.144 -0.362 0.333 0.257 -0.146
## Grad.Rate -0.192 0.359 -0.542 0.192
##
## Comp 1 Comp 2 Comp 3 Comp 4 Comp 5
## SS loadings 1.026 2.019 1.075 1.844 1.090
## Proportion Var 0.060 0.119 0.063 0.108 0.064
## Cumulative Var 0.060 0.179 0.242 0.351 0.415
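As a closing optional sketch, the held-out MSEs of the two approaches can be put side by side using 5 components for each (computed on the test rows only, so the numbers may differ from the values shown earlier):
c(PCR = mean((College$Apps[-train] - predict(pcr.fit,  College[-train, ], ncomp = 5))^2),
  PLS = mean((College$Apps[-train] - predict(plsr.fit, College[-train, ], ncomp = 5))^2))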