Learning Outcomes measured in this assignment: LO1 to LO8
Content knowledge you’ll gain from doing this assignment: cross validation, build a linear model using subset selection, and communicate your results to a non-technical audience.
For this assignment, you can work alone or in a group of up to 3 people.
Submission: You have two options. Please choose as you wish.
If you choose to publish in RPubs, share your link in your assignment submission.
| R code | Decision/Why | Communication of findings | |
|---|---|---|---|
| Percentage of Assigned Points | 30% | 35% | 35% |
Decision/why?: Explain your reasoning behind your choice of the procedure, set of variables and such for the question.
Communication of your findings: Explain your results in terms of training MSE, testing MSE, and prediction of the variable Y
For this data set, we will use the data set College in R package ISLR. You can get more information here https://cran.r-project.org/web/packages/ISLR/ISLR.pdf
We are trying to predict number of applications received by colleges.
Variables are as follows:
Private: A factor with levels No and Yes indicating private or public universityApps: Number of applications receivedAccept: Number of applications acceptedEnroll: Number of new students enrolledTop10perc: Pct. new students from top 10% of H.S. classTop25perc: Pct. new students from top 25% of H.S. classF.Undergrad: Number of full time undergraduatesP.Undergrad: Number of part time undergraduatesOutstate: Out-of-state tuitionRoom.Board: Room and board costsBooks: Estimated book costsPersonal: Estimated personal spendingPhD: Pct. of faculty with Ph.D.’sTerminal: Pct. of faculty with terminal degreeS.F.Ratio: Student/faculty ratioperc.alumni: Pct. alumni who donateExpend: Instructional expenditure per studentGrad.Rate: Graduation rate(10 points) Please separate the testing and training sets:
set a seed: seed should be a fucntion of your birth date, i.e., if you are born in January 18, 2022, use the seed 18.
separate 40% of your data into testing set
head(College)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
Let’s see what the data looks like for number of applications and acceptance
plot(College[,"Apps"], College[,"Accept"], col="red")
Spiting Data
set.seed(20)
split = sample.split(College$Apps, SplitRatio = 0.6)
train_College = subset(College, split == TRUE)
test_College = subset(College, split == FALSE)
c(nrow(College), nrow(train_College), nrow(test_College))
## [1] 777 466 311
Scaling becomes necessary since the number of applications ranges from 400 to 19000, the model might ignore lower values if this isn’t done. Also, across all predictors, values range from 52 to 15,000.
train_College[, 2:18] = scale(train_College[, 2:18])
test_College[, 2:18] = scale(test_College[, 2:18])
For questions 2 to 4 use training set
(15 points) Model step: Consider 2 multiple linear regression models where Apps is the dependent variable. Please use training data while building the models.
regressor_1 = lm(formula = Apps~.,
data = train_College)
summary(regressor_1)
##
## Call:
## lm(formula = Apps ~ ., data = train_College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78523 -0.13091 -0.01563 0.08961 2.09963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1135306 0.0377736 3.006 0.002800 **
## PrivateYes -0.1542427 0.0480066 -3.213 0.001409 **
## Accept 0.8380028 0.0404797 20.702 < 2e-16 ***
## Enroll -0.0720311 0.0643853 -1.119 0.263846
## Top10perc 0.2528741 0.0357785 7.068 6.07e-12 ***
## Top25perc -0.1016865 0.0330199 -3.080 0.002201 **
## F.Undergrad 0.1225032 0.0543262 2.255 0.024618 *
## P.Undergrad 0.0213566 0.0169101 1.263 0.207264
## Outstate -0.0285570 0.0278478 -1.025 0.305696
## Room.Board 0.0396771 0.0190181 2.086 0.037518 *
## Books 0.0001919 0.0140273 0.014 0.989092
## Personal 0.0054131 0.0154085 0.351 0.725524
## PhD -0.0563640 0.0272693 -2.067 0.039314 *
## Terminal -0.0061314 0.0268102 -0.229 0.819209
## S.F.Ratio -0.0060064 0.0181261 -0.331 0.740521
## perc.alumni -0.0268909 0.0190078 -1.415 0.157843
## Expend 0.0813800 0.0229297 3.549 0.000427 ***
## Grad.Rate 0.0614625 0.0187591 3.276 0.001133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2882 on 448 degrees of freedom
## Multiple R-squared: 0.92, Adjusted R-squared: 0.9169
## F-statistic: 302.9 on 17 and 448 DF, p-value: < 2.2e-16
regressor_2 = lm(formula = Apps~S.F.Ratio+PhD+Grad.Rate,
data = train_College)
summary(regressor_2)
##
## Call:
## lm(formula = Apps ~ S.F.Ratio + PhD + Grad.Rate, data = train_College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2343 -0.5334 -0.2291 0.2134 4.9977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.218e-16 4.192e-02 0.000 1.00000
## S.F.Ratio 1.384e-01 4.485e-02 3.087 0.00215 **
## PhD 4.260e-01 4.445e-02 9.583 < 2e-16 ***
## Grad.Rate 2.960e-02 4.667e-02 0.634 0.52625
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.905 on 462 degrees of freedom
## Multiple R-squared: 0.1863, Adjusted R-squared: 0.1811
## F-statistic: 35.27 on 3 and 462 DF, p-value: < 2.2e-16
From the model summary, it’s important to note that the lm function automatically removed the second categorical value of the Private variable, so as to not fall into the multicollinearity trap (linear dependency between predictors). Multicollinearity occurs when one of the predictor variable can predict, determine or have some linear dependency on another predictor variable, and in this case Private(Yes) = 1- Private(No).
The first model shows that some of the predictors have a linear relationship with the Number of Applications, the predictors like Private colleges (Yes), Acceptance Rate, Top 10%, Top 25%, Full-Time Undergrad, Room and board costs, Pct. of faculty with Ph.D.’s, Instructional expenditure per student, and Graduation Rate. Generally, since the p-value of both model is less than .05, we cannot conclude which is a better model, until we perform a prediction accuracy test
Also, the adjusted R-squared for the first model is high, which indicates that the model has a better goodness of fit.
regressor_fivecv1 = glm(formula = Apps~.,
data = train_College)
summary(regressor_fivecv1)
##
## Call:
## glm(formula = Apps ~ ., data = train_College)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.78523 -0.13091 -0.01563 0.08961 2.09963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1135306 0.0377736 3.006 0.002800 **
## PrivateYes -0.1542427 0.0480066 -3.213 0.001409 **
## Accept 0.8380028 0.0404797 20.702 < 2e-16 ***
## Enroll -0.0720311 0.0643853 -1.119 0.263846
## Top10perc 0.2528741 0.0357785 7.068 6.07e-12 ***
## Top25perc -0.1016865 0.0330199 -3.080 0.002201 **
## F.Undergrad 0.1225032 0.0543262 2.255 0.024618 *
## P.Undergrad 0.0213566 0.0169101 1.263 0.207264
## Outstate -0.0285570 0.0278478 -1.025 0.305696
## Room.Board 0.0396771 0.0190181 2.086 0.037518 *
## Books 0.0001919 0.0140273 0.014 0.989092
## Personal 0.0054131 0.0154085 0.351 0.725524
## PhD -0.0563640 0.0272693 -2.067 0.039314 *
## Terminal -0.0061314 0.0268102 -0.229 0.819209
## S.F.Ratio -0.0060064 0.0181261 -0.331 0.740521
## perc.alumni -0.0268909 0.0190078 -1.415 0.157843
## Expend 0.0813800 0.0229297 3.549 0.000427 ***
## Grad.Rate 0.0614625 0.0187591 3.276 0.001133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.08307032)
##
## Null deviance: 465.000 on 465 degrees of freedom
## Residual deviance: 37.216 on 448 degrees of freedom
## AIC: 182.65
##
## Number of Fisher Scoring iterations: 2
regressor_fivecv2 = glm(formula = Apps~S.F.Ratio+PhD+Grad.Rate,
data = train_College)
summary(regressor_fivecv2)
##
## Call:
## glm(formula = Apps ~ S.F.Ratio + PhD + Grad.Rate, data = train_College)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2343 -0.5334 -0.2291 0.2134 4.9977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.218e-16 4.192e-02 0.000 1.00000
## S.F.Ratio 1.384e-01 4.485e-02 3.087 0.00215 **
## PhD 4.260e-01 4.445e-02 9.583 < 2e-16 ***
## Grad.Rate 2.960e-02 4.667e-02 0.634 0.52625
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.8189429)
##
## Null deviance: 465.00 on 465 degrees of freedom
## Residual deviance: 378.35 on 462 degrees of freedom
## AIC: 1235.4
##
## Number of Fisher Scoring iterations: 2
fivecv1<-cv.glm(train_College, regressor_fivecv1, K=5)
fivecv1$delta[1]
## [1] 0.09304838
fivecv2<-cv.glm(train_College, regressor_fivecv2, K=5)
fivecv2$delta[1]
## [1] 0.8206093
tencv1<-cv.glm(train_College, regressor_fivecv1, K=10)
tencv1$delta[1]
## [1] 0.0977646
tencv2<-cv.glm(train_College, regressor_fivecv2, K=10)
tencv2$delta[1]
## [1] 0.8256012
Based on the cross-validation techniques, the first model with all predictor variables is clearly the better model. It has the lesser scores of 0.09304838 and 0.0977646 for both 5-fold and 10-fold respectively.
best_sub=regsubsets(Apps~.,train_College, nvmax=20)
summary(best_sub)
## Subset selection object
## Call: regsubsets.formula(Apps ~ ., train_College, nvmax = 20)
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: exhaustive
## PrivateYes Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " " " "
## 3 ( 1 ) " " "*" " " "*" "*" " " " "
## 4 ( 1 ) "*" "*" " " "*" " " " " " "
## 5 ( 1 ) "*" "*" " " "*" " " " " " "
## 6 ( 1 ) "*" "*" " " "*" "*" " " " "
## 7 ( 1 ) "*" "*" " " "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 9 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 11 ( 1 ) "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " "*" " " " " " "
## 6 ( 1 ) " " " " " " " " "*" " " " " " "
## 7 ( 1 ) " " " " " " " " "*" " " " " " "
## 8 ( 1 ) " " " " " " " " "*" " " " " " "
## 9 ( 1 ) " " "*" " " " " "*" " " " " " "
## 10 ( 1 ) " " "*" " " " " "*" " " " " "*"
## 11 ( 1 ) " " "*" " " " " "*" " " " " "*"
## 12 ( 1 ) " " "*" " " " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" " " "*" "*" " " " " "*"
## 15 ( 1 ) "*" "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## Expend Grad.Rate
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
## 14 ( 1 ) "*" "*"
## 15 ( 1 ) "*" "*"
## 16 ( 1 ) "*" "*"
## 17 ( 1 ) "*" "*"
bsub_summary=summary(best_sub)
which.max(bsub_summary$adjr2)
## [1] 13
which.min(bsub_summary$cp)
## [1] 10
which.min(bsub_summary$bic)
## [1] 6
plot(best_sub ,scale ="bic")
coef(best_sub ,6)
## (Intercept) PrivateYes Accept Top10perc Top25perc PhD
## 0.11851044 -0.16100835 0.89040471 0.23874422 -0.08042434 -0.04861775
## Expend
## 0.07287606
best_forward=regsubsets(Apps~., train_College, nvmax=20, method="forward")
summary(best_forward)
## Subset selection object
## Call: regsubsets.formula(Apps ~ ., train_College, nvmax = 20, method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
## PrivateYes Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " " " "
## 3 ( 1 ) " " "*" " " "*" "*" " " " "
## 4 ( 1 ) "*" "*" " " "*" "*" " " " "
## 5 ( 1 ) "*" "*" " " "*" "*" " " " "
## 6 ( 1 ) "*" "*" " " "*" "*" " " " "
## 7 ( 1 ) "*" "*" " " "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 9 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 11 ( 1 ) "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " "*" " " " " " "
## 7 ( 1 ) " " " " " " " " "*" " " " " " "
## 8 ( 1 ) " " " " " " " " "*" " " " " " "
## 9 ( 1 ) " " "*" " " " " "*" " " " " " "
## 10 ( 1 ) " " "*" " " " " "*" " " " " "*"
## 11 ( 1 ) " " "*" " " " " "*" " " " " "*"
## 12 ( 1 ) " " "*" " " " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" " " "*" "*" " " " " "*"
## 15 ( 1 ) "*" "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## Expend Grad.Rate
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
## 14 ( 1 ) "*" "*"
## 15 ( 1 ) "*" "*"
## 16 ( 1 ) "*" "*"
## 17 ( 1 ) "*" "*"
bforward_summary=summary(best_forward)
which.max(bforward_summary$adjr2)
## [1] 13
which.min(bforward_summary$cp)
## [1] 10
which.min(bforward_summary$bic)
## [1] 6
plot(best_forward ,scale ="bic")
coef(best_forward ,6)
## (Intercept) PrivateYes Accept Top10perc Top25perc PhD
## 0.11851044 -0.16100835 0.89040471 0.23874422 -0.08042434 -0.04861775
## Expend
## 0.07287606
best_backward=regsubsets(Apps~., train_College, nvmax=20, method="backward")
summary(best_backward)
## Subset selection object
## Call: regsubsets.formula(Apps ~ ., train_College, nvmax = 20, method = "backward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: backward
## PrivateYes Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " " " "
## 4 ( 1 ) "*" "*" " " "*" " " " " " "
## 5 ( 1 ) "*" "*" " " "*" " " " " " "
## 6 ( 1 ) "*" "*" " " "*" "*" " " " "
## 7 ( 1 ) "*" "*" " " "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 9 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 11 ( 1 ) "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " "*" " " " " " "
## 6 ( 1 ) " " " " " " " " "*" " " " " " "
## 7 ( 1 ) " " " " " " " " "*" " " " " " "
## 8 ( 1 ) " " " " " " " " "*" " " " " " "
## 9 ( 1 ) " " "*" " " " " "*" " " " " " "
## 10 ( 1 ) " " "*" " " " " "*" " " " " "*"
## 11 ( 1 ) " " "*" " " " " "*" " " " " "*"
## 12 ( 1 ) " " "*" " " " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" " " "*" "*" " " " " "*"
## 15 ( 1 ) "*" "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## Expend Grad.Rate
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
## 14 ( 1 ) "*" "*"
## 15 ( 1 ) "*" "*"
## 16 ( 1 ) "*" "*"
## 17 ( 1 ) "*" "*"
bbackward_summary=summary(best_backward)
which.max(bbackward_summary$adjr2)
## [1] 13
which.min(bbackward_summary$cp)
## [1] 10
which.min(bbackward_summary$bic)
## [1] 6
plot(best_backward ,scale ="bic")
coef(best_backward ,6)
## (Intercept) PrivateYes Accept Top10perc Top25perc PhD
## 0.11851044 -0.16100835 0.89040471 0.23874422 -0.08042434 -0.04861775
## Expend
## 0.07287606
In all of the subset selection algorithms observed, the 6th model is the most preferred, based on the Bayesian Information Criterion value. This means it must have been undoubtedly a very good choice.
The model has the predictors Private Universities, Acceptance Rate, Top10 %, Top 5%, and the percentage of faculties with PhD.
Apps and calculate the test MSE. Comment on your findings.mlr_6 = lm(formula = Apps~Private+Accept+Top10perc+Top25perc+PhD,
data = train_College)
mlr_pred = predict(mlr_6, newdata=test_College)
num_apps=cbind(obs=seq(1,nrow(test_College)),Applications=test_College[,"Apps"], mlr_pred)
head(num_apps)
## obs Applications mlr_pred
## Abilene Christian University 1 -0.3575197 -0.39506594
## Adelphi University 2 -0.2404481 -0.09585787
## Alaska Pacific University 3 -0.6840292 -0.82797628
## Albertson College 4 -0.5963368 -0.43348760
## Alfred University 5 -0.3414947 -0.24188375
## Allegheny College 6 -0.1367307 0.03193602
For some Applications, it had very good prediction, but for others it had an OK prediction.
mse_mlr6=mean((test_College[,"Apps"]-mlr_pred)^2)
mse_mlr6
## [1] 0.0751811
The MSE score is also low at 0.0751811, which shows it has a very good prediction accuracy.
Apps better, the one from question 5 or the one you calculated in this step? (Note: They may be the same, depending on which you chose)#data matrix
test_matrix=model.matrix(Apps~., data=test_College)
#validated prediction error
vl_errors =rep(NA,17)
for(i in 1:17){
coefi=coef(best_backward ,id=i)
predt=test_matrix[,names(coefi)] %*% coefi
vl_errors[i]=mean((test_College[, "Apps"]-predt)^2)
}
vl_errors
## [1] 0.10005530 0.07773430 0.07738394 0.07290140 0.07128823 0.07115771
## [7] 0.07144513 0.07810564 0.07863283 0.07988556 0.08009753 0.07820260
## [13] 0.07594090 0.07587710 0.07617728 0.07610395 0.07610249
which.min(vl_errors)
## [1] 6
Based on the cross validation methods, the 6th model has again proven to be the best model. Which means it passed both tests of model prediction accuracy.