Cross Validation and Subset Selection Methods

Assignment Guidelines

Purpose:

Learning Outcomes measured in this assignment: LO1 to LO8
Content knowledge you’ll gain from doing this assignment: cross validation, build a linear model using subset selection, and communicate your results to a non-technical audience.

Criteria:

For this assignment, you can work alone or in a group of up to 3 people.
Submission: You have two options. Please choose as you wish.
1. Upload the knitted document on Canvas.
2. Publish your final output in RPubs. https://rpubs.com/about/getting-started
If you choose to publish in RPubs, share your link in your assignment submission.

The grading rubric can be found below:

	R code	Decision/Why	Communication of findings
Percentage of Assigned Points	30%	35%	35%

Decision/why?: Explain your reasoning behind your choice of the procedure, set of variables and such for the question.
- Explain why you use the procedure/model/variable
- To exceed this criterion, describe steps taken to implement the procedure in a non technical way.
Communication of your findings: Explain your results in terms of training MSE, testing MSE, and prediction of the variable Y
- Explain why you think one model is better than the other.
- To exceed this criterion, explain your model and how it predicts the variable of interest in a non technical way.

Data

For this data set, we will use the data set College in R package ISLR. You can get more information here https://cran.r-project.org/web/packages/ISLR/ISLR.pdf

We are trying to predict number of applications received by colleges.

Variables are as follows:

Private: A factor with levels No and Yes indicating private or public university
Apps: Number of applications received
Accept: Number of applications accepted
Enroll: Number of new students enrolled
Top10perc: Pct. new students from top 10% of H.S. class
Top25perc: Pct. new students from top 25% of H.S. class
F.Undergrad: Number of full time undergraduates
P.Undergrad: Number of part time undergraduates
Outstate: Out-of-state tuition
Room.Board: Room and board costs
Books: Estimated book costs
Personal: Estimated personal spending
PhD: Pct. of faculty with Ph.D.’s
Terminal: Pct. of faculty with terminal degree
S.F.Ratio: Student/faculty ratio
perc.alumni: Pct. alumni who donate
Expend: Instructional expenditure per student
Grad.Rate: Graduation rate

Questions

(10 points) Please separate the testing and training sets:
- set a seed: seed should be a fucntion of your birth date, i.e., if you are born in January 18, 2022, use the seed 18.
- separate 40% of your data into testing set

head(College)

##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
## Adrian College                   Yes 1428   1097    336        22        50
## Agnes Scott College              Yes  417    349    137        60        89
## Alaska Pacific University        Yes  193    146     55        16        44
## Albertson College                Yes  587    479    158        38        62
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
## Adrian College                      1036          99    11250       3750   400
## Agnes Scott College                  510          63    12960       5450   450
## Alaska Pacific University            249         869     7560       4120   800
## Albertson College                    678          41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
## Adrian College                   1165  53       66      12.9          30   8735
## Agnes Scott College               875  92       97       7.7          37  19016
## Alaska Pacific University        1500  76       72      11.9           2  10922
## Albertson College                 675  67       73       9.4          11   9727
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
## Adrian College                      54
## Agnes Scott College                 59
## Alaska Pacific University           15
## Albertson College                   55

Let’s see what the data looks like for number of applications and acceptance

plot(College[,"Apps"], College[,"Accept"], col="red")

Spiting Data

set.seed(20)
split = sample.split(College$Apps, SplitRatio = 0.6)
train_College = subset(College, split == TRUE)
test_College = subset(College, split == FALSE)

c(nrow(College), nrow(train_College), nrow(test_College))

## [1] 777 466 311

Scaling becomes necessary since the number of applications ranges from 400 to 19000, the model might ignore lower values if this isn’t done. Also, across all predictors, values range from 52 to 15,000.

train_College[, 2:18] = scale(train_College[, 2:18])
test_College[, 2:18] = scale(test_College[, 2:18])

For questions 2 to 4 use training set

(15 points) Model step: Consider 2 multiple linear regression models where Apps is the dependent variable. Please use training data while building the models.
- Model 1) Use all the rest of the variables.
- Model 2) Use only 3 variables of your choosing.

regressor_1 = lm(formula = Apps~., 
                data = train_College)
summary(regressor_1)

## 
## Call:
## lm(formula = Apps ~ ., data = train_College)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78523 -0.13091 -0.01563  0.08961  2.09963 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1135306  0.0377736   3.006 0.002800 ** 
## PrivateYes  -0.1542427  0.0480066  -3.213 0.001409 ** 
## Accept       0.8380028  0.0404797  20.702  < 2e-16 ***
## Enroll      -0.0720311  0.0643853  -1.119 0.263846    
## Top10perc    0.2528741  0.0357785   7.068 6.07e-12 ***
## Top25perc   -0.1016865  0.0330199  -3.080 0.002201 ** 
## F.Undergrad  0.1225032  0.0543262   2.255 0.024618 *  
## P.Undergrad  0.0213566  0.0169101   1.263 0.207264    
## Outstate    -0.0285570  0.0278478  -1.025 0.305696    
## Room.Board   0.0396771  0.0190181   2.086 0.037518 *  
## Books        0.0001919  0.0140273   0.014 0.989092    
## Personal     0.0054131  0.0154085   0.351 0.725524    
## PhD         -0.0563640  0.0272693  -2.067 0.039314 *  
## Terminal    -0.0061314  0.0268102  -0.229 0.819209    
## S.F.Ratio   -0.0060064  0.0181261  -0.331 0.740521    
## perc.alumni -0.0268909  0.0190078  -1.415 0.157843    
## Expend       0.0813800  0.0229297   3.549 0.000427 ***
## Grad.Rate    0.0614625  0.0187591   3.276 0.001133 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2882 on 448 degrees of freedom
## Multiple R-squared:   0.92,  Adjusted R-squared:  0.9169 
## F-statistic: 302.9 on 17 and 448 DF,  p-value: < 2.2e-16

regressor_2 = lm(formula = Apps~S.F.Ratio+PhD+Grad.Rate, 
                data = train_College)
summary(regressor_2)

## 
## Call:
## lm(formula = Apps ~ S.F.Ratio + PhD + Grad.Rate, data = train_College)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2343 -0.5334 -0.2291  0.2134  4.9977 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.218e-16  4.192e-02   0.000  1.00000    
## S.F.Ratio    1.384e-01  4.485e-02   3.087  0.00215 ** 
## PhD          4.260e-01  4.445e-02   9.583  < 2e-16 ***
## Grad.Rate    2.960e-02  4.667e-02   0.634  0.52625    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.905 on 462 degrees of freedom
## Multiple R-squared:  0.1863, Adjusted R-squared:  0.1811 
## F-statistic: 35.27 on 3 and 462 DF,  p-value: < 2.2e-16

From the model summary, it’s important to note that the lm function automatically removed the second categorical value of the Private variable, so as to not fall into the multicollinearity trap (linear dependency between predictors). Multicollinearity occurs when one of the predictor variable can predict, determine or have some linear dependency on another predictor variable, and in this case Private(Yes) = 1- Private(No).

The first model shows that some of the predictors have a linear relationship with the Number of Applications, the predictors like Private colleges (Yes), Acceptance Rate, Top 10%, Top 25%, Full-Time Undergrad, Room and board costs, Pct. of faculty with Ph.D.’s, Instructional expenditure per student, and Graduation Rate. Generally, since the p-value of both model is less than .05, we cannot conclude which is a better model, until we perform a prediction accuracy test

Also, the adjusted R-squared for the first model is high, which indicates that the model has a better goodness of fit.

(15 points) Using 5-fold and 10-fold cross validation techniques, determine the better of the two models from question 2. Comment on your findings.

regressor_fivecv1 = glm(formula = Apps~., 
                data = train_College)
summary(regressor_fivecv1)

## 
## Call:
## glm(formula = Apps ~ ., data = train_College)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.78523  -0.13091  -0.01563   0.08961   2.09963  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1135306  0.0377736   3.006 0.002800 ** 
## PrivateYes  -0.1542427  0.0480066  -3.213 0.001409 ** 
## Accept       0.8380028  0.0404797  20.702  < 2e-16 ***
## Enroll      -0.0720311  0.0643853  -1.119 0.263846    
## Top10perc    0.2528741  0.0357785   7.068 6.07e-12 ***
## Top25perc   -0.1016865  0.0330199  -3.080 0.002201 ** 
## F.Undergrad  0.1225032  0.0543262   2.255 0.024618 *  
## P.Undergrad  0.0213566  0.0169101   1.263 0.207264    
## Outstate    -0.0285570  0.0278478  -1.025 0.305696    
## Room.Board   0.0396771  0.0190181   2.086 0.037518 *  
## Books        0.0001919  0.0140273   0.014 0.989092    
## Personal     0.0054131  0.0154085   0.351 0.725524    
## PhD         -0.0563640  0.0272693  -2.067 0.039314 *  
## Terminal    -0.0061314  0.0268102  -0.229 0.819209    
## S.F.Ratio   -0.0060064  0.0181261  -0.331 0.740521    
## perc.alumni -0.0268909  0.0190078  -1.415 0.157843    
## Expend       0.0813800  0.0229297   3.549 0.000427 ***
## Grad.Rate    0.0614625  0.0187591   3.276 0.001133 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.08307032)
## 
##     Null deviance: 465.000  on 465  degrees of freedom
## Residual deviance:  37.216  on 448  degrees of freedom
## AIC: 182.65
## 
## Number of Fisher Scoring iterations: 2

regressor_fivecv2 = glm(formula = Apps~S.F.Ratio+PhD+Grad.Rate, 
                data = train_College)
summary(regressor_fivecv2)

## 
## Call:
## glm(formula = Apps ~ S.F.Ratio + PhD + Grad.Rate, data = train_College)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2343  -0.5334  -0.2291   0.2134   4.9977  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.218e-16  4.192e-02   0.000  1.00000    
## S.F.Ratio    1.384e-01  4.485e-02   3.087  0.00215 ** 
## PhD          4.260e-01  4.445e-02   9.583  < 2e-16 ***
## Grad.Rate    2.960e-02  4.667e-02   0.634  0.52625    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.8189429)
## 
##     Null deviance: 465.00  on 465  degrees of freedom
## Residual deviance: 378.35  on 462  degrees of freedom
## AIC: 1235.4
## 
## Number of Fisher Scoring iterations: 2

fivecv1<-cv.glm(train_College, regressor_fivecv1, K=5)
fivecv1$delta[1]

## [1] 0.09304838

fivecv2<-cv.glm(train_College, regressor_fivecv2, K=5)
fivecv2$delta[1]

## [1] 0.8206093

tencv1<-cv.glm(train_College, regressor_fivecv1, K=10)
tencv1$delta[1]

## [1] 0.0977646

tencv2<-cv.glm(train_College, regressor_fivecv2, K=10)
tencv2$delta[1]

## [1] 0.8256012

Based on the cross-validation techniques, the first model with all predictor variables is clearly the better model. It has the lesser scores of 0.09304838 and 0.0977646 for both 5-fold and 10-fold respectively.

(20 points) Apply the three subset selection methods on the training set. Report your best model with respect to BIC criteria, on each of the three methods. Comment on your findings.

best_sub=regsubsets(Apps~.,train_College, nvmax=20)
summary(best_sub)

## Subset selection object
## Call: regsubsets.formula(Apps ~ ., train_College, nvmax = 20)
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Outstate        FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: exhaustive
##           PrivateYes Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1  ( 1 )  " "        "*"    " "    " "       " "       " "         " "        
## 2  ( 1 )  " "        "*"    " "    "*"       " "       " "         " "        
## 3  ( 1 )  " "        "*"    " "    "*"       "*"       " "         " "        
## 4  ( 1 )  "*"        "*"    " "    "*"       " "       " "         " "        
## 5  ( 1 )  "*"        "*"    " "    "*"       " "       " "         " "        
## 6  ( 1 )  "*"        "*"    " "    "*"       "*"       " "         " "        
## 7  ( 1 )  "*"        "*"    " "    "*"       "*"       " "         " "        
## 8  ( 1 )  "*"        "*"    " "    "*"       "*"       "*"         " "        
## 9  ( 1 )  "*"        "*"    " "    "*"       "*"       "*"         " "        
## 10  ( 1 ) "*"        "*"    " "    "*"       "*"       "*"         " "        
## 11  ( 1 ) "*"        "*"    " "    "*"       "*"       "*"         "*"        
## 12  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 13  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 14  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 15  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 16  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 17  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
##           Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 2  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 3  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 4  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 5  ( 1 )  " "      " "        " "   " "      "*" " "      " "       " "        
## 6  ( 1 )  " "      " "        " "   " "      "*" " "      " "       " "        
## 7  ( 1 )  " "      " "        " "   " "      "*" " "      " "       " "        
## 8  ( 1 )  " "      " "        " "   " "      "*" " "      " "       " "        
## 9  ( 1 )  " "      "*"        " "   " "      "*" " "      " "       " "        
## 10  ( 1 ) " "      "*"        " "   " "      "*" " "      " "       "*"        
## 11  ( 1 ) " "      "*"        " "   " "      "*" " "      " "       "*"        
## 12  ( 1 ) " "      "*"        " "   " "      "*" " "      " "       "*"        
## 13  ( 1 ) "*"      "*"        " "   " "      "*" " "      " "       "*"        
## 14  ( 1 ) "*"      "*"        " "   "*"      "*" " "      " "       "*"        
## 15  ( 1 ) "*"      "*"        " "   "*"      "*" " "      "*"       "*"        
## 16  ( 1 ) "*"      "*"        " "   "*"      "*" "*"      "*"       "*"        
## 17  ( 1 ) "*"      "*"        "*"   "*"      "*" "*"      "*"       "*"        
##           Expend Grad.Rate
## 1  ( 1 )  " "    " "      
## 2  ( 1 )  " "    " "      
## 3  ( 1 )  " "    " "      
## 4  ( 1 )  "*"    " "      
## 5  ( 1 )  "*"    " "      
## 6  ( 1 )  "*"    " "      
## 7  ( 1 )  "*"    "*"      
## 8  ( 1 )  "*"    "*"      
## 9  ( 1 )  "*"    "*"      
## 10  ( 1 ) "*"    "*"      
## 11  ( 1 ) "*"    "*"      
## 12  ( 1 ) "*"    "*"      
## 13  ( 1 ) "*"    "*"      
## 14  ( 1 ) "*"    "*"      
## 15  ( 1 ) "*"    "*"      
## 16  ( 1 ) "*"    "*"      
## 17  ( 1 ) "*"    "*"

bsub_summary=summary(best_sub)
which.max(bsub_summary$adjr2)

## [1] 13

which.min(bsub_summary$cp)

## [1] 10

which.min(bsub_summary$bic)

## [1] 6

plot(best_sub ,scale ="bic")

coef(best_sub ,6)

## (Intercept)  PrivateYes      Accept   Top10perc   Top25perc         PhD 
##  0.11851044 -0.16100835  0.89040471  0.23874422 -0.08042434 -0.04861775 
##      Expend 
##  0.07287606

Forward:

best_forward=regsubsets(Apps~., train_College, nvmax=20, method="forward")
summary(best_forward)

## Subset selection object
## Call: regsubsets.formula(Apps ~ ., train_College, nvmax = 20, method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Outstate        FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
##           PrivateYes Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1  ( 1 )  " "        "*"    " "    " "       " "       " "         " "        
## 2  ( 1 )  " "        "*"    " "    "*"       " "       " "         " "        
## 3  ( 1 )  " "        "*"    " "    "*"       "*"       " "         " "        
## 4  ( 1 )  "*"        "*"    " "    "*"       "*"       " "         " "        
## 5  ( 1 )  "*"        "*"    " "    "*"       "*"       " "         " "        
## 6  ( 1 )  "*"        "*"    " "    "*"       "*"       " "         " "        
## 7  ( 1 )  "*"        "*"    " "    "*"       "*"       " "         " "        
## 8  ( 1 )  "*"        "*"    " "    "*"       "*"       "*"         " "        
## 9  ( 1 )  "*"        "*"    " "    "*"       "*"       "*"         " "        
## 10  ( 1 ) "*"        "*"    " "    "*"       "*"       "*"         " "        
## 11  ( 1 ) "*"        "*"    " "    "*"       "*"       "*"         "*"        
## 12  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 13  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 14  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 15  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 16  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 17  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
##           Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 2  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 3  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 4  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 5  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 6  ( 1 )  " "      " "        " "   " "      "*" " "      " "       " "        
## 7  ( 1 )  " "      " "        " "   " "      "*" " "      " "       " "        
## 8  ( 1 )  " "      " "        " "   " "      "*" " "      " "       " "        
## 9  ( 1 )  " "      "*"        " "   " "      "*" " "      " "       " "        
## 10  ( 1 ) " "      "*"        " "   " "      "*" " "      " "       "*"        
## 11  ( 1 ) " "      "*"        " "   " "      "*" " "      " "       "*"        
## 12  ( 1 ) " "      "*"        " "   " "      "*" " "      " "       "*"        
## 13  ( 1 ) "*"      "*"        " "   " "      "*" " "      " "       "*"        
## 14  ( 1 ) "*"      "*"        " "   "*"      "*" " "      " "       "*"        
## 15  ( 1 ) "*"      "*"        " "   "*"      "*" " "      "*"       "*"        
## 16  ( 1 ) "*"      "*"        " "   "*"      "*" "*"      "*"       "*"        
## 17  ( 1 ) "*"      "*"        "*"   "*"      "*" "*"      "*"       "*"        
##           Expend Grad.Rate
## 1  ( 1 )  " "    " "      
## 2  ( 1 )  " "    " "      
## 3  ( 1 )  " "    " "      
## 4  ( 1 )  " "    " "      
## 5  ( 1 )  "*"    " "      
## 6  ( 1 )  "*"    " "      
## 7  ( 1 )  "*"    "*"      
## 8  ( 1 )  "*"    "*"      
## 9  ( 1 )  "*"    "*"      
## 10  ( 1 ) "*"    "*"      
## 11  ( 1 ) "*"    "*"      
## 12  ( 1 ) "*"    "*"      
## 13  ( 1 ) "*"    "*"      
## 14  ( 1 ) "*"    "*"      
## 15  ( 1 ) "*"    "*"      
## 16  ( 1 ) "*"    "*"      
## 17  ( 1 ) "*"    "*"

bforward_summary=summary(best_forward)
which.max(bforward_summary$adjr2)

## [1] 13

which.min(bforward_summary$cp)

## [1] 10

which.min(bforward_summary$bic)

## [1] 6

plot(best_forward ,scale ="bic")

coef(best_forward ,6)

## (Intercept)  PrivateYes      Accept   Top10perc   Top25perc         PhD 
##  0.11851044 -0.16100835  0.89040471  0.23874422 -0.08042434 -0.04861775 
##      Expend 
##  0.07287606

Backward:

best_backward=regsubsets(Apps~., train_College, nvmax=20, method="backward")
summary(best_backward)

## Subset selection object
## Call: regsubsets.formula(Apps ~ ., train_College, nvmax = 20, method = "backward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Outstate        FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: backward
##           PrivateYes Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1  ( 1 )  " "        "*"    " "    " "       " "       " "         " "        
## 2  ( 1 )  " "        "*"    " "    "*"       " "       " "         " "        
## 3  ( 1 )  "*"        "*"    " "    "*"       " "       " "         " "        
## 4  ( 1 )  "*"        "*"    " "    "*"       " "       " "         " "        
## 5  ( 1 )  "*"        "*"    " "    "*"       " "       " "         " "        
## 6  ( 1 )  "*"        "*"    " "    "*"       "*"       " "         " "        
## 7  ( 1 )  "*"        "*"    " "    "*"       "*"       " "         " "        
## 8  ( 1 )  "*"        "*"    " "    "*"       "*"       "*"         " "        
## 9  ( 1 )  "*"        "*"    " "    "*"       "*"       "*"         " "        
## 10  ( 1 ) "*"        "*"    " "    "*"       "*"       "*"         " "        
## 11  ( 1 ) "*"        "*"    " "    "*"       "*"       "*"         "*"        
## 12  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 13  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 14  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 15  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 16  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
## 17  ( 1 ) "*"        "*"    "*"    "*"       "*"       "*"         "*"        
##           Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 2  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 3  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 4  ( 1 )  " "      " "        " "   " "      " " " "      " "       " "        
## 5  ( 1 )  " "      " "        " "   " "      "*" " "      " "       " "        
## 6  ( 1 )  " "      " "        " "   " "      "*" " "      " "       " "        
## 7  ( 1 )  " "      " "        " "   " "      "*" " "      " "       " "        
## 8  ( 1 )  " "      " "        " "   " "      "*" " "      " "       " "        
## 9  ( 1 )  " "      "*"        " "   " "      "*" " "      " "       " "        
## 10  ( 1 ) " "      "*"        " "   " "      "*" " "      " "       "*"        
## 11  ( 1 ) " "      "*"        " "   " "      "*" " "      " "       "*"        
## 12  ( 1 ) " "      "*"        " "   " "      "*" " "      " "       "*"        
## 13  ( 1 ) "*"      "*"        " "   " "      "*" " "      " "       "*"        
## 14  ( 1 ) "*"      "*"        " "   "*"      "*" " "      " "       "*"        
## 15  ( 1 ) "*"      "*"        " "   "*"      "*" " "      "*"       "*"        
## 16  ( 1 ) "*"      "*"        " "   "*"      "*" "*"      "*"       "*"        
## 17  ( 1 ) "*"      "*"        "*"   "*"      "*" "*"      "*"       "*"        
##           Expend Grad.Rate
## 1  ( 1 )  " "    " "      
## 2  ( 1 )  " "    " "      
## 3  ( 1 )  " "    " "      
## 4  ( 1 )  "*"    " "      
## 5  ( 1 )  "*"    " "      
## 6  ( 1 )  "*"    " "      
## 7  ( 1 )  "*"    "*"      
## 8  ( 1 )  "*"    "*"      
## 9  ( 1 )  "*"    "*"      
## 10  ( 1 ) "*"    "*"      
## 11  ( 1 ) "*"    "*"      
## 12  ( 1 ) "*"    "*"      
## 13  ( 1 ) "*"    "*"      
## 14  ( 1 ) "*"    "*"      
## 15  ( 1 ) "*"    "*"      
## 16  ( 1 ) "*"    "*"      
## 17  ( 1 ) "*"    "*"

bbackward_summary=summary(best_backward)
which.max(bbackward_summary$adjr2)

## [1] 13

which.min(bbackward_summary$cp)

## [1] 10

which.min(bbackward_summary$bic)

## [1] 6

plot(best_backward ,scale ="bic")

coef(best_backward ,6)

## (Intercept)  PrivateYes      Accept   Top10perc   Top25perc         PhD 
##  0.11851044 -0.16100835  0.89040471  0.23874422 -0.08042434 -0.04861775 
##      Expend 
##  0.07287606

In all of the subset selection algorithms observed, the 6th model is the most preferred, based on the Bayesian Information Criterion value. This means it must have been undoubtedly a very good choice.

The model has the predictors Private Universities, Acceptance Rate, Top10 %, Top 5%, and the percentage of faculties with PhD.

(10 points) Using the testing data and one of the best models in question 4 (pick one you like the best), predict Apps and calculate the test MSE. Comment on your findings.

mlr_6 = lm(formula = Apps~Private+Accept+Top10perc+Top25perc+PhD,
           data = train_College)

mlr_pred = predict(mlr_6, newdata=test_College)

num_apps=cbind(obs=seq(1,nrow(test_College)),Applications=test_College[,"Apps"], mlr_pred)
head(num_apps)

##                              obs Applications    mlr_pred
## Abilene Christian University   1   -0.3575197 -0.39506594
## Adelphi University             2   -0.2404481 -0.09585787
## Alaska Pacific University      3   -0.6840292 -0.82797628
## Albertson College              4   -0.5963368 -0.43348760
## Alfred University              5   -0.3414947 -0.24188375
## Allegheny College              6   -0.1367307  0.03193602

For some Applications, it had very good prediction, but for others it had an OK prediction.

mse_mlr6=mean((test_College[,"Apps"]-mlr_pred)^2)
mse_mlr6

## [1] 0.0751811

The MSE score is also low at 0.0751811, which shows it has a very good prediction accuracy.

(15 points) Pick one of the subset selection methods from question 4 and calculate the cross validated prediction errors (Hint: R code from slide 59). Which model predicts the Apps better, the one from question 5 or the one you calculated in this step? (Note: They may be the same, depending on which you chose)

#data matrix
test_matrix=model.matrix(Apps~., data=test_College)

#validated prediction error
vl_errors =rep(NA,17)

for(i in 1:17){
  coefi=coef(best_backward ,id=i)
  predt=test_matrix[,names(coefi)] %*% coefi
  vl_errors[i]=mean((test_College[, "Apps"]-predt)^2)
}

vl_errors

##  [1] 0.10005530 0.07773430 0.07738394 0.07290140 0.07128823 0.07115771
##  [7] 0.07144513 0.07810564 0.07863283 0.07988556 0.08009753 0.07820260
## [13] 0.07594090 0.07587710 0.07617728 0.07610395 0.07610249

which.min(vl_errors)

## [1] 6

Based on the cross validation methods, the 6th model has again proven to be the best model. Which means it passed both tests of model prediction accuracy.

(15 points) Executive Summary Suppose that you are working for a university as a data scientist. One of your tasks is to develop a statistical model predicting number of applicants. Communicate your findings to the President of the university.