Get the Data

Call the ISLR library and check the head of College (a built-in data frame with ISLR, use data() to check this.) Then reassign College to a dataframe called df

library(ISLR)
library(tidyverse)
library(caTools)
library(rpart)
library(rpart.plot)
head(College)
##                              Private Apps Accept Enroll Top10perc
## Abilene Christian University     Yes 1660   1232    721        23
## Adelphi University               Yes 2186   1924    512        16
## Adrian College                   Yes 1428   1097    336        22
## Agnes Scott College              Yes  417    349    137        60
## Alaska Pacific University        Yes  193    146     55        16
## Albertson College                Yes  587    479    158        38
##                              Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University        52        2885         537     7440
## Adelphi University                  29        2683        1227    12280
## Adrian College                      50        1036          99    11250
## Agnes Scott College                 89         510          63    12960
## Alaska Pacific University           44         249         869     7560
## Albertson College                   62         678          41    13500
##                              Room.Board Books Personal PhD Terminal
## Abilene Christian University       3300   450     2200  70       78
## Adelphi University                 6450   750     1500  29       30
## Adrian College                     3750   400     1165  53       66
## Agnes Scott College                5450   450      875  92       97
## Alaska Pacific University          4120   800     1500  76       72
## Albertson College                  3335   500      675  67       73
##                              S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University      18.1          12   7041        60
## Adelphi University                12.2          16  10527        56
## Adrian College                    12.9          30   8735        54
## Agnes Scott College                7.7          37  19016        59
## Alaska Pacific University         11.9           2  10922        15
## Albertson College                  9.4          11   9727        55
df <- College

EDA

Create a scatterplot of Grad.Rate versus Room.Board, colored by the Private column.

df %>% 
  ggplot(aes(x=Grad.Rate,y=Room.Board,color=Private))+geom_point()

Create a histogram of full time undergrad students, color by Private.

df %>% 
  ggplot(aes(x=F.Undergrad,fill=Private))+geom_histogram(bins = 50,color="black",position=position_stack(reverse=TRUE))

Create a histogram of Grad.Rate colored by Private. You should see something odd here.

df %>% 
  ggplot(aes(x=Grad.Rate,fill=Private))+geom_histogram(bins = 50,color="black",position=position_stack(reverse=TRUE))

What college had a Graduation Rate of above 100%? Answer: Cazenovia College

row.names(df[df$Grad.Rate>100,])
## [1] "Cazenovia College"

Change that college’s grad rate to 100%

df["Cazenovia College","Grad.Rate"] <- 100

Train Test Split

Split your data into training and testing sets 70/30. Use the caTools library to do this.

seed <- 103
set.seed(seed)
sample <- sample.split(df$Private, SplitRatio = 0.70)
train = subset(df, sample == TRUE)
test = subset(df, sample == FALSE)

Decision Tree

Use the rpart library to build a decision tree to predict whether or not a school is Private. Remember to only build your tree off the training data.

tree_model <- rpart(Private ~ . , method='class', data= train)
prp(tree_model,main="Tree Model")

Use predict() to predict the Private label on the test data.

label <- "Private"
predicted_values <- predict(tree_model, test[,-which(colnames(test)==label)])

Check the Head of the predicted values. You should notice that you actually have two columns with the probabilities.

head(predicted_values)
##                                                  No        Yes
## Allentown Coll. of St. Francis de Sales 0.009463722 0.99053628
## Alma College                            0.009463722 0.99053628
## Alverno College                         0.009463722 0.99053628
## Arkansas Tech University                0.940677966 0.05932203
## Auburn University-Main Campus           0.940677966 0.05932203
## Augustana College                       0.009463722 0.99053628

Turn these two columns into one column to match the original Yes/No Label for a Private column.

test$pred_tree <- predict(tree_model, test[,-which(colnames(test)==label)])[,"Yes"]
classification_point <- 0.5
test$predclass_tree <- ifelse(test$pred_tree>classification_point,'Yes','No')

Now use table() to create a confusion matrix of your tree model.

table(test$Private,test$predclass_tree)
##      
##        No Yes
##   No   52  12
##   Yes   8 161

Use the rpart.plot library and the prp() function to plot out your tree model.

prp(tree_model)

Random Forest

Now let’s build out a random forest model!

Call the randomForest package library

library(randomForest)

Now use randomForest() to build out a model to predict Private class. Add importance=TRUE as a parameter in the model. (Use help(randomForest) to find out what this does.

set.seed(seed)
rf_base_model <- randomForest(Private ~ ., data=train, importance= TRUE)

What was your model’s confusion matrix on its own training set? Use model$confusion.

rf_base_model$confusion
##      No Yes class.error
## No  131  17  0.11486486
## Yes  12 384  0.03030303

Grab the feature importance with model$importance. Refer to the reading for more info on what Gini[1] means.[2]

rf_base_model$importance
##                       No          Yes MeanDecreaseAccuracy
## Apps        0.0238133255 0.0101755361         0.0138568325
## Accept      0.0307921437 0.0122798504         0.0173249080
## Enroll      0.0415971547 0.0228858671         0.0278180143
## Top10perc   0.0059311386 0.0021041886         0.0030828902
## Top25perc   0.0048959431 0.0015434040         0.0024185770
## F.Undergrad 0.1556326507 0.0571528195         0.0835219385
## P.Undergrad 0.0633075492 0.0100878893         0.0245231721
## Outstate    0.1777833560 0.0512986709         0.0852932262
## Room.Board  0.0273393708 0.0121137940         0.0161435426
## Books       0.0001046904 0.0001805146         0.0001845344
## Personal    0.0067220612 0.0007998621         0.0024234479
## PhD         0.0110686904 0.0033653248         0.0054527819
## Terminal    0.0068069420 0.0044710436         0.0050534675
## S.F.Ratio   0.0509308681 0.0096446641         0.0206102534
## perc.alumni 0.0245073499 0.0031507079         0.0089556339
## Expend      0.0261864323 0.0088920828         0.0134741528
## Grad.Rate   0.0081583860 0.0027409916         0.0041937576
##             MeanDecreaseGini
## Apps                9.072187
## Accept             12.462075
## Enroll             20.243780
## Top10perc           3.977075
## Top25perc           3.359381
## F.Undergrad        37.408712
## P.Undergrad        18.314179
## Outstate           46.981655
## Room.Board         11.638274
## Books               2.027928
## Personal            4.979909
## PhD                 3.809210
## Terminal            3.954021
## S.F.Ratio          18.740341
## perc.alumni         5.463619
## Expend              8.734074
## Grad.Rate           4.555214
varImpPlot(rf_base_model,type=1,main="Variable Importance Plot, Random Forest")

Predictions: Now use your random forest model to predict on your test set!

test$pred_class_RF <- predict(rf_base_model,newdata = test[,-which(colnames(test)==label)])

table(test$Private,test$pred_class_RF)
##      
##        No Yes
##   No   51  13
##   Yes   6 163

Testing Out Caret

I wanted to further stretch myself on this project, so I studied up on the caret package, which I found very positive feedback on. The ability to centralize data pre-processing, model selection, and tuning into a single package is very appealing, as is the standardized language to apply predictions to the holdout dataset.

I’m first going to call the caret package:

library(caret)

There is no pre-processing to be done on this dataset, so I will skip straight to setting up train control settings. These setting specify the settings for resampling and cross validation. To name just a few, methods that can be called here are ‘boot’ for boostrap resampling, ‘cv’ for cross-validation, ‘repeatedcv’ for repeated cross validation, and ‘none’ to fit one model to the entire training dataset. For repeated cross validation, I’ve chosen 10 folds with 3 repeats.

set.seed(seed)
fitControl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 3)

Next, I can specify the parameters I want to grid search along, as well as the specific values to search. ‘m’ is the only parameter in the random forest model, so I will search along m between 2 and 15

grid <- expand.grid(.mtry=c(seq(2:15)))

Now I can train the RandomForest model in caret, optimizing for accuracy:

rf_caret_model <- train(
  Private~.
  ,method = "rf"
  ,data=train          
  ,trControl=fitControl
  ,tuneGrid = grid
  , metric="Accuracy")
print(rf_caret_model)
## Random Forest 
## 
## 544 samples
##  17 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 490, 489, 490, 489, 491, 490, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.9428793  0.8505318
##    2    0.9410607  0.8474642
##    3    0.9453601  0.8580377
##    4    0.9466059  0.8627959
##    5    0.9441588  0.8563263
##    6    0.9441816  0.8564876
##    7    0.9441592  0.8564225
##    8    0.9442045  0.8560237
##    9    0.9447994  0.8578371
##   10    0.9417466  0.8503610
##   11    0.9405116  0.8471336
##   12    0.9386822  0.8422321
##   13    0.9392999  0.8435310
##   14    0.9380766  0.8403647
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.

We can see that the best model utilizes m=4 by calling model$bestTune.

The plots below shows training accuracy as a function of m, to give an indication of the best model found through the grid search.

rf_caret_model$bestTune
##   mtry
## 4    4
plot(rf_caret_model,main="RF Model Tuning", ylab="Accuracy",xlab="parameter m, number of randomly selected predictors")

Comparing this variable importance plot to the random forest model generated by the randomForest package, very little has changed. Out of State tuition and number of fulltime undergraduates are far and away the most impactful variables. Cost of books was the least importance variable in both models.

plot(varImp(rf_caret_model),main="Caret-tuned variable importance plot")

The confusion matric of the best random forest model tuned through caret can be seen below.

test$pred_class_RF_tuned <- predict(rf_caret_model,test)
table(test$Private,test$pred_class_RF_tuned)
##      
##        No Yes
##   No   51  13
##   Yes   5 164

Model Comparison

With 3 models and 3 accuracies, it’s very easy to compare in a single dataframe and identify that the caret-tuned Random Forest Model is the optimal model with accuracy of .923

Models <- c("Tree","RandomForest","TunedRandomForest")
Model_Accuracies <- c(mean(test$Private==test$predclass_tree),
                      mean(test$Private==test$pred_class_RF),
                      mean(test$Private==test$pred_class_RF_tuned))
Model_comparison <- tibble(Models,Model_Accuracies)
Model_comparison<- Model_comparison %>% 
  arrange(desc(Model_Accuracies))
print(Model_comparison)
## # A tibble: 3 x 2
##   Models            Model_Accuracies
##   <chr>                        <dbl>
## 1 TunedRandomForest            0.923
## 2 RandomForest                 0.918
## 3 Tree                         0.914