In this project we'll use tree-based methods to classify schools as private or public based on their features.

We’ll get our data, the College data frame, from the ISLR library.

Data

library(ISLR)
head(College)
##                              Private Apps Accept Enroll Top10perc
## Abilene Christian University     Yes 1660   1232    721        23
## Adelphi University               Yes 2186   1924    512        16
## Adrian College                   Yes 1428   1097    336        22
## Agnes Scott College              Yes  417    349    137        60
## Alaska Pacific University        Yes  193    146     55        16
## Albertson College                Yes  587    479    158        38
##                              Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University        52        2885         537     7440
## Adelphi University                  29        2683        1227    12280
## Adrian College                      50        1036          99    11250
## Agnes Scott College                 89         510          63    12960
## Alaska Pacific University           44         249         869     7560
## Albertson College                   62         678          41    13500
##                              Room.Board Books Personal PhD Terminal
## Abilene Christian University       3300   450     2200  70       78
## Adelphi University                 6450   750     1500  29       30
## Adrian College                     3750   400     1165  53       66
## Agnes Scott College                5450   450      875  92       97
## Alaska Pacific University          4120   800     1500  76       72
## Albertson College                  3335   500      675  67       73
##                              S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University      18.1          12   7041        60
## Adelphi University                12.2          16  10527        56
## Adrian College                    12.9          30   8735        54
## Agnes Scott College                7.7          37  19016        59
## Alaska Pacific University         11.9           2  10922        15
## Albertson College                  9.4          11   9727        55
df <- College
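
Before plotting anything, it's worth a quick structural sanity check; just base R here:

# Confirm column types and check for missing values
str(df)
any(is.na(df))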

EDA

Let’s take a look at this data.

library(ggplot2)

# Room & board cost vs. graduation rate, colored by school type
ggplot(df, aes(Room.Board, Grad.Rate)) + geom_point(aes(color=Private))

# Distribution of full-time undergraduate enrollment, filled by school type
ggplot(df, aes(F.Undergrad)) + geom_histogram(aes(fill=Private), color='black', bins=50)

# Distribution of graduation rates, filled by school type
ggplot(df, aes(Grad.Rate)) + geom_histogram(aes(fill=Private), color='black', bins=50)

subset(df, Grad.Rate > 100)
##                   Private Apps Accept Enroll Top10perc Top25perc
## Cazenovia College     Yes 3847   3433    527         9        35
##                   F.Undergrad P.Undergrad Outstate Room.Board Books
## Cazenovia College        1010          12     9384       4840   600
##                   Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Cazenovia College      500  22       47      14.3          20   7697
##                   Grad.Rate
## Cazenovia College       118

We see that there is actually a college with a graduation rate above 100%, which is clearly an error. Let's cap it at 100.

df['Cazenovia College', 'Grad.Rate'] <- 100
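
If we wanted to guard against any similar out-of-range values instead of patching one row by hand, capping the whole column works too; a tiny sketch using base R's pmin:

# Cap every graduation rate at 100 and verify
df$Grad.Rate <- pmin(df$Grad.Rate, 100)
max(df$Grad.Rate)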

Train Test Split

library(caTools)

set.seed(101)

sample <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, sample == TRUE)
test <- subset(df, sample == FALSE)
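
Since sample.split splits while preserving the relative ratios of the labels it's given, the Yes/No balance should be nearly identical in both pieces; a quick check:

# Class proportions should match across train and test
prop.table(table(train$Private))
prop.table(table(test$Private))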

Decision Tree

library(rpart)
tree <- rpart(Private ~., method='class', data=train)

We've created a classification tree on the training data, with Private as the response variable and all other columns as predictors. Let's make some predictions on the test set.

tree.preds <- predict(tree, test)

head(tree.preds)
##                                                  No       Yes
## Adrian College                          0.003311258 0.9966887
## Alfred University                       0.003311258 0.9966887
## Allegheny College                       0.003311258 0.9966887
## Allentown Coll. of St. Francis de Sales 0.003311258 0.9966887
## Alma College                            0.003311258 0.9966887
## Amherst College                         0.003311258 0.9966887

The predictions came back as two columns of class probabilities, but we can easily turn these into a single Yes/No label instead.

tree.preds <- as.data.frame(tree.preds)

joiner <- function(x){
  # Label as 'Yes' when the predicted probability of being private is at least 0.5
  if (x >= 0.5) {
    return('Yes')
  } else {
    return('No')
  }
}
tree.preds$Private <- sapply(tree.preds$Yes, joiner)

head(tree.preds)
##                                                  No       Yes Private
## Adrian College                          0.003311258 0.9966887     Yes
## Alfred University                       0.003311258 0.9966887     Yes
## Allegheny College                       0.003311258 0.9966887     Yes
## Allentown Coll. of St. Francis de Sales 0.003311258 0.9966887     Yes
## Alma College                            0.003311258 0.9966887     Yes
## Amherst College                         0.003311258 0.9966887     Yes
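As an aside, the same labels can be produced without a helper function; ifelse is vectorized, so this one-liner is equivalent to the joiner/sapply pair above:

# Vectorized equivalent of sapply(tree.preds$Yes, joiner)
tree.preds$Private <- ifelse(tree.preds$Yes >= 0.5, 'Yes', 'No')
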
table(tree.preds$Private, test$Private)
##      
##        No Yes
##   No   57   9
##   Yes   7 160
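From that confusion matrix we can pull a single accuracy number; a quick computation, no packages needed:

# Overall test accuracy of the tree: (57 + 160) / 233
mean(tree.preds$Private == test$Private)
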
library(rpart.plot)
prp(tree)
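
prp gives a compact diagram; if you want more detail in each node, the package's rpart.plot function renders the same tree and, for a classification model, should by default annotate each node with the fitted class, a class probability, and the share of observations:

# More detailed rendering of the same tree
rpart.plot(tree)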

Random Forest

Let’s see if the random forest model does better than the tree model!

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf.model <- randomForest(Private ~., data=train, importance = TRUE)
rf.model$confusion
##      No Yes class.error
## No  128  20  0.13513514
## Yes  11 385  0.02777778
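That matrix is built from the out-of-bag predictions, which the fitted object also stores directly; we can recompute the OOB error rate ourselves as a sanity check:

# OOB error rate from the stored out-of-bag predictions
mean(rf.model$predicted != train$Private)
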
rf.model$importance
##                       No          Yes MeanDecreaseAccuracy
## Apps        0.0264825723 1.337054e-02         0.0167996115
## Accept      0.0266512523 1.350260e-02         0.0170035344
## Enroll      0.0321939269 2.745828e-02         0.0287445317
## Top10perc   0.0092354552 4.178847e-03         0.0055778027
## Top25perc   0.0072526625 3.441528e-03         0.0044470666
## F.Undergrad 0.1482062370 7.223168e-02         0.0927247390
## P.Undergrad 0.0375883049 5.652122e-03         0.0143648091
## Outstate    0.1463307045 6.605948e-02         0.0874811571
## Room.Board  0.0186034224 1.360015e-02         0.0150377245
## Books       0.0005445485 2.386036e-05         0.0001481889
## Personal    0.0035610824 1.157282e-03         0.0018302362
## PhD         0.0099422655 5.495726e-03         0.0066942156
## Terminal    0.0039372541 5.337745e-03         0.0049689898
## S.F.Ratio   0.0294975573 8.547645e-03         0.0142013969
## perc.alumni 0.0232705720 3.067149e-03         0.0086415374
## Expend      0.0233150923 1.172527e-02         0.0148611854
## Grad.Rate   0.0183429022 4.794571e-03         0.0085484363
##             MeanDecreaseGini
## Apps                9.827595
## Accept             11.883250
## Enroll             20.307245
## Top10perc           5.525514
## Top25perc           4.352324
## F.Undergrad        42.406669
## P.Undergrad        14.677097
## Outstate           43.328004
## Room.Board         11.560456
## Books               2.355311
## Personal            3.582587
## PhD                 4.442297
## Terminal            4.481134
## S.F.Ratio          14.412133
## perc.alumni         4.904383
## Expend             10.044783
## Grad.Rate           6.843718
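
The raw importance table is hard to scan; randomForest ships a plotting helper that shows the same two measures side by side:

# Dot charts of MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(rf.model)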

Predictions

p <- predict(rf.model, test)

table(p, test$Private)
##      
## p      No Yes
##   No   56   5
##   Yes   8 164
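
To make the comparison with the single tree concrete, we can put the two test-set accuracies side by side, reusing the predictions we already have:

# Test accuracy for each model
tree.acc <- mean(tree.preds$Private == test$Private)
rf.acc <- mean(p == test$Private)
round(c(tree = tree.acc, forest = rf.acc), 3)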

Looks like the random forest did do better than our single tree: 220 of 233 test schools classified correctly (roughly 94.4%), versus 217 of 233 (roughly 93.1%) for the tree.