In this project we will use tree methods to classify schools as private or public based on their features. We’ll get our data, the College data frame, from the ISLR package.
library(ISLR)
head(College)
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University 52 2885 537 7440
## Adelphi University 29 2683 1227 12280
## Adrian College 50 1036 99 11250
## Agnes Scott College 89 510 63 12960
## Alaska Pacific University 44 249 869 7560
## Albertson College 62 678 41 13500
## Room.Board Books Personal PhD Terminal
## Abilene Christian University 3300 450 2200 70 78
## Adelphi University 6450 750 1500 29 30
## Adrian College 3750 400 1165 53 66
## Agnes Scott College 5450 450 875 92 97
## Alaska Pacific University 4120 800 1500 76 72
## Albertson College 3335 500 675 67 73
## S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University 18.1 12 7041 60
## Adelphi University 12.2 16 10527 56
## Adrian College 12.9 30 8735 54
## Agnes Scott College 7.7 37 19016 59
## Alaska Pacific University 11.9 2 10922 15
## Albertson College 9.4 11 9727 55
df <- College
Let’s take a look at this data.
library(ggplot2)

# Scatterplot of graduation rate against room and board costs, colored by school type
ggplot(df, aes(Room.Board, Grad.Rate)) + geom_point(aes(color=Private))

# Histograms of full-time undergrad enrollment and graduation rate, filled by school type
ggplot(df, aes(F.Undergrad)) + geom_histogram(aes(fill=Private), color='black', bins=50)
ggplot(df, aes(Grad.Rate)) + geom_histogram(aes(fill=Private), color='black', bins=50)
The graduation rate histogram shows a value above 100%, which shouldn’t be possible. Let’s track down that school.

subset(df, Grad.Rate > 100)
## Private Apps Accept Enroll Top10perc Top25perc
## Cazenovia College Yes 3847 3433 527 9 35
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Cazenovia College 1010 12 9384 4840 600
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Cazenovia College 500 22 47 14.3 20 7697
## Grad.Rate
## Cazenovia College 118
Cazenovia College reports a graduation rate of 118%, which is clearly a data error. Let’s cap it at 100.

df['Cazenovia College', 'Grad.Rate'] <- 100
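To confirm the fix, we can re-run the same check; it should now return zero rows.

# Sanity check: no school should have Grad.Rate above 100 anymore
subset(df, Grad.Rate > 100)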
Now let’s split the data into a training set and a test set, using a 70/30 split.

library(caTools)
set.seed(101)

# Split on the Private column so both sets keep the same class balance
sample <- sample.split(df$Private, SplitRatio = 0.70)
train <- subset(df, sample == TRUE)
test <- subset(df, sample == FALSE)
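As a quick sanity check (not part of the original walkthrough), we can verify that sample.split preserved the Yes/No proportions in both sets:

# Class proportions should be nearly identical in train and test
prop.table(table(train$Private))
prop.table(table(test$Private))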
Next we’ll fit a decision tree with the rpart package, using Private as the response and method='class' since this is a classification problem.

library(rpart)
tree <- rpart(Private ~ ., method='class', data=train)
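Before predicting, we can optionally inspect the fitted tree’s complexity parameter table with rpart’s printcp() to see how the cross-validated error changes as splits are added:

# Show the cp table: number of splits, relative error, cross-validated error
printcp(tree)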
Now let’s make some predictions on the test set.
tree.preds <- predict(tree, test)
head(tree.preds)
## No Yes
## Adrian College 0.003311258 0.9966887
## Alfred University 0.003311258 0.9966887
## Allegheny College 0.003311258 0.9966887
## Allentown Coll. of St. Francis de Sales 0.003311258 0.9966887
## Alma College 0.003311258 0.9966887
## Amherst College 0.003311258 0.9966887
We got back two columns of class probabilities, one per class, but we can easily turn these into a Yes/No label instead.
tree.preds <- as.data.frame(tree.preds)
joiner <- function(x){
    if (x >= 0.5) {
        return('Yes')
    } else {
        return('No')
    }
}
tree.preds$Private <- sapply(tree.preds$Yes, joiner)
head(tree.preds)
## No Yes Private
## Adrian College 0.003311258 0.9966887 Yes
## Alfred University 0.003311258 0.9966887 Yes
## Allegheny College 0.003311258 0.9966887 Yes
## Allentown Coll. of St. Francis de Sales 0.003311258 0.9966887 Yes
## Alma College 0.003311258 0.9966887 Yes
## Amherst College 0.003311258 0.9966887 Yes
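As an aside, rpart’s predict() can also return hard class labels directly with type='class', which skips the thresholding step entirely; a minimal equivalent sketch:

# Ask predict() for class labels instead of probabilities
class.preds <- predict(tree, test, type='class')
head(class.preds)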
Now we can compare our labels against the true labels with a confusion matrix.

table(tree.preds$Private, test$Private)
##      
##       No Yes
##   No  57   9
##   Yes  7 160
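From this confusion matrix we can also compute an overall test accuracy for the tree, which will be handy for comparison later:

# Fraction of correct predictions: (57 + 160) / 233, roughly 93%
mean(tree.preds$Private == test$Private)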
We can visualize the fitted tree with the rpart.plot package.

library(rpart.plot)
prp(tree)
Let’s see if the random forest model does better than the tree model!
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Train a random forest; importance = TRUE stores variable importance measures
rf.model <- randomForest(Private ~ ., data=train, importance = TRUE)

The model object keeps a confusion matrix built from its out-of-bag (OOB) predictions on the training data; note that this is not the test set.

rf.model$confusion
##      No Yes class.error
## No  128  20  0.13513514
## Yes  11 385  0.02777778
We also asked the model to track variable importance, which we can pull out directly:

rf.model$importance
## No Yes MeanDecreaseAccuracy
## Apps 0.0264825723 1.337054e-02 0.0167996115
## Accept 0.0266512523 1.350260e-02 0.0170035344
## Enroll 0.0321939269 2.745828e-02 0.0287445317
## Top10perc 0.0092354552 4.178847e-03 0.0055778027
## Top25perc 0.0072526625 3.441528e-03 0.0044470666
## F.Undergrad 0.1482062370 7.223168e-02 0.0927247390
## P.Undergrad 0.0375883049 5.652122e-03 0.0143648091
## Outstate 0.1463307045 6.605948e-02 0.0874811571
## Room.Board 0.0186034224 1.360015e-02 0.0150377245
## Books 0.0005445485 2.386036e-05 0.0001481889
## Personal 0.0035610824 1.157282e-03 0.0018302362
## PhD 0.0099422655 5.495726e-03 0.0066942156
## Terminal 0.0039372541 5.337745e-03 0.0049689898
## S.F.Ratio 0.0294975573 8.547645e-03 0.0142013969
## perc.alumni 0.0232705720 3.067149e-03 0.0086415374
## Expend 0.0233150923 1.172527e-02 0.0148611854
## Grad.Rate 0.0183429022 4.794571e-03 0.0085484363
## MeanDecreaseGini
## Apps 9.827595
## Accept 11.883250
## Enroll 20.307245
## Top10perc 5.525514
## Top25perc 4.352324
## F.Undergrad 42.406669
## P.Undergrad 14.677097
## Outstate 43.328004
## Room.Board 11.560456
## Books 2.355311
## Personal 3.582587
## PhD 4.442297
## Terminal 4.481134
## S.F.Ratio 14.412133
## perc.alumni 4.904383
## Expend 10.044783
## Grad.Rate 6.843718
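Raw numbers like these are easier to digest visually; randomForest ships a varImpPlot() helper that plots the same measures as dotcharts:

# Dotcharts of MeanDecreaseAccuracy and MeanDecreaseGini for each predictor
varImpPlot(rf.model)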
Now let’s evaluate the forest on the held-out test set.

p <- predict(rf.model, test)
table(p, test$Private)
##        
## p      No Yes
##   No   56   5
##   Yes   8 164
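To put a number on the comparison, we can compute the forest’s test accuracy the same way as before, about 94% versus roughly 93% for the single tree:

# Fraction of correct predictions: (56 + 164) / 233
mean(p == test$Private)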
Looks like the random forest did indeed do a bit better than our single tree!