For this project I will explore the use of tree methods to classify schools as Private or Public based on their features. Let's start by getting the data: the College data frame, which is included in the ISLR library.
A data frame with 777 observations on the following 18 variables:
- Private: A factor with levels No and Yes indicating private or public university
- Apps: Number of applications received
- Accept: Number of applications accepted
- Enroll: Number of new students enrolled
- Top10perc: Pct. new students from top 10% of H.S. class
- Top25perc: Pct. new students from top 25% of H.S. class
- F.Undergrad: Number of fulltime undergraduates
- P.Undergrad: Number of parttime undergraduates
- Outstate: Out-of-state tuition
- Room.Board: Room and board costs
- Books: Estimated book costs
- Personal: Estimated personal spending
- PhD: Pct. of faculty with Ph.D.'s
- Terminal: Pct. of faculty with terminal degree
- S.F.Ratio: Student/faculty ratio
- perc.alumni: Pct. alumni who donate
- Expend: Instructional expenditure per student
- Grad.Rate: Graduation rate
# Load the ISLR library and check the head of College (a data frame that ships with ISLR; use data() to confirm this). Then assign College to a data frame called df.
library(ISLR)
head(College)
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University 52 2885 537 7440
## Adelphi University 29 2683 1227 12280
## Adrian College 50 1036 99 11250
## Agnes Scott College 89 510 63 12960
## Alaska Pacific University 44 249 869 7560
## Albertson College 62 678 41 13500
## Room.Board Books Personal PhD Terminal
## Abilene Christian University 3300 450 2200 70 78
## Adelphi University 6450 750 1500 29 30
## Adrian College 3750 400 1165 53 66
## Agnes Scott College 5450 450 875 92 97
## Alaska Pacific University 4120 800 1500 76 72
## Albertson College 3335 500 675 67 73
## S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University 18.1 12 7041 60
## Adelphi University 12.2 16 10527 56
## Adrian College 12.9 30 8735 54
## Agnes Scott College 7.7 37 19016 59
## Alaska Pacific University 11.9 2 10922 15
## Albertson College 9.4 11 9727 55
df<-College
#EDA
#Let's explore the data!
#Create a scatterplot of Grad.Rate versus Room.Board, colored by the Private column
library(ggplot2)
ggplot(df,aes(Room.Board,Grad.Rate))+geom_point(aes(color=Private))
# Create a histogram of full time undergrad students, color by Private
ggplot(df,aes(F.Undergrad))+geom_histogram(aes(fill=Private),color="black",bins=50)
# Cazenovia College has a recorded graduation rate above 100%, so cap it at 100
df['Cazenovia College','Grad.Rate']<-100
#Train Test Split
#Split the data into training and testing sets 70/30. Use the caTools library to do this
library(caTools)
set.seed(101)
sample<-sample.split(df$Private,SplitRatio = 0.7)
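# Filter with `==`. Written as `sample=TRUE`, the argument is not a filter and subset() returns all 777 rows,
# which is what the printed outputs in this document reflect; a true 70/30 split will give different numbers.
train<-subset(df, sample==TRUE)
test<-subset(df, sample==FALSE)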
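To see why the comparison operator matters in `subset()`, here is a minimal sketch in base R. The data frame `toy` and the logical vector `keep` are hypothetical stand-ins for `df` and the output of `sample.split()`. With `==` the rows are filtered; with a single `=` the expression becomes a named argument that `subset()` does not use as a filter, so every row comes back.

```r
# Hypothetical toy data standing in for df
toy <- data.frame(id = 1:10, label = factor(rep(c("No", "Yes"), each = 5)))

# Hypothetical logical vector standing in for the sample.split() result
keep <- rep(c(TRUE, TRUE, TRUE, FALSE, FALSE), times = 2)

train_toy <- subset(toy, keep == TRUE)   # rows where keep is TRUE
test_toy  <- subset(toy, keep == FALSE)  # the complementary rows
all_rows  <- subset(toy, keep = TRUE)    # named argument, not a filter: all rows returned

print(c(train = nrow(train_toy), test = nrow(test_toy), unfiltered = nrow(all_rows)))
```

Checking `nrow()` on the training and test sets right after splitting is a cheap way to catch this kind of slip.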
#Decision Tree
# Use the rpart library to build a decision tree to predict whether or not a school is Private.
library(rpart)
tree<-rpart(Private~., method = 'class', data = train)
pred.tree<-predict(tree,test)
head(pred.tree)
## No Yes
## Abilene Christian University 0.35714286 0.6428571
## Adelphi University 0.00462963 0.9953704
## Adrian College 0.00462963 0.9953704
## Agnes Scott College 0.00462963 0.9953704
## Alaska Pacific University 0.08823529 0.9117647
## Albertson College 0.00462963 0.9953704
# Each row gives the predicted probability that a college is public (No) or private (Yes).
# Turn these two columns into one column that matches the original Yes/No labels of the Private column.
pred.tree<-as.data.frame(pred.tree)
join<-function(x){
  # Label a school as private when its predicted probability of 'Yes' is at least 0.5
  if (x >= 0.5){
    return('Yes')
  } else {
    return('No')
  }
}
pred.tree$Private<-sapply(pred.tree$Yes, join)
head(pred.tree)
## No Yes Private
## Abilene Christian University 0.35714286 0.6428571 Yes
## Adelphi University 0.00462963 0.9953704 Yes
## Adrian College 0.00462963 0.9953704 Yes
## Agnes Scott College 0.00462963 0.9953704 Yes
## Alaska Pacific University 0.08823529 0.9117647 Yes
## Albertson College 0.00462963 0.9953704 Yes
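The same thresholding can be done without a helper function, because `ifelse()` is vectorized. The sketch below is only an illustration: `p_yes` is a hypothetical vector standing in for `pred.tree$Yes`, and the 0.5 cutoff mirrors the one used in `join()`. Wrapping the result in `factor()` with explicit levels keeps the labels comparable to the `Private` column.

```r
# Hypothetical probabilities standing in for pred.tree$Yes
p_yes <- c(0.6428571, 0.9953704, 0.0882353, 0.5, 0.2)

# Vectorized thresholding at 0.5, returned as a factor with the same levels as Private
labels <- factor(ifelse(p_yes >= 0.5, "Yes", "No"), levels = c("No", "Yes"))

print(labels)
```

For rpart classification trees, `predict(tree, test, type = "class")` returns predicted class labels directly, which avoids the manual conversion altogether.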
#Now use table() to create a confusion matrix of the tree model (rows: predicted, columns: actual).
table(pred.tree$Private, test$Private)
##
## No Yes
## No 195 14
## Yes 17 551
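The counts in a confusion matrix can be turned into summary metrics. The sketch below re-enters the four printed counts by hand (rows are predictions, columns are actual labels) rather than using the `table()` result itself; the variable names `cm`, `accuracy`, `sensitivity`, and `specificity` are illustrative choices, with "Yes" (private) treated as the positive class.

```r
# Counts copied from the printed confusion matrix (rows: predicted, columns: actual)
cm <- matrix(c(195, 17, 14, 551), nrow = 2,
             dimnames = list(predicted = c("No", "Yes"), actual = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)              # share of correct predictions
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # private schools correctly identified
specificity <- cm["No", "No"] / sum(cm[, "No"])     # public schools correctly identified

print(round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4))
```

Note that the four counts sum to 777, the size of the full data set, so these figures describe performance on all rows rather than on a held-out 30% sample.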
**Next, we plot the decision tree.**
#Use the rpart.plot library and the prp() function to plot out my tree model
library(rpart.plot)
prp(tree)
#Now let's build out a random forest model!
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
rf.model<-randomForest(Private~., data=train, importance=TRUE)
rf.model$confusion
## No Yes class.error
## No 184 28 0.13207547
## Yes 18 547 0.03185841
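The `class.error` column can be reproduced from the counts next to it. In randomForest, the `confusion` component is computed from out-of-bag predictions, with actual classes in the rows and predicted classes in the columns. The sketch below re-enters the printed counts by hand; `rf_cm`, `class_error`, and `overall_error` are illustrative names, not part of the package.

```r
# Counts copied from the printed rf.model$confusion (rows: actual, columns: predicted)
rf_cm <- matrix(c(184, 18, 28, 547), nrow = 2,
                dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))

# Per-class error: misclassified cases in a row divided by the row total
class_error <- 1 - diag(rf_cm) / rowSums(rf_cm)

# Overall error rate across both classes
overall_error <- 1 - sum(diag(rf_cm)) / sum(rf_cm)

print(round(class_error, 8))
print(round(overall_error, 4))
```

The per-class values match the printed `class.error` column: public schools (No) are misclassified more often than private ones (Yes).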