In this project I explore tree-based methods for classifying schools as private or public based on their features. The data is the College data frame, which is included in the ISLR library.

College is a data frame with 777 observations on the following 18 variables:

- Private: A factor with levels No and Yes indicating private or public university
- Apps: Number of applications received
- Accept: Number of applications accepted
- Enroll: Number of new students enrolled
- Top10perc: Pct. new students from top 10% of H.S. class
- Top25perc: Pct. new students from top 25% of H.S. class
- F.Undergrad: Number of fulltime undergraduates
- P.Undergrad: Number of parttime undergraduates
- Outstate: Out-of-state tuition
- Room.Board: Room and board costs
- Books: Estimated book costs
- Personal: Estimated personal spending
- PhD: Pct. of faculty with Ph.D.’s
- Terminal: Pct. of faculty with terminal degree
- S.F.Ratio: Student/faculty ratio
- perc.alumni: Pct. alumni who donate
- Expend: Instructional expenditure per student
- Grad.Rate: Graduation rate

# Load the ISLR library and check the head of College (a data frame that ships with ISLR; data() lists the available data sets). Then assign College to a data frame called df
library(ISLR)
head(College)
##                              Private Apps Accept Enroll Top10perc
## Abilene Christian University     Yes 1660   1232    721        23
## Adelphi University               Yes 2186   1924    512        16
## Adrian College                   Yes 1428   1097    336        22
## Agnes Scott College              Yes  417    349    137        60
## Alaska Pacific University        Yes  193    146     55        16
## Albertson College                Yes  587    479    158        38
##                              Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University        52        2885         537     7440
## Adelphi University                  29        2683        1227    12280
## Adrian College                      50        1036          99    11250
## Agnes Scott College                 89         510          63    12960
## Alaska Pacific University           44         249         869     7560
## Albertson College                   62         678          41    13500
##                              Room.Board Books Personal PhD Terminal
## Abilene Christian University       3300   450     2200  70       78
## Adelphi University                 6450   750     1500  29       30
## Adrian College                     3750   400     1165  53       66
## Agnes Scott College                5450   450      875  92       97
## Alaska Pacific University          4120   800     1500  76       72
## Albertson College                  3335   500      675  67       73
##                              S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University      18.1          12   7041        60
## Adelphi University                12.2          16  10527        56
## Adrian College                    12.9          30   8735        54
## Agnes Scott College                7.7          37  19016        59
## Alaska Pacific University         11.9           2  10922        15
## Albertson College                  9.4          11   9727        55
df<-College
#EDA

#Let's explore the data!
#Create a scatterplot of Grad.Rate versus Room.Board, colored by the Private column
library(ggplot2)
ggplot(df,aes(Room.Board,Grad.Rate))+geom_point(aes(color=Private))

# Create a histogram of full time undergrad students, color by Private

ggplot(df,aes(F.Undergrad))+geom_histogram(aes(fill=Private),color="black",bins=50)

# Cazenovia College has a recorded Grad.Rate above 100%, which is not possible, so cap it at 100
df['Cazenovia College','Grad.Rate']<-100
#Train Test Split
#Split the data into training and testing sets 70/30. Use the caTools library to do this

library(caTools)
set.seed(101)

sample<-sample.split(df$Private,SplitRatio = 0.7)
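# The filter needs `==`. With a single `=`, subset() ignores the mask and returns every row,
# so train and test would both be the full set of 777 schools (which is what the outputs printed below reflect).
train<-subset(df, sample==TRUE)
test<-subset(df, sample==FALSE)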
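
One way to sanity-check a split like this is to confirm that the two subsets partition the rows. The sketch below is only an illustration in base R: the toy data frame, the `keep` mask, and the loop that builds it are assumptions standing in for the College data and for sample.split(), not a reimplementation of caTools.

```r
# Hypothetical toy data: 10 rows, two classes (not the College data).
toy <- data.frame(x = 1:10,
                  label = factor(rep(c("No", "Yes"), times = c(4, 6))))

set.seed(101)
# Stratified mask: within each class, mark about 70% of the rows TRUE.
keep <- logical(nrow(toy))
for (lev in levels(toy$label)) {
  idx <- which(toy$label == lev)
  n_train <- round(0.7 * length(idx))
  keep[idx[sample.int(length(idx), n_train)]] <- TRUE
}

train_toy <- subset(toy, keep == TRUE)   # rows where the mask is TRUE
test_toy  <- subset(toy, keep == FALSE)  # the remaining rows

# Pitfall: with a single `=`, `keep = TRUE` is passed as an unused named
# argument, no filter is applied, and every row comes back.
all_rows <- subset(toy, keep = TRUE)

# The two subsets should be disjoint and together cover every row.
c(train = nrow(train_toy), test = nrow(test_toy), unfiltered = nrow(all_rows))
```

If the train and test row counts do not add up to the size of the original data, the filter is the first thing to check.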

#Decision Tree
# Use the rpart library to build a decision tree to predict whether or not a school is Private.
library(rpart)
tree<-rpart(Private~., method = 'class', data = train)
pred.tree<-predict(tree,test)
head(pred.tree)
##                                      No       Yes
## Abilene Christian University 0.35714286 0.6428571
## Adelphi University           0.00462963 0.9953704
## Adrian College               0.00462963 0.9953704
## Agnes Scott College          0.00462963 0.9953704
## Alaska Pacific University    0.08823529 0.9117647
## Albertson College            0.00462963 0.9953704
# Each row gives the predicted probability that the college is public (No) or private (Yes).
#Turn these two columns into one column to match the original Yes/No Label for a Private column

pred.tree<-as.data.frame(pred.tree)
join<-function(x){
  if (x >= 0.5){
    return('Yes')
  }
  else{
    return('No')
  }
}

pred.tree$Private<-sapply(pred.tree$Yes, join)

head(pred.tree)
##                                      No       Yes Private
## Abilene Christian University 0.35714286 0.6428571     Yes
## Adelphi University           0.00462963 0.9953704     Yes
## Adrian College               0.00462963 0.9953704     Yes
## Agnes Scott College          0.00462963 0.9953704     Yes
## Alaska Pacific University    0.08823529 0.9117647     Yes
## Albertson College            0.00462963 0.9953704     Yes
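
The same thresholding can be done without a helper function, since ifelse() is vectorized. This is a minimal sketch on a hypothetical `probs` data frame whose values are made up to resemble the predict() output above; it is not the model's actual output.

```r
# Hypothetical class probabilities, shaped like the predict() output above.
probs <- data.frame(No  = c(0.357, 0.005, 0.912),
                    Yes = c(0.643, 0.995, 0.088))

# ifelse() works on the whole column at once, so no sapply() is needed.
probs$Private <- ifelse(probs$Yes >= 0.5, "Yes", "No")

# A factor with fixed levels keeps both classes in a later table() call,
# even if one of them is never predicted.
probs$Private <- factor(probs$Private, levels = c("No", "Yes"))
probs
```

For an rpart classification tree, predict(tree, test, type = "class") should also return the labels directly, which skips the probability step entirely.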
#Now use table() to create a confusion matrix of the tree model (rows are predictions, columns are actual labels).

table(pred.tree$Private, test$Private)
##      
##        No Yes
##   No  195  14
##   Yes  17 551
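
To turn those counts into summary metrics, here is a small sketch that rebuilds the matrix by hand from the numbers printed above. Treating "Yes" (private) as the positive class is my own choice for the illustration, and the variable names are hypothetical.

```r
# Counts copied from the table() output above:
# rows are predictions, columns are actual labels.
cm <- matrix(c(195,  14,
                17, 551),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("No", "Yes"),
                             actual    = c("No", "Yes")))

# Share of schools on the diagonal, i.e. classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# With "Yes" as the positive class:
precision_yes <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # correct among predicted Yes
recall_yes    <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # correct among actual Yes

round(c(accuracy = accuracy, precision = precision_yes, recall = recall_yes), 3)
```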

**Next, we plot the decision tree.**

#Use the rpart.plot library and the prp() function to plot out my tree model

library(rpart.plot)
prp(tree)

#Now let's build out a random forest model!

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf.model<-randomForest(Private~., data=train, importance=TRUE)
rf.model$confusion
##      No Yes class.error
## No  184  28  0.13207547
## Yes  18 547  0.03185841
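
The class.error column can be reproduced by hand. In rf.model$confusion the rows are the actual classes and the columns are the out-of-bag predictions, so each class error is the off-diagonal count divided by the row total. The sketch below copies the counts printed above into a plain matrix; the variable names are hypothetical.

```r
# Counts copied from rf.model$confusion above:
# rows are actual classes, columns are out-of-bag predictions.
oob <- matrix(c(184,  28,
                 18, 547),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual    = c("No", "Yes"),
                              predicted = c("No", "Yes")))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall out-of-bag error across both classes.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 4)
round(oob_error, 4)
```

The forest misclassifies public schools (No) more often than private ones, which is worth keeping in mind given that public schools are the smaller class in this data.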