Random Forest Algorithm for Building the Prediction Model

The Stevens dataset (stevens.csv) describes US court decisions and whether the high court reversed them. The goal here is to build a model that predicts, from the given attributes of a case, whether the high court will reverse the decision.

Loading the Stevens data.

steven<-read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\stevens.csv", header = TRUE)
summary(steven)

##      Docket         Term         Circuit                  Issue    
##  00-1011:  1   Min.   :1994   9th    :122   CriminalProcedure:132  
##  00-1045:  1   1st Qu.:1995   5th    : 53   JudicialPower    :102  
##  00-1072:  1   Median :1997   11th   : 49   EconomicActivity : 98  
##  00-1073:  1   Mean   :1997   7th    : 47   CivilRights      : 74  
##  00-1089:  1   3rd Qu.:1999   4th    : 46   DueProcess       : 43  
##  00-121 :  1   Max.   :2001   8th    : 44   FirstAmendment   : 39  
##  (Other):560                  (Other):205   (Other)          : 78  
##                Petitioner               Respondent    LowerCourt 
##  OTHER              :175   OTHER             :177   conser :293  
##  CRIMINAL.DEFENDENT : 89   BUSINESS          : 80   liberal:273  
##  BUSINESS           : 79   US                : 69                
##  STATE              : 48   CRIMINAL.DEFENDENT: 58                
##  US                 : 48   STATE             : 56                
##  GOVERNMENT.OFFICIAL: 38   EMPLOYEE          : 28                
##  (Other)            : 89   (Other)           : 98                
##     Unconst          Reverse      
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000  
##  Mean   :0.2473   Mean   :0.5459  
##  3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000  
##

str(steven)

## 'data.frame':    566 obs. of  9 variables:
##  $ Docket    : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
##  $ Term      : int  1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
##  $ Circuit   : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
##  $ Issue     : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
##  $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Unconst   : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Reverse   : int  1 1 1 1 1 0 1 1 1 1 ...

Now let's split the data into a training set and a test set using the “caTools” package.

library(caTools)

## Warning: package 'caTools' was built under R version 3.3.3

set.seed(3000)
split = sample.split(steven$Reverse, SplitRatio = 0.7)
train = subset(steven, split == TRUE)
test = subset(steven, split == FALSE)
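sample.split() splits on the outcome so that both subsets keep roughly the same share of each class. The sketch below illustrates that idea in base R on a synthetic outcome vector. The helper stratified_split() and the toy data are assumptions made for illustration; this is not the caTools implementation.

```r
# Toy outcome vector (synthetic, for illustration only): 60 ones and 40 zeros.
outcome <- c(rep(1, 60), rep(0, 40))

# Hypothetical helper: draw `ratio` of the rows within each class for training.
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    # Sample positions within idx so a class with a single row is handled safely.
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

set.seed(3000)
in_train <- stratified_split(outcome, ratio = 0.7)

# Share of ones in each subset; both match the share in the full vector.
train_share <- mean(outcome[in_train])
test_share <- mean(outcome[!in_train])
print(c(train = train_share, test = test_share))
```

Keeping the class shares equal means accuracy on the test set is measured against the same mix of reversals and affirmations that the model was trained on.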

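Apply the random forest algorithm to build the prediction model, using the “randomForest” package. The outcome is converted to a factor first so that randomForest performs classification rather than regression.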

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.3.3

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

train$Reverse<-as.factor(train$Reverse)
test$Reverse<-as.factor(test$Reverse)

stevenrforest<-randomForest(Reverse~ Circuit+Issue+Petitioner+Respondent+LowerCourt+Unconst, data = train, nodesize=25, ntree=200)

predictforest<-predict(stevenrforest, newdata = test)

confmat<-table(test$Reverse, predictforest)

accuracy <- sum(diag(confmat))/sum(confmat)

accuracy

## [1] 0.6705882
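Accuracy is the sum of the confusion matrix diagonal divided by the total count. It is easier to judge next to a baseline that always predicts the most frequent class. The sketch below works through both on a made-up matrix; the counts are illustrative only and are not the confusion matrix from this analysis.

```r
# Illustrative confusion matrix (made-up counts, NOT the matrix from this analysis).
# Rows are actual classes, columns are predicted classes.
toy_confmat <- matrix(c(40, 10,
                        15, 35),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(actual = c("0", "1"),
                                      predicted = c("0", "1")))

# Accuracy: correct predictions (the diagonal) over all predictions.
toy_accuracy <- sum(diag(toy_confmat)) / sum(toy_confmat)

# Baseline: always predict the most frequent actual class.
toy_baseline <- max(rowSums(toy_confmat)) / sum(toy_confmat)

print(c(accuracy = toy_accuracy, baseline = toy_baseline))
```

For the real data, the same baseline could be computed as max(table(test$Reverse)) / nrow(test) and compared with the accuracy reported above.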

Next, try to improve the accuracy by fitting a single CART tree and choosing its complexity parameter (cp) with 10-fold cross-validation.

library(caret)

## Warning: package 'caret' was built under R version 3.3.3

## Loading required package: lattice

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.3.3

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
## 
##     margin

library(e1071)

## Warning: package 'e1071' was built under R version 3.3.3

numfolds<-trainControl(method = "cv", number = 10)
cpGrid<-expand.grid(.cp=seq(0.01, 0.5, 0.01))

train(Reverse~ Circuit+Issue+Petitioner+Respondent+LowerCourt+Unconst, data=train, method="rpart", trControl=numfolds, tuneGrid=cpGrid)

## CART 
## 
## 396 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 356, 357, 356, 357, 357, 356, ... 
## Resampling results across tuning parameters:
## 
##   cp    Accuracy   Kappa       
##   0.01  0.6131410   0.204106712
##   0.02  0.6208333   0.224034127
##   0.03  0.6183974   0.220763981
##   0.04  0.6283974   0.247385299
##   0.05  0.6437821   0.283349018
##   0.06  0.6437821   0.283349018
##   0.07  0.6437821   0.283349018
##   0.08  0.6437821   0.283349018
##   0.09  0.6437821   0.283349018
##   0.10  0.6437821   0.283349018
##   0.11  0.6437821   0.283349018
##   0.12  0.6437821   0.283349018
##   0.13  0.6437821   0.283349018
##   0.14  0.6437821   0.283349018
##   0.15  0.6437821   0.283349018
##   0.16  0.6437821   0.283349018
##   0.17  0.6437821   0.283349018
##   0.18  0.6437821   0.283349018
##   0.19  0.6212821   0.228575149
##   0.20  0.6012821   0.179080199
##   0.21  0.5759615   0.113236131
##   0.22  0.5454487   0.027680576
##   0.23  0.5454487   0.027680576
##   0.24  0.5403846  -0.002040816
##   0.25  0.5403846  -0.002040816
##   0.26  0.5453846   0.000000000
##   0.27  0.5453846   0.000000000
##   0.28  0.5453846   0.000000000
##   0.29  0.5453846   0.000000000
##   0.30  0.5453846   0.000000000
##   0.31  0.5453846   0.000000000
##   0.32  0.5453846   0.000000000
##   0.33  0.5453846   0.000000000
##   0.34  0.5453846   0.000000000
##   0.35  0.5453846   0.000000000
##   0.36  0.5453846   0.000000000
##   0.37  0.5453846   0.000000000
##   0.38  0.5453846   0.000000000
##   0.39  0.5453846   0.000000000
##   0.40  0.5453846   0.000000000
##   0.41  0.5453846   0.000000000
##   0.42  0.5453846   0.000000000
##   0.43  0.5453846   0.000000000
##   0.44  0.5453846   0.000000000
##   0.45  0.5453846   0.000000000
##   0.46  0.5453846   0.000000000
##   0.47  0.5453846   0.000000000
##   0.48  0.5453846   0.000000000
##   0.49  0.5453846   0.000000000
##   0.50  0.5453846   0.000000000
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.18.
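The cross-validation that train() runs above can be sketched by hand: assign rows to k folds, fit on k - 1 folds, score on the held-out fold, and average the accuracy for each cp. The example below is a minimal sketch that uses the kyphosis data shipped with rpart as a stand-in and a small illustrative cp grid. It is not caret's implementation, and its numbers are unrelated to the Stevens results.

```r
library(rpart)  # also provides the kyphosis example data used as a stand-in

set.seed(3000)
k <- 10
cp_values <- c(0.01, 0.05, 0.10, 0.20)  # small illustrative grid

# Assign every row to one of k folds at random.
fold_id <- sample(rep(seq_len(k), length.out = nrow(kyphosis)))

# For each cp, train on k - 1 folds and score on the held-out fold.
cv_accuracy <- sapply(cp_values, function(cp) {
  fold_acc <- sapply(seq_len(k), function(fold) {
    held_out <- fold_id == fold
    fit <- rpart(Kyphosis ~ Age + Number + Start,
                 data = kyphosis[!held_out, ],
                 method = "class", cp = cp)
    pred <- predict(fit, newdata = kyphosis[held_out, ], type = "class")
    mean(pred == kyphosis$Kyphosis[held_out])
  })
  mean(fold_acc)
})

cv_results <- data.frame(cp = cp_values, Accuracy = cv_accuracy)
print(cv_results)

# Pick the cp with the highest cross-validated accuracy.
best_cp <- cv_results$cp[which.max(cv_results$Accuracy)]
```

which.max() returns the first maximum, so ties go to the smallest cp in this sketch. In the caret output above, the tied values from cp = 0.05 to cp = 0.18 resolved to 0.18, so the two rules need not agree.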

library(rpart)  # rpart() is called directly here, so attach the package explicitly
stevenstreecv<- rpart(Reverse~ Circuit+Issue+Petitioner+Respondent+LowerCourt+Unconst, data=train, method = "class", cp=0.18)

predictCV<-predict(stevenstreecv, newdata = test, type = "class")

confmat<-table(test$Reverse, predictCV)

accuracy <- sum(diag(confmat))/sum(confmat)

accuracy

## [1] 0.7235294

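On this test split, the CART tree with the cross-validated complexity parameter (cp = 0.18) reaches an accuracy of 0.7235, compared with 0.6706 for the random forest above. The gain comes from a different, tuned model rather than from the random forest itself, and a difference measured on a single 170-case test set should be read with caution.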