The Stevens dataset records decisions of US courts and whether or not the high court reversed them. Here I build a prediction model for whether the high court will reverse a decision, based on the given aspects of the case.
Loading the Stevens data.
steven<-read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\stevens.csv", header = TRUE)
summary(steven)
##      Docket         Term         Circuit                  Issue
##  00-1011:  1   Min.   :1994   9th    :122   CriminalProcedure:132
##  00-1045:  1   1st Qu.:1995   5th    : 53   JudicialPower    :102
##  00-1072:  1   Median :1997   11th   : 49   EconomicActivity : 98
##  00-1073:  1   Mean   :1997   7th    : 47   CivilRights      : 74
##  00-1089:  1   3rd Qu.:1999   4th    : 46   DueProcess       : 43
##  00-121 :  1   Max.   :2001   8th    : 44   FirstAmendment   : 39
##  (Other):560                  (Other):205   (Other)          : 78
##                Petitioner               Respondent    LowerCourt
##  OTHER              :175   OTHER             :177   conser :293
##  CRIMINAL.DEFENDENT : 89   BUSINESS          : 80   liberal:273
##  BUSINESS           : 79   US                : 69
##  STATE              : 48   CRIMINAL.DEFENDENT: 58
##  US                 : 48   STATE             : 56
##  GOVERNMENT.OFFICIAL: 38   EMPLOYEE          : 28
##  (Other)            : 89   (Other)           : 98
##     Unconst          Reverse
##  Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.0000   1st Qu.:0.0000
##  Median :0.0000   Median :1.0000
##  Mean   :0.2473   Mean   :0.5459
##  3rd Qu.:0.0000   3rd Qu.:1.0000
##  Max.   :1.0000   Max.   :1.0000
##
str(steven)
## 'data.frame': 566 obs. of 9 variables:
## $ Docket : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
## $ Term : int 1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
## $ Circuit : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
## $ Issue : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
## $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
## $ Unconst : int 0 0 0 0 0 1 0 1 0 0 ...
## $ Reverse : int 1 1 1 1 1 0 1 1 1 1 ...
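The str() output above shows the text columns as factors. That was the read.csv default in the R 3.3.x session used here; from R 4.0.0 onward the default is stringsAsFactors = FALSE, so text columns come back as character unless factors are requested explicitly. A minimal sketch of the explicit form, using a tiny made-up CSV (the rows and the `demo` name are illustrative assumptions, not stevens.csv):

```r
# Write a tiny stand-in CSV (made-up rows, not stevens.csv) to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Docket,Term,LowerCourt,Reverse",
             "A-1,1994,liberal,1",
             "A-2,1995,conser,0",
             "A-3,1996,conser,1"), tmp)

# stringsAsFactors = TRUE turns text columns into factors on any R version.
demo <- read.csv(tmp, header = TRUE, stringsAsFactors = TRUE)
str(demo)

# Remove the temporary file again.
unlink(tmp)
```

Adding stringsAsFactors = TRUE to the read.csv call for the real file should give the same factor columns on a current R installation.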
Now let's split the data into a training set and a test set using the “caTools” package.
library(caTools)
## Warning: package 'caTools' was built under R version 3.3.3
set.seed(3000)
split = sample.split(steven$Reverse, SplitRatio = 0.7)
train = subset(steven, split == TRUE)
test = subset(steven, split == FALSE)
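sample.split keeps the relative proportion of the outcome labels the same in both parts, so the share of reversed cases is similar in train and test. The idea can be sketched in base R as follows; `stratified_split` is a hypothetical helper written for illustration, not the caTools implementation, and the label vector is made up:

```r
# Hypothetical helper: draw `ratio` of the rows from each class separately,
# so that the class proportions are preserved in the training part.
stratified_split <- function(y, ratio, seed = 1) {
  set.seed(seed)
  in_train <- logical(length(y))
  for (lev in unique(y)) {
    idx <- which(y == lev)
    n_train <- round(length(idx) * ratio)
    # sample.int avoids the surprise of sample() on a length-one vector
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

# Made-up outcome: 40 zeros and 60 ones.
y <- rep(c(0, 1), times = c(40, 60))
in_train <- stratified_split(y, ratio = 0.7)

# Class shares in the two parts (both 0.4 / 0.6 for this toy vector).
prop.table(table(y[in_train]))
prop.table(table(y[!in_train]))
```

A plain random split would only match these proportions on average, which matters more for small or unbalanced datasets.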
Applying the random forest algorithm to build the prediction model using the “randomForest” package.
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
train$Reverse<-as.factor(train$Reverse)
test$Reverse<-as.factor(test$Reverse)
stevenrforest<-randomForest(Reverse~ Circuit+Issue+Petitioner+Respondent+LowerCourt+Unconst, data = train, nodesize=25, ntree=200)
predictforest<-predict(stevenrforest, newdata = test)
confmat<-table(test$Reverse, predictforest)
accuracy <- sum(diag(confmat))/sum(confmat)
accuracy
## [1] 0.6705882
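An accuracy figure is easier to judge next to a baseline that always predicts the most frequent class. The sketch below shows both calculations on made-up labels (not the Stevens test set); `accuracy_from_table` and `baseline_accuracy` are hypothetical helper names introduced only for this illustration:

```r
# Accuracy from a confusion matrix: correct predictions lie on the diagonal.
# This assumes a square table, i.e. both factors share the same levels.
accuracy_from_table <- function(confmat) {
  sum(diag(confmat)) / sum(confmat)
}

# Accuracy of always predicting the most frequent observed class.
baseline_accuracy <- function(actual) {
  max(table(actual)) / length(actual)
}

# Made-up labels for illustration only.
actual    <- factor(c(1, 1, 0, 1, 0, 1, 0, 1, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1), levels = c(0, 1))

confmat <- table(actual, predicted)
accuracy_from_table(confmat)   # 7 of 10 correct: 0.7
baseline_accuracy(actual)      # 6 of 10 are class 1: 0.6
```

Applied to the real data, the same two calls would take table(test$Reverse, predictforest) and test$Reverse as inputs.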
Choosing the complexity parameter (cp) of a CART tree with 10-fold cross-validation, to try to improve on this accuracy.
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
library(e1071)
## Warning: package 'e1071' was built under R version 3.3.3
numfolds<-trainControl(method = "cv", number = 10)
cpGrid<-expand.grid(.cp=seq(0.01, 0.5, 0.01))
train(Reverse~ Circuit+Issue+Petitioner+Respondent+LowerCourt+Unconst, data=train, method="rpart", trControl=numfolds, tuneGrid=cpGrid)
## CART
##
## 396 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 356, 357, 356, 357, 357, 356, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01 0.6131410 0.204106712
## 0.02 0.6208333 0.224034127
## 0.03 0.6183974 0.220763981
## 0.04 0.6283974 0.247385299
## 0.05 0.6437821 0.283349018
## 0.06 0.6437821 0.283349018
## 0.07 0.6437821 0.283349018
## 0.08 0.6437821 0.283349018
## 0.09 0.6437821 0.283349018
## 0.10 0.6437821 0.283349018
## 0.11 0.6437821 0.283349018
## 0.12 0.6437821 0.283349018
## 0.13 0.6437821 0.283349018
## 0.14 0.6437821 0.283349018
## 0.15 0.6437821 0.283349018
## 0.16 0.6437821 0.283349018
## 0.17 0.6437821 0.283349018
## 0.18 0.6437821 0.283349018
## 0.19 0.6212821 0.228575149
## 0.20 0.6012821 0.179080199
## 0.21 0.5759615 0.113236131
## 0.22 0.5454487 0.027680576
## 0.23 0.5454487 0.027680576
## 0.24 0.5403846 -0.002040816
## 0.25 0.5403846 -0.002040816
## 0.26 0.5453846 0.000000000
## 0.27 0.5453846 0.000000000
## 0.28 0.5453846 0.000000000
## 0.29 0.5453846 0.000000000
## 0.30 0.5453846 0.000000000
## 0.31 0.5453846 0.000000000
## 0.32 0.5453846 0.000000000
## 0.33 0.5453846 0.000000000
## 0.34 0.5453846 0.000000000
## 0.35 0.5453846 0.000000000
## 0.36 0.5453846 0.000000000
## 0.37 0.5453846 0.000000000
## 0.38 0.5453846 0.000000000
## 0.39 0.5453846 0.000000000
## 0.40 0.5453846 0.000000000
## 0.41 0.5453846 0.000000000
## 0.42 0.5453846 0.000000000
## 0.43 0.5453846 0.000000000
## 0.44 0.5453846 0.000000000
## 0.45 0.5453846 0.000000000
## 0.46 0.5453846 0.000000000
## 0.47 0.5453846 0.000000000
## 0.48 0.5453846 0.000000000
## 0.49 0.5453846 0.000000000
## 0.50 0.5453846 0.000000000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.18.
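The "Summary of sample sizes" line above (356, 357, ...) follows from splitting the 396 training rows into 10 folds: each model is fitted on nine folds and scored on the one left out. A rough base R sketch of the fold assignment, assuming plain random folds (caret additionally balances the outcome classes across folds, which this sketch does not do):

```r
n <- 396   # training rows reported by caret above
k <- 10    # number of folds

set.seed(1)  # arbitrary seed, only to make the shuffle repeatable
fold_id <- sample(rep(seq_len(k), length.out = n))

# Size of each held-out fold and of the matching training portion.
fold_sizes  <- table(fold_id)
train_sizes <- n - fold_sizes

sort(unique(as.vector(fold_sizes)))    # 39 40
sort(unique(as.vector(train_sizes)))   # 356 357
```

For each candidate cp, the accuracy reported in the table is the average over the 10 held-out folds, and the cp with the highest average is selected.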
library(rpart)
stevenstreecv<- rpart(Reverse~ Circuit+Issue+Petitioner+Respondent+LowerCourt+Unconst, data=train, method = "class", cp=0.18)
predictCV<-predict(stevenstreecv, newdata = test, type = "class")
confmat<-table(test$Reverse, predictCV)
accuracy <- sum(diag(confmat))/sum(confmat)
accuracy
## [1] 0.7235294
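With cp = 0.18 chosen by cross-validation, the CART tree reaches a test-set accuracy of 0.7235, higher than the 0.6706 obtained by the random forest. Note that this compares two different model types on a single test split, so the gain cannot be attributed to the complexity parameter alone.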