Libraries needed for data processing and plotting:
library("dplyr")
library("magrittr")
library("ggplot2")
library("caTools")
library("rpart")
library("rpart.plot")
library("ROCR")
library("randomForest")
library("caret")
library("e1071")
In 2002, Andrew Martin, a professor of political science at Washington University in St. Louis, decided to predict Supreme Court decisions using a statistical model built from data. Together with his colleagues, he decided to test this model against a panel of experts.
Martin used a method called Classification and Regression Trees (CART).
Why not logistic regression?
The data covers cases from 1994 through 2001.
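In this period, the same nine justices sat on the Supreme Court (SCOTUS): Breyer, Ginsburg, Kennedy, O’Connor, Rehnquist, Scalia, Souter, Stevens, and Thomas.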
We will focus on predicting Justice Stevens’ decisions.
In this problem, our dependent variable is whether or not Justice Stevens voted to reverse the lower court decision.
This is a binary variable, Reverse, taking the value 1 if Justice Stevens voted to reverse the lower court decision and 0 if he voted to affirm it.
Our independent variables are six different properties of the case.
stevens_full <- read.csv("data/stevens.csv", stringsAsFactors = TRUE)
str(stevens_full)
## 'data.frame': 566 obs. of 9 variables:
## $ Docket : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
## $ Term : int 1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
## $ Circuit : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
## $ Issue : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
## $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
## $ Unconst : int 0 0 0 0 0 1 0 1 0 0 ...
## $ Reverse : int 1 1 1 1 1 0 1 1 1 1 ...
Two of the variables, Docket and Term, are not useful as predictors, so we remove them.
stevens <- stevens_full[, -c(1, 2)]
We can try to use logistic regression.
# model_LogRegr <- glm(Reverse ~ . - Docket - Term, data = stevens, family = binomial)
model_LogRegr <- glm(Reverse ~ ., data = stevens, family = binomial)
summary(model_LogRegr)
##
## Call:
## glm(formula = Reverse ~ ., family = binomial, data = stevens)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4748 -0.9222 0.4212 0.8805 2.2353
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.54410 2.08161 0.261 0.793797
## Circuit11th 0.99854 0.55292 1.806 0.070926 .
## Circuit1st 0.76049 0.72138 1.054 0.291783
## Circuit2nd 1.42816 0.58598 2.437 0.014801 *
## Circuit3rd 0.04341 0.61757 0.070 0.943960
## Circuit4th 1.48778 0.60068 2.477 0.013255 *
## Circuit5th 1.68235 0.56190 2.994 0.002753 **
## Circuit6th 1.48956 0.59973 2.484 0.013002 *
## Circuit7th 0.49603 0.55787 0.889 0.373925
## Circuit8th 0.26834 0.55420 0.484 0.628253
## Circuit9th 1.14267 0.47906 2.385 0.017069 *
## CircuitDC 0.61482 0.60133 1.022 0.306578
## CircuitFED 0.25634 0.64661 0.396 0.691786
## IssueCivilRights -0.03997 1.40847 -0.028 0.977360
## IssueCriminalProcedure -0.15254 1.41269 -0.108 0.914012
## IssueDueProcess 0.30429 1.43514 0.212 0.832085
## IssueEconomicActivity 0.13116 1.38507 0.095 0.924557
## IssueFederalTaxation -0.86018 1.54330 -0.557 0.577278
## IssueFederalismAndInterstateRelations 0.13246 1.44584 0.092 0.927005
## IssueFirstAmendment -0.75179 1.42809 -0.526 0.598592
## IssueJudicialPower -0.08953 1.38479 -0.065 0.948449
## IssuePrivacy 1.54454 1.61703 0.955 0.339492
## IssueUnions -0.44031 1.48202 -0.297 0.766390
## PetitionerBUSINESS 1.28748 1.31922 0.976 0.329094
## PetitionerCITY 0.21754 1.49548 0.145 0.884343
## PetitionerCRIMINAL.DEFENDENT 2.13822 1.33785 1.598 0.109987
## PetitionerEMPLOYEE 1.78408 1.37351 1.299 0.193973
## PetitionerEMPLOYER 0.65596 1.45800 0.450 0.652779
## PetitionerGOVERNMENT.OFFICIAL 1.30488 1.35869 0.960 0.336856
## PetitionerINJURED.PERSON 0.59468 1.51320 0.393 0.694323
## PetitionerOTHER 1.50279 1.29944 1.156 0.247482
## PetitionerPOLITICIAN 1.02443 1.39867 0.732 0.463904
## PetitionerSTATE 1.13253 1.36351 0.831 0.406202
## PetitionerUS 2.03763 1.36048 1.498 0.134206
## RespondentBUSINESS -1.71957 0.82967 -2.073 0.038211 *
## RespondentCITY -1.87125 1.11065 -1.685 0.092021 .
## RespondentCRIMINAL.DEFENDENT -3.05773 0.86421 -3.538 0.000403 ***
## RespondentEMPLOYEE -1.81206 0.91460 -1.981 0.047562 *
## RespondentEMPLOYER -0.90141 1.06608 -0.846 0.397815
## RespondentGOVERNMENT.OFFICIAL -2.56409 0.97349 -2.634 0.008440 **
## RespondentINJURED.PERSON -3.24236 1.03590 -3.130 0.001748 **
## RespondentOTHER -2.05311 0.79489 -2.583 0.009798 **
## RespondentPOLITICIAN -1.58367 0.95899 -1.651 0.098658 .
## RespondentSTATE -1.72107 0.91967 -1.871 0.061290 .
## RespondentUS -2.84583 0.88542 -3.214 0.001308 **
## LowerCourtliberal -1.16242 0.25050 -4.640 0.00000348 ***
## Unconst 0.08061 0.27981 0.288 0.773278
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 779.86 on 565 degrees of freedom
## Residual deviance: 622.81 on 519 degrees of freedom
## AIC: 716.81
##
## Number of Fisher Scoring iterations: 4
We get a model where some of the most significant variables are LowerCourt, several levels of Respondent, and several levels of Circuit.
While this tells us that the case being from the 2nd or 4th circuit courts is predictive of Justice Stevens reversing the case, and the lower court decision being liberal is predictive of Justice Stevens affirming the case, it’s difficult to understand which factors are more important due to things like the scales of the variables, and the possibility of multicollinearity.
It’s also difficult to quickly evaluate what the prediction would be for a new case.
So instead of logistic regression, Martin and his colleagues used a method called classification and regression trees, or CART.
This method builds what is called a tree by splitting on the values of the independent variables.
To predict the outcome for a new observation or case, you can follow the splits in the tree and at the end, you predict the most frequent outcome in the training set that followed the same path.
Some advantages of CART are that:
This plot shows sample data for two independent variables, x and y, and each data point is colored by the outcome variable, red or gray.
CART tries to split this data into subsets so that each subset is as pure or homogeneous as possible. The first three splits that CART would create are shown here.
Then the standard prediction made by a CART model is just a majority vote within each subset.
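The majority vote can be sketched in a few lines of base R. The data frame below is a made-up toy example (the subset labels `A`, `B`, `C` and the outcomes are assumptions, not the data from the plot); it only illustrates how a prediction is read off once each observation has been assigned to a subset.

```r
# Toy data: each observation has already been assigned to a subset (leaf)
# by some hypothetical splits; 'outcome' is the observed class.
toy <- data.frame(
  subset  = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C"),
  outcome = c("red", "red", "gray", "gray", "gray", "gray", "red", "red", "red", "red")
)

# Count outcomes within each subset (rows: subsets, columns: outcomes)
counts <- table(toy$subset, toy$outcome)

# The prediction for a subset is its most frequent outcome
majority <- colnames(counts)[apply(counts, 1, which.max)]
names(majority) <- rownames(counts)
majority
```

Any new observation that falls into subset `B` would then be predicted as the majority outcome stored for `B`.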
A CART model is represented by what we call a tree.
The tree for the splits we just generated is shown on the right.
The previous example shows a CART tree with three splits, but why not two, or four, or even five?
There are different ways to control how many splits are generated.
One way is to use the minbucket parameter, which sets the minimum number of observations in each bucket or subset.
The smaller minbucket is, the more splits will be generated. But if it’s too small, overfitting will occur.
On the other hand, if the minbucket parameter is too large, the model will be too simple and the accuracy will be poor.
In each subset of a CART tree, we have a bucket of observations, which may contain both possible outcomes. In the Supreme Court case, we will be classifying observations as either affirm or reverse, again a binary outcome, as in the example shown above.
In the example we classified each subset as either red or gray depending on the majority in that subset.
Instead of just taking the majority outcome to be the prediction, we can compute the percentage of data in a subset of each type of outcome. As an example, if we have a subset with 10 affirms and two reverses, then 83% of the data is affirm.
Then, just like in logistic regression, we can use a threshold value to obtain our prediction.
For this example, we would predict affirm with a threshold of 0.5 since the majority is affirm.
But if we increase that threshold to 0.9, we would predict reverse for this example.
Then by varying the threshold value, we can compute an ROC curve and compute an AUC value to evaluate our model.
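The arithmetic of the 10-affirm, 2-reverse example can be checked directly. This is a minimal sketch; the helper `predict_with_threshold()` is a hypothetical name introduced here, and it applies the threshold to the share of affirms, as the text does.

```r
n_affirm  <- 10
n_reverse <- 2

# Share of "affirm" outcomes in the subset
p_affirm <- n_affirm / (n_affirm + n_reverse)
round(p_affirm, 2)

# Predict "affirm" only when its share reaches the threshold
predict_with_threshold <- function(p_affirm, threshold) {
  ifelse(p_affirm >= threshold, "affirm", "reverse")
}

predict_with_threshold(p_affirm, 0.5)
predict_with_threshold(p_affirm, 0.9)
```

Each threshold yields one classification rule, and sweeping the threshold from 0 to 1 is what traces out the ROC curve.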
First we split our entire data set into training and test sets, with a 70/30 split:
set.seed(3000)
spl <- sample.split(stevens$Reverse, SplitRatio = 0.7)
Train <- subset(stevens, spl == TRUE)
Test <- subset(stevens, spl == FALSE)
As a reference we also fit a logistic regression model to the training data set:
# model_LogRegr <- glm(Reverse ~ . - Docket - Term, data = Train, family = binomial)
model_LogRegr <- glm(Reverse ~ ., data = Train, family = binomial)
summary(model_LogRegr)
##
## Call:
## glm(formula = Reverse ~ ., family = binomial, data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3832 -0.9186 0.3458 0.8470 2.2290
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.45008 2.34027 0.192 0.84749
## Circuit11th 1.11621 0.65391 1.707 0.08783 .
## Circuit1st 0.45867 1.16360 0.394 0.69344
## Circuit2nd 2.31471 0.72699 3.184 0.00145 **
## Circuit3rd 0.46020 0.74762 0.616 0.53819
## Circuit4th 1.95459 0.72631 2.691 0.00712 **
## Circuit5th 2.03668 0.67009 3.039 0.00237 **
## Circuit6th 1.48370 0.71501 2.075 0.03798 *
## Circuit7th 0.69997 0.68327 1.024 0.30563
## Circuit8th 0.78445 0.67347 1.165 0.24411
## Circuit9th 1.34112 0.58353 2.298 0.02154 *
## CircuitDC 0.57449 0.72694 0.790 0.42936
## CircuitFED 0.63139 0.74354 0.849 0.39579
## IssueCivilRights 0.15302 1.41314 0.108 0.91377
## IssueCriminalProcedure -0.08072 1.42975 -0.056 0.95498
## IssueDueProcess 0.48455 1.44254 0.336 0.73694
## IssueEconomicActivity 0.21192 1.39093 0.152 0.87891
## IssueFederalTaxation -1.76887 1.71259 -1.033 0.30167
## IssueFederalismAndInterstateRelations 0.28026 1.48114 0.189 0.84992
## IssueFirstAmendment -0.50309 1.45309 -0.346 0.72918
## IssueJudicialPower -0.11009 1.37988 -0.080 0.93641
## IssuePrivacy 2.39866 1.79602 1.336 0.18170
## IssueUnions -0.78603 1.52914 -0.514 0.60723
## PetitionerBUSINESS 1.36136 1.60051 0.851 0.39500
## PetitionerCITY -0.19086 1.79841 -0.106 0.91548
## PetitionerCRIMINAL.DEFENDENT 2.01263 1.61931 1.243 0.21391
## PetitionerEMPLOYEE 2.09290 1.67320 1.251 0.21099
## PetitionerEMPLOYER 0.68498 1.75856 0.390 0.69690
## PetitionerGOVERNMENT.OFFICIAL 0.35138 1.62596 0.216 0.82890
## PetitionerINJURED.PERSON 1.46768 1.84218 0.797 0.42562
## PetitionerOTHER 1.40732 1.57153 0.896 0.37051
## PetitionerPOLITICIAN 0.77215 1.71668 0.450 0.65286
## PetitionerSTATE 0.82069 1.62757 0.504 0.61409
## PetitionerUS 1.45574 1.62834 0.894 0.37132
## RespondentBUSINESS -1.73754 0.98272 -1.768 0.07705 .
## RespondentCITY -2.63295 1.33944 -1.966 0.04933 *
## RespondentCRIMINAL.DEFENDENT -3.00803 1.01539 -2.962 0.00305 **
## RespondentEMPLOYEE -2.32134 1.09025 -2.129 0.03324 *
## RespondentEMPLOYER -0.92216 1.45769 -0.633 0.52699
## RespondentGOVERNMENT.OFFICIAL -2.38169 1.21421 -1.962 0.04982 *
## RespondentINJURED.PERSON -3.46752 1.20843 -2.869 0.00411 **
## RespondentOTHER -2.24797 0.94253 -2.385 0.01708 *
## RespondentPOLITICIAN -1.67445 1.12146 -1.493 0.13541
## RespondentSTATE -1.36931 1.09153 -1.254 0.20966
## RespondentUS -3.02756 1.04803 -2.889 0.00387 **
## LowerCourtliberal -0.95835 0.30572 -3.135 0.00172 **
## Unconst -0.18029 0.35504 -0.508 0.61159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 545.70 on 395 degrees of freedom
## Residual deviance: 428.71 on 349 degrees of freedom
## AIC: 522.71
##
## Number of Fisher Scoring iterations: 5
predict_LogRegr_Test <- predict(model_LogRegr, type = "response", newdata = Test)
cmat_LR <- table(Test$Reverse, predict_LogRegr_Test > 0.5)
cmat_LR
##
## FALSE TRUE
## 0 47 30
## 1 27 66
accu_LR <- (cmat_LR[1,1] + cmat_LR[2,2])/sum(cmat_LR)
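As a check on the accuracy computation, the confusion matrix printed above can be re-entered by hand. The sketch below uses only base R; the object names are illustrative, and sensitivity and specificity are added as a by-product of the same table.

```r
# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cmat <- matrix(c(47, 27, 30, 66), nrow = 2,
               dimnames = list(actual = c("0", "1"),
                               predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cmat)) / sum(cmat)            # share of correct predictions
sensitivity <- cmat["1", "TRUE"]  / sum(cmat["1", ])  # true positive rate
specificity <- cmat["0", "FALSE"] / sum(cmat["0", ])  # true negative rate

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
```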
model_CART <- rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst,
                    data = Train,
                    method = "class",
                    minbucket = 25)
A couple of notes about the parameters used in the function call:
method = "class" tells rpart to build a classification tree, instead of a regression tree.
minbucket = 25 limits the tree so that it does not overfit to our training set.
The model can be represented as a decision tree:
# prp() comes from 'rpart.plot'
prp(model_CART)
Comparing this to a logistic regression model, we can see that it is very interpretable.
A CART tree is a series of decision rules which can easily be explained.
predict_CART_Test <- predict(model_CART, newdata = Test, type = "class")
We need to give type = "class" if we want the majority class predictions.
We will see shortly how we can leave this argument out and get probabilities from our CART model instead.
cmat_CART <- table(Test$Reverse, predict_CART_Test)
cmat_CART
## predict_CART_Test
## 0 1
## 0 41 36
## 1 22 71
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
A couple of interesting remarks:
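A baseline model that always predicts Reverse (the most common outcome) has an accuracy of 54.7%.
So our CART model, with an accuracy of 65.9%, clearly beats the baseline and is competitive with the logistic regression model (66.5%).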
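The baseline figure can be reproduced from the row sums of the CART confusion matrix above, which give the number of actual affirms and reversals in the test set. A minimal sketch, with the counts typed in by hand:

```r
# Test set class counts, read off the rows of the confusion matrix above
actual_counts <- c(affirm = 41 + 36, reverse = 22 + 71)

# A baseline that always predicts the most common class
baseline_accuracy <- max(actual_counts) / sum(actual_counts)

# CART accuracy from the diagonal of the same matrix
cart_accuracy <- (41 + 71) / sum(actual_counts)

round(c(baseline = baseline_accuracy, cart = cart_accuracy), 3)
```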
We need to generate our predictions again, this time without the type = "class" argument.
model_CART_ROC <- predict(model_CART, newdata = Test)
We can take a look at what is the output of this prediction:
head(model_CART_ROC, 10)
## 0 1
## 1 0.3035714 0.6964286
## 3 0.3035714 0.6964286
## 4 0.4000000 0.6000000
## 6 0.4000000 0.6000000
## 8 0.4000000 0.6000000
## 21 0.3035714 0.6964286
## 32 0.5517241 0.4482759
## 36 0.5517241 0.4482759
## 40 0.3035714 0.6964286
## 42 0.5517241 0.4482759
For each observation in the test set, it gives two numbers which can be thought of as the probability of outcome 0 and the probability of outcome 1.
More concretely, each test set observation is classified into a subset, or bucket, of our CART tree. These numbers give the percentage of training set data in that subset with outcome 0 and the percentage of data in the training set in that subset with outcome 1.
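To make this concrete, here is a sketch of how such a pair of numbers arises from one bucket. The counts (17 affirms and 39 reversals) are an assumption chosen only because they reproduce the proportions 0.3035714 and 0.6964286 seen in the output; the actual bucket counts are not shown in this document.

```r
# Hypothetical training outcomes in one bucket: 17 affirms (0), 39 reversals (1)
bucket_outcomes <- c(rep(0, 17), rep(1, 39))

# Class proportions in the bucket; every test case landing here gets these values
bucket_probs <- prop.table(table(bucket_outcomes))
bucket_probs
```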
We will use the second column as our probabilities to generate an ROC curve.
First we use the prediction() function with first argument the second column of model_CART_ROC, and second argument the true outcome values, Test$Reverse.
We pass the output of prediction() to performance(), to which we also give two arguments for what we want on the Y and X axes of our ROC curve: true positive rate and false positive rate, respectively.
pred <- prediction(model_CART_ROC[,2], Test$Reverse)
perf <- performance(pred, "tpr", "fpr")
And the plot
plot(perf)
auc <- as.numeric(performance(pred, "auc")@y.values)
The AUC of the CART model is 0.6927.
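One way to read an AUC value: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes this by brute force on made-up scores and labels (not the Stevens data); `auc_by_pairs()` is a hypothetical helper, and for the same inputs its result should agree with the value reported by ROCR.

```r
# Toy scores and labels (hypothetical)
scores <- c(0.70, 0.70, 0.60, 0.60, 0.45, 0.45, 0.70, 0.45)
labels <- c(1,    1,    1,    0,    0,    1,    0,    0)

auc_by_pairs <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  # Compare every positive with every negative; ties count as half a win
  wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(wins)
}

auc_toy <- auc_by_pairs(scores, labels)
auc_toy
```

Because a CART model only produces as many distinct scores as it has buckets, ties are common, which is why the half-credit rule matters here.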
The Random Forests method was designed to improve the prediction accuracy of CART and works by building a large number of CART trees.
Unfortunately, this makes the method less interpretable than CART, so often you need to decide if you value the interpretability or the increase in accuracy more.
To make a prediction for a new observation, each tree in the forest votes on the outcome and we pick the outcome that receives the majority of the votes.
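The voting step can be sketched as follows. The vote matrix is invented for illustration (three new cases, five trees); a real forest would have one column per tree, typically hundreds.

```r
# Hypothetical votes: rows are new observations, columns are trees (1 = reverse)
votes <- matrix(c(1, 0, 1, 1, 0,
                  0, 0, 1, 0, 0,
                  1, 1, 1, 0, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("case", 1:3), paste0("tree", 1:5)))

# Share of trees voting for outcome 1, then the majority rule
share_reverse <- rowMeans(votes)
forest_prediction <- as.integer(share_reverse > 0.5)
forest_prediction
```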
How does Random Forests build many CART trees?
We cannot just run CART multiple times because it would create the same tree every time. To prevent this, Random Forests …
Train$Reverse <- as.factor(Train$Reverse)
Test$Reverse <- as.factor(Test$Reverse)
model_RF <- randomForest(Reverse ~ ., data = Train, ntree = 200, nodesize = 25)
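Some important parameter values need to be selected:
nodesize, the minimum number of observations in a subset, corresponds to the minbucket parameter from CART.
ntree is the number of trees to build.
A nice thing about random forests is that it is not as sensitive to the parameter values as CART is.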
predict_RF_Test <- predict(model_RF, newdata = Test)
cmat_RF <- table(Test$Reverse, predict_RF_Test)
cmat_RF
## predict_RF_Test
## 0 1
## 0 41 36
## 1 18 75
accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)
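Recall that our logistic regression model had an accuracy of 66.5% and our CART model had an accuracy of 65.9%; the random forest above reaches 68.2%.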
Sometimes you will see a smaller improvement in accuracy and sometimes you’ll see that random forests can significantly improve in accuracy over CART.
Keep in mind that Random Forests has a random component, so you may get a different confusion matrix than the one shown here.
In CART, the value of minbucket can affect the model’s out-of-sample accuracy.
As we discussed above, if minbucket is too small, over-fitting might occur. On the other hand, if minbucket is too large, the model might be too simple.
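The effect of minbucket on tree size can be seen on any small data set. The sketch below uses kyphosis, an example data set shipped with the rpart package, rather than the Stevens data; the helper `count_leaves()` and the two minbucket values are assumptions made for illustration.

```r
library(rpart)  # also provides the kyphosis example data

# Number of terminal nodes (buckets) in a fitted tree
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

fit_small <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class", minbucket = 2)
fit_large <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class", minbucket = 25)

c(minbucket_2 = count_leaves(fit_small),
  minbucket_25 = count_leaves(fit_large))
```

Comparing the two leaf counts shows how a larger minbucket restricts the splits the tree is allowed to make.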
So how should we set this parameter value?
We could select the value that gives the best testing set accuracy, but this would not be right. The idea of the testing set is to measure model performance on data the model has never seen before. By picking the value of minbucket to get the best test set performance, the testing set was implicitly used to generate the model.
Instead, we will use a method called K-fold Cross Validation, which is one way to properly select the parameter value.
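This method works by splitting the training set into \(k\) equally sized subsets, or folds. Each fold in turn is held out as a validation set while a model is built on the remaining \(k-1\) folds for each candidate parameter value, and predictions are made on the held-out fold.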
Ultimately cross validation builds many models, one for each fold and possible parameter value. Then, for each candidate parameter value, and for each fold, we can compute the accuracy of the model.
This plot shows the possible parameter values on the X-axis, and the accuracy of the model on the Y-axis, with one line for each of the \(k\) repeats of the experiment.
We then average the accuracy over the \(k\) folds to determine the final parameter value that we want to use.
Typically, the behavior looks like the curves shown in the plot:
In this case, we would pick a parameter value around 6, because it leads to the maximum average accuracy over all parameter values.
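The procedure can be written out by hand to see what a cross validation routine does internally. This is only a sketch under stated assumptions: it uses the kyphosis data shipped with rpart instead of the Stevens data, 5 folds, and an arbitrary short list of cp values as the tuning parameter.

```r
library(rpart)

set.seed(1)                                   # fold assignment is random
k <- 5
cp_values <- c(0.01, 0.05, 0.1, 0.2)          # candidate parameter values
folds <- sample(rep(1:k, length.out = nrow(kyphosis)))

cv_accuracy <- sapply(cp_values, function(cp) {
  fold_acc <- sapply(1:k, function(i) {
    train_i <- kyphosis[folds != i, ]         # k - 1 folds for fitting
    valid_i <- kyphosis[folds == i, ]         # held-out fold for validation
    fit  <- rpart(Kyphosis ~ ., data = train_i, method = "class", cp = cp)
    pred <- predict(fit, newdata = valid_i, type = "class")
    mean(pred == valid_i$Kyphosis)            # accuracy on the held-out fold
  })
  mean(fold_acc)                              # average over the k folds
})
names(cv_accuracy) <- cp_values

best_cp <- cp_values[which.max(cv_accuracy)]
cv_accuracy
best_cp
```

Note that the test set is never touched: every model is fitted and scored on pieces of the training data only.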
So far, we have used the parameter minbucket to limit our tree in R.
When we use cross validation in R, we will use a parameter called cp instead, the complexity parameter.
It is like Adjusted \(R^2\) for linear regression, and AIC for logistic regression, in that it measures the trade-off between model complexity and accuracy on the training set.
A smaller cp value leads to a bigger tree, so a smaller cp value might over-fit the model to the training set. But a cp value that is too large might build a model that is too simple.
numFolds <- trainControl(method = "cv", number = 10)
cpGrid <- expand.grid( .cp = seq(0.01, 0.5, 0.01) )
This will define our cp parameters to test as numbers from 0.01 to 0.5, in increments of 0.01.
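A quick base R check of what this grid contains. The object names `cp_candidates` and `cp_grid` are illustrative; `seq()` and `expand.grid()` are the same base functions used above.

```r
cp_candidates <- seq(0.01, 0.5, 0.01)
length(cp_candidates)    # number of cp values that will be tried
range(cp_candidates)     # smallest and largest candidate

# expand.grid() turns the vector into a one-column data frame,
# one row per candidate value
cp_grid <- expand.grid(.cp = cp_candidates)
dim(cp_grid)
```

With 10 folds, each of these candidates is evaluated on 10 fitted trees.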
Perform the cross validation:
save_CV <- train(Reverse ~ .,
                 data = Train,
                 method = "rpart",
                 trControl = numFolds,
                 tuneGrid = cpGrid)
save_CV
## CART
##
## 396 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 357, 356, 356, 356, 357, 356, ...
##
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.01 0.6235897 0.22628099 0.082740495 0.16258461
## 0.02 0.6363462 0.25514793 0.085621687 0.17426118
## 0.03 0.6286538 0.24340943 0.084334361 0.17091851
## 0.04 0.6362179 0.26608908 0.090740922 0.18316429
## 0.05 0.6437179 0.28369695 0.085787232 0.16936166
## 0.06 0.6437179 0.28369695 0.085787232 0.16936166
## 0.07 0.6437179 0.28369695 0.085787232 0.16936166
## 0.08 0.6437179 0.28369695 0.085787232 0.16936166
## 0.09 0.6437179 0.28369695 0.085787232 0.16936166
## 0.10 0.6437179 0.28369695 0.085787232 0.16936166
## 0.11 0.6437179 0.28369695 0.085787232 0.16936166
## 0.12 0.6437179 0.28369695 0.085787232 0.16936166
## 0.13 0.6437179 0.28369695 0.085787232 0.16936166
## 0.14 0.6437179 0.28369695 0.085787232 0.16936166
## 0.15 0.6437179 0.28369695 0.085787232 0.16936166
## 0.16 0.6437179 0.28369695 0.085787232 0.16936166
## 0.17 0.6437179 0.28369695 0.085787232 0.16936166
## 0.18 0.6187179 0.22410099 0.070193397 0.15114731
## 0.19 0.5962179 0.17025484 0.046642368 0.11926357
## 0.20 0.5962179 0.17025484 0.046642368 0.11926357
## 0.21 0.5808333 0.13215960 0.035457791 0.10440039
## 0.22 0.5705769 0.10438182 0.030615192 0.09811583
## 0.23 0.5479487 0.02195219 0.007861390 0.04650747
## 0.24 0.5453846 0.01000000 0.005958436 0.03162278
## 0.25 0.5453846 0.00000000 0.005958436 0.00000000
## 0.26 0.5453846 0.00000000 0.005958436 0.00000000
## 0.27 0.5453846 0.00000000 0.005958436 0.00000000
## 0.28 0.5453846 0.00000000 0.005958436 0.00000000
## 0.29 0.5453846 0.00000000 0.005958436 0.00000000
## 0.30 0.5453846 0.00000000 0.005958436 0.00000000
## 0.31 0.5453846 0.00000000 0.005958436 0.00000000
## 0.32 0.5453846 0.00000000 0.005958436 0.00000000
## 0.33 0.5453846 0.00000000 0.005958436 0.00000000
## 0.34 0.5453846 0.00000000 0.005958436 0.00000000
## 0.35 0.5453846 0.00000000 0.005958436 0.00000000
## 0.36 0.5453846 0.00000000 0.005958436 0.00000000
## 0.37 0.5453846 0.00000000 0.005958436 0.00000000
## 0.38 0.5453846 0.00000000 0.005958436 0.00000000
## 0.39 0.5453846 0.00000000 0.005958436 0.00000000
## 0.40 0.5453846 0.00000000 0.005958436 0.00000000
## 0.41 0.5453846 0.00000000 0.005958436 0.00000000
## 0.42 0.5453846 0.00000000 0.005958436 0.00000000
## 0.43 0.5453846 0.00000000 0.005958436 0.00000000
## 0.44 0.5453846 0.00000000 0.005958436 0.00000000
## 0.45 0.5453846 0.00000000 0.005958436 0.00000000
## 0.46 0.5453846 0.00000000 0.005958436 0.00000000
## 0.47 0.5453846 0.00000000 0.005958436 0.00000000
## 0.48 0.5453846 0.00000000 0.005958436 0.00000000
## 0.49 0.5453846 0.00000000 0.005958436 0.00000000
## 0.50 0.5453846 0.00000000 0.005958436 0.00000000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.17.
We get a table describing the cross validation accuracy for different cp parameters.
The first column gives the cp parameter that was tested, and the second column gives the cross validation accuracy for that cp value.
The accuracy starts lower, and then increases, and then will start decreasing again, as we saw in the example shown above.
plot(save_CV)
Let’s create a new CART model with this value of cp, instead of the minbucket parameter.
model_CART_CV <- rpart(Reverse ~ .,
                       data = Train,
                       method = "class",
                       cp = 0.17)
predict_CART_CV_Test <- predict(model_CART_CV, newdata = Test, type = "class")
cmat_CART_CV <- table(Test$Reverse, predict_CART_CV_Test)
cmat_CART_CV
## predict_CART_CV_Test
## 0 1
## 0 59 18
## 1 29 64
accu_CART_CV <- (cmat_CART_CV[1,1] + cmat_CART_CV[2,2])/sum(cmat_CART_CV)
Recall that our previous CART model had an accuracy of 65.9%.
What does this decision tree look like?
prp(model_CART_CV)
Surprisingly, the best cross-validated CART model achieves an accuracy of 72.4% with just one split, based on the value of the LowerCourt variable.
Cross validation helps us make sure we are selecting a good parameter value, and often this will significantly increase the accuracy.
If we had already happened to select a good parameter value, then the accuracy might not have increased that much. Nevertheless, by using cross validation, we can be more confident that we are selecting a sensible parameter value.
The model built by Martin and his colleagues had two stages of CART trees.
It turns out that about 50% of Supreme Court cases result in a unanimous decision, so the first stage alone was a nice first step to detect the easier cases.
This is the decision tree for Justice O’Connor:
And this is the decision tree for Justice Souter:
This shows an unusual property of the CART trees that Martin and his colleagues developed.
They use predictions for some trees as independent variables for other trees.
In this tree, the first split is whether or not Justice Ginsburg’s predicted decision is liberal. So we have to run Justice Ginsburg’s CART tree first, see what the prediction is, and then use that as input for Justice Souter’s tree.
If we predict that Justice Ginsburg will make a liberal decision, then Justice Souter will probably make a liberal decision too, and vice versa.
Martin and his colleagues recruited 83 legal experts:
For the 68 cases in October 2002, the predictions were made, and at the end of the month the results were computed.
For predicting the overall decision that was made by the Supreme Court,
So the models had a significant edge over the experts in predicting the overall case outcomes.
However, when the predictions were run for individual justices, the model and the experts performed very similarly, with an accuracy of about 68%.
For some justices, the model performed better, and for some justices, the experts performed better.