Judge, Jury, and Classifier

The Supreme Court of the United States

  • Consists of nine judges (“justices”), appointed by the President
    • Justices are typically distinguished judges, law professors, or state and federal attorneys
  • The Supreme Court of the United States (SCOTUS) decides the most difficult and controversial cases
    • Often involve interpretation of the Constitution
    • Significant social, political and economic consequences

Notable SCOTUS Decisions

  • Wickard v. Filburn (1942)
    • Allowed Congress to intervene in industrial/economic activity
  • Roe v. Wade (1973)
    • Legalized abortion
  • Bush v. Gore (2000)
    • Decided outcome of presidential election
  • National Federation of Independent Business v. Sebelius (2012)
    • Upheld the Patient Protection and Affordable Care Act’s (“ObamaCare”) requirement that individuals buy health insurance

Predicting Supreme Court Cases

  • Legal academics and political scientists regularly make predictions of SCOTUS decisions from detailed studies of cases and individual justices

  • In 2002, Andrew Martin, a professor of political science at Washington University in St. Louis, decided to instead predict decisions using a statistical model built from data

  • Together with his colleagues, he decided to test this model against a panel of experts

  • Martin used a method called Classification and Regression Trees (CART)

  • Why not logistic regression?
    • Logistic regression models are generally not interpretable
    • Model coefficients indicate importance and relative effect of variables, but do not give a simple explanation of how decision is made

Data

  • Cases from 1994 through 2001
  • In this period, the same nine justices presided over SCOTUS
    • Breyer, Ginsburg, Kennedy, O’Connor, Rehnquist (Chief Justice), Scalia, Souter, Stevens, Thomas
    • Rare data set - longest period of time with the same set of justices in over 180 years
  • We will focus on predicting Justice Stevens’ decisions
    • Started out moderate, but became more liberal
    • Self-proclaimed conservative

Variables

  • Dependent Variable: Did Justice Stevens vote to reverse the lower court decision? 1 = reverse, 0 = affirm

  • Independent Variables: properties of the case
    • Circuit court of origin
    • Issue area of case
    • Type of petitioner, type of respondent
    • Ideological direction of lower court decision
    • Whether petitioner argued that a law/practice was unconstitutional

Logistic Regression for Justice Stevens

  • Some significant variables and their coefficients
    • Case is from 2nd circuit court: +1.66
    • Case is from 4th circuit court: +2.82
    • Lower court decision is liberal: -1.22
  • This is complicated
    • Difficult to understand which factors are important
    • Difficult to quickly evaluate what prediction is for a new case
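
For reference, a model of this kind can be fit in R with glm; a minimal sketch, assuming the stevens.csv data and Train split built in the R section at the end:

# Logistic regression for Justice Stevens' votes (sketch)
StevensLog = glm(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, family = binomial)
summary(StevensLog)  # one coefficient per factor level, e.g. Circuit2nd, Circuit4th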

Classification and Regression Trees

  • Build a tree by splitting on variables
  • To predict the outcome for an observation, follow the splits and at the end, predict the most frequent outcome
  • Does not assume a linear model
  • Interpretable

Splits in CART

Final Tree

When Does CART Stop Splitting?

  • There are different ways to control how many splits are generated
    • One way is by setting a lower bound for the number of points in each subset
  • In R, the parameter that controls this is minbucket
    • The smaller it is, the more splits will be generated
    • If it is too small, overfitting will occur
    • If it is too large, model will be too simple and accuracy will be poor
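
A rough sketch of this effect, assuming the Train set and model formula used in the R section below: smaller minbucket values permit more splits.

# Smaller minbucket allows more splits; larger minbucket forces fewer
library(rpart)
SmallBucket = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method = "class", minbucket = 5)
LargeBucket = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method = "class", minbucket = 100)
# Count internal (non-leaf) nodes, i.e. the number of splits
sum(SmallBucket$frame$var != "<leaf>")
sum(LargeBucket$frame$var != "<leaf>")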

Predictions from CART

  • In each subset, we have a bucket of observations, which may contain both outcomes (i.e. affirm and reverse)

  • Compute the percentage of observations of each type in the subset
    • Example: 10 affirm, 2 reverse -> 10/(10+2) ≈ 0.83 for affirm
  • Just like in logistic regression, we can threshold this percentage to obtain a prediction
    • A threshold of 0.5 corresponds to picking the most frequent outcome
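
A minimal sketch of this thresholding, assuming the StevensTree model and Test set built in the R section below:

# Class probabilities from a fitted CART model, then threshold at 0.5
Probs = predict(StevensTree, newdata = Test)   # columns: P(affirm), P(reverse)
PredReverse = as.numeric(Probs[, 2] > 0.5)     # 0.5 picks the most frequent outcome
table(Test$Reverse, PredReverse)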

ROC curve for CART

  • Vary the threshold to obtain an ROC curve

Random Forests

  • Designed to improve prediction accuracy of CART

  • Works by building a large number of CART trees
    • Makes the model less interpretable
  • To make a prediction for a new observation, each tree “votes” on the outcome, and we pick the outcome that receives the majority of the votes

Building Many Trees

  • Each tree can split on only a random subset of the variables

  • Each tree is built from a “bagged”/“bootstrapped” sample of the data
    • Select observations randomly with replacement
    • Example - original data: 1 2 3 4 5
    • New “data”:

    2 4 5 2 1 -> 1st tree

    3 5 1 5 2 -> 2nd tree
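
Bootstrapped samples like these can be generated in R with sample():

# Draw two bootstrapped "data sets" from the original observations 1..5
set.seed(1)
original = 1:5
sample(original, size = 5, replace = TRUE)  # sample for the 1st tree
sample(original, size = 5, replace = TRUE)  # sample for the 2nd tree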

Random Forest Parameters

  • Minimum number of observations in a subset
    • In R, this is controlled by the nodesize parameter
    • Smaller nodesize may take longer in R
  • Number of trees
    • In R, this is the ntree parameter
    • Should not be too small: with too few trees, the bagging procedure may leave some observations out of every tree
    • More trees take longer to build

Parameter Selection

  • In CART, the value of “minbucket” can affect the model’s out-of-sample accuracy

  • How should we set this parameter?

  • We could select the value that gives the best testing set accuracy
    • This is not right! The test set should only be used to estimate final performance; using it to tune parameters leaks information and inflates the accuracy estimate

K-fold Cross-Validation

  • Given a training set, split it into k pieces (“folds”)
  • For each candidate parameter value, build a model on k-1 folds and test it on the remaining fold (the “validation set”)
  • Repeat for each of the k folds
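
A hand-rolled sketch for a single candidate parameter value (the caret package used later automates this; Train and the model formula are assumed from the R section below, and a fold could in principle contain factor levels unseen during training):

# 10-fold cross-validated accuracy for a CART model with minbucket = 25
library(rpart)
k = 10
set.seed(100)
folds = sample(rep(1:k, length.out = nrow(Train)))  # random fold assignment
acc = rep(NA, k)
for (i in 1:k) {
  fit = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train[folds != i, ], method = "class", minbucket = 25)
  pred = predict(fit, newdata = Train[folds == i, ], type = "class")
  acc[i] = mean(pred == Train$Reverse[folds == i])
}
mean(acc)  # estimate of out-of-sample accuracy for this parameter value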

Output of k-fold Cross-Validation

Cross-Validation in R

  • Before, we limited our tree using minbucket

  • When we use cross-validation in R, we’ll use a parameter called cp instead
    • Complexity Parameter
  • Like Adjusted R² and AIC
    • Measures trade-off between model complexity and accuracy on the training set
  • Smaller cp leads to a bigger tree (might overfit)
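
rpart also reports its own cross-validated error over a grid of cp values; a sketch, assuming the Train set from the R section below:

# Grow a deliberately large tree, then inspect cross-validated error by cp
BigTree = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method = "class", cp = 0.001)
printcp(BigTree)  # table of cp values vs. cross-validated error
plotcp(BigTree)   # choose a cp near the minimum of the curve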

Martin’s Model

  • Used 628 previous SCOTUS cases between 1994 and 2001

  • Made predictions for the 68 cases that would be decided in October 2002, before the term started

  • Two stage approach based on CART:
    • First stage: one tree to predict a unanimous liberal decision, another tree to predict a unanimous conservative decision
      • If the two trees give conflicting predictions, or both predict no, move to the next stage
    • Second stage: predict the decision of each individual justice, and use the majority decision as the prediction
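
The second stage’s majority rule is just a vote count; a toy sketch with made-up votes:

# Nine predicted votes; the majority becomes the case prediction
votes = c("reverse", "affirm", "reverse", "reverse", "affirm", "reverse", "affirm", "reverse", "reverse")
names(which.max(table(votes)))  # "reverse"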

Example Trees of Justices

The Experts

  • Martin and his colleagues recruited 83 legal experts
    • 71 academics and 12 attorneys
    • 38 previously clerked for a Supreme Court justice, 33 were chaired professors and 5 were current or former law school deans
  • Experts were asked to predict only within their area of expertise; more than one expert was assigned to each case

  • Allowed to consider any source of information, but not allowed to communicate with each other regarding predictions

The Results

  • For the 68 cases in October 2002:

  • Overall case predictions:
    • Model accuracy: 75%
    • Experts’ accuracy: 59%
  • Individual justice predictions:
    • Model accuracy: 67%
    • Experts’ accuracy: 68%

The Analytics Edge

  • Predicting Supreme Court decisions is very valuable to firms, politicians and non-governmental organizations

  • A model that predicts these decisions is both more accurate and faster than experts

    • The CART model, based on very high-level details of a case, beats experts who can process much more detailed and complex information

Judge, Jury, and Classifier in R

Read in the data

# Read in the data
stevens = read.csv("stevens.csv")
# Output structure
str(stevens)
## 'data.frame':    566 obs. of  9 variables:
##  $ Docket    : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
##  $ Term      : int  1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
##  $ Circuit   : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
##  $ Issue     : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
##  $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Unconst   : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Reverse   : int  1 1 1 1 1 0 1 1 1 1 ...

Split the data

# Split the data
library(caTools)
set.seed(3000)
spl = sample.split(stevens$Reverse, SplitRatio = 0.7)
Train = subset(stevens, spl==TRUE)
Test = subset(stevens, spl==FALSE)
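
A sanity check worth adding here (not in the original walkthrough): the baseline accuracy from always predicting the test set’s more frequent outcome. The class counts can be read off the confusion matrices below.

# Baseline: always predict the more common outcome
table(Test$Reverse)                    # 77 affirm (0), 93 reverse (1)
max(table(Test$Reverse)) / nrow(Test)  # about 0.547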

Load CART tree packages

# Load CART tree packages
library(rpart)
library(rpart.plot)

Implement CART Model

# CART model
StevensTree = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method="class", minbucket=25)
# Plot CART tree
prp(StevensTree)

Make predictions

# Make predictions
PredictCART = predict(StevensTree, newdata = Test, type = "class")
# Build confusion matrix; kable() from the knitr package formats it as a table
library(knitr)
z = table(Test$Reverse, PredictCART)
kable(z)
|   |  0|  1|
|:--|--:|--:|
|0  | 41| 36|
|1  | 22| 71|
# Compute Accuracy
sum(diag(z))/sum(z)
## [1] 0.6588235

ROC Curve

# ROC curve
library(ROCR)
# Make predictions on test set
PredictROC = predict(StevensTree, newdata = Test)
PredictROC
##             0         1
## 1   0.3035714 0.6964286
## 3   0.3035714 0.6964286
## 4   0.4000000 0.6000000
## 6   0.4000000 0.6000000
## 8   0.4000000 0.6000000
## 21  0.3035714 0.6964286
## 32  0.5517241 0.4482759
## 36  0.5517241 0.4482759
## 40  0.3035714 0.6964286
## 42  0.5517241 0.4482759
## 46  0.5517241 0.4482759
## 47  0.4000000 0.6000000
## 53  0.5517241 0.4482759
## 55  0.3035714 0.6964286
## 59  0.1842105 0.8157895
## 60  0.4000000 0.6000000
## 66  0.4000000 0.6000000
## 67  0.4000000 0.6000000
## 68  0.1842105 0.8157895
## 72  0.3035714 0.6964286
## 79  0.3035714 0.6964286
## 80  0.5517241 0.4482759
## 87  0.7600000 0.2400000
## 88  0.1842105 0.8157895
## 92  0.7910448 0.2089552
## 95  0.7910448 0.2089552
## 102 0.7910448 0.2089552
## 106 0.7910448 0.2089552
## 110 0.7910448 0.2089552
## 112 0.7910448 0.2089552
## 114 0.7910448 0.2089552
## 125 0.7910448 0.2089552
## 130 0.7910448 0.2089552
## 134 0.7910448 0.2089552
## 138 0.7910448 0.2089552
## 145 0.7910448 0.2089552
## 146 0.7910448 0.2089552
## 148 0.3035714 0.6964286
## 149 0.3035714 0.6964286
## 152 0.3035714 0.6964286
## 154 0.5517241 0.4482759
## 161 0.7878788 0.2121212
## 164 0.4000000 0.6000000
## 167 0.7878788 0.2121212
## 169 0.3035714 0.6964286
## 171 0.7600000 0.2400000
## 175 0.5517241 0.4482759
## 176 0.0754717 0.9245283
## 177 0.0754717 0.9245283
## 178 0.0754717 0.9245283
## 180 0.0754717 0.9245283
## 187 0.0754717 0.9245283
## 188 0.7878788 0.2121212
## 190 0.0754717 0.9245283
## 192 0.0754717 0.9245283
## 196 0.0754717 0.9245283
## 197 0.3035714 0.6964286
## 208 0.3035714 0.6964286
## 210 0.0754717 0.9245283
## 216 0.7910448 0.2089552
## 218 0.7910448 0.2089552
## 220 0.0754717 0.9245283
## 224 0.4000000 0.6000000
## 226 0.7600000 0.2400000
## 227 0.4000000 0.6000000
## 228 0.7878788 0.2121212
## 235 0.3035714 0.6964286
## 239 0.7878788 0.2121212
## 242 0.7600000 0.2400000
## 244 0.7600000 0.2400000
## 247 0.4000000 0.6000000
## 255 0.3035714 0.6964286
## 260 0.5517241 0.4482759
## 261 0.7600000 0.2400000
## 264 0.3035714 0.6964286
## 265 0.3035714 0.6964286
## 268 0.3035714 0.6964286
## 272 0.5517241 0.4482759
## 273 0.3035714 0.6964286
## 274 0.5517241 0.4482759
## 275 0.3035714 0.6964286
## 282 0.4000000 0.6000000
## 286 0.7878788 0.2121212
## 291 0.4000000 0.6000000
## 294 0.1842105 0.8157895
## 305 0.4000000 0.6000000
## 306 0.3035714 0.6964286
## 308 0.7878788 0.2121212
## 311 0.7878788 0.2121212
## 313 0.7878788 0.2121212
## 314 0.7878788 0.2121212
## 315 0.7878788 0.2121212
## 317 0.7878788 0.2121212
## 320 0.7878788 0.2121212
## 321 0.7878788 0.2121212
## 323 0.4000000 0.6000000
## 331 0.3035714 0.6964286
## 335 0.3035714 0.6964286
## 338 0.7600000 0.2400000
## 341 0.5517241 0.4482759
## 345 0.5517241 0.4482759
## 346 0.3035714 0.6964286
## 350 0.3035714 0.6964286
## 352 0.3035714 0.6964286
## 353 0.1842105 0.8157895
## 355 0.3035714 0.6964286
## 356 0.1842105 0.8157895
## 358 0.3035714 0.6964286
## 359 0.3035714 0.6964286
## 360 0.4000000 0.6000000
## 361 0.4000000 0.6000000
## 362 0.5517241 0.4482759
## 364 0.3035714 0.6964286
## 368 0.3035714 0.6964286
## 381 0.4000000 0.6000000
## 382 0.1842105 0.8157895
## 384 0.3035714 0.6964286
## 387 0.1842105 0.8157895
## 389 0.3035714 0.6964286
## 390 0.4000000 0.6000000
## 394 0.3035714 0.6964286
## 400 0.7878788 0.2121212
## 402 0.4000000 0.6000000
## 405 0.7878788 0.2121212
## 408 0.3035714 0.6964286
## 410 0.3035714 0.6964286
## 416 0.4000000 0.6000000
## 422 0.7600000 0.2400000
## 432 0.0754717 0.9245283
## 434 0.7910448 0.2089552
## 436 0.0754717 0.9245283
## 441 0.7910448 0.2089552
## 444 0.0754717 0.9245283
## 448 0.0754717 0.9245283
## 450 0.0754717 0.9245283
## 451 0.0754717 0.9245283
## 452 0.7910448 0.2089552
## 454 0.0754717 0.9245283
## 456 0.0754717 0.9245283
## 459 0.0754717 0.9245283
## 462 0.0754717 0.9245283
## 464 0.0754717 0.9245283
## 467 0.0754717 0.9245283
## 468 0.0754717 0.9245283
## 470 0.0754717 0.9245283
## 473 0.0754717 0.9245283
## 476 0.0754717 0.9245283
## 478 0.0754717 0.9245283
## 480 0.0754717 0.9245283
## 482 0.0754717 0.9245283
## 483 0.0754717 0.9245283
## 484 0.0754717 0.9245283
## 494 0.7910448 0.2089552
## 498 0.1842105 0.8157895
## 504 0.4000000 0.6000000
## 509 0.4000000 0.6000000
## 521 0.7600000 0.2400000
## 527 0.4000000 0.6000000
## 531 0.4000000 0.6000000
## 535 0.4000000 0.6000000
## 538 0.7600000 0.2400000
## 539 0.1842105 0.8157895
## 540 0.4000000 0.6000000
## 543 0.7600000 0.2400000
## 545 0.4000000 0.6000000
## 546 0.7910448 0.2089552
## 551 0.7910448 0.2089552
## 552 0.7910448 0.2089552
## 556 0.4000000 0.6000000
## 558 0.1842105 0.8157895
# Plot ROC curve
pred = prediction(PredictROC[,2], Test$Reverse)
perf = performance(pred, "tpr", "fpr")
plot(perf)
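
ROCR can also report the area under this curve with one more line:

# Compute AUC for the CART model
as.numeric(performance(pred, "auc")@y.values)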

Load randomForest package

# Load randomForest package
library(randomForest)

Implement random forest model

# Build random forest model
# Note: Reverse is still numeric here, so randomForest does regression rather
# than classification (and warns about it)
StevensForest = randomForest(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, ntree=200, nodesize=25 )

# Convert outcome to factor so randomForest performs classification
Train$Reverse = as.factor(Train$Reverse)
Test$Reverse = as.factor(Test$Reverse)

Refit random forest model

# Try again
StevensForest = randomForest(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, ntree=200, nodesize=25 )

# Make predictions
PredictForest = predict(StevensForest, newdata = Test)
# Compute Accuracy
z = table(Test$Reverse, PredictForest)
kable(z)
|   |  0|  1|
|:--|--:|--:|
|0  | 42| 35|
|1  | 18| 75|
sum(diag(z))/sum(z)
## [1] 0.6882353
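
As a follow-up sketch, randomForest’s built-in importance measures show which variables the forest leans on (mean decrease in Gini impurity by default):

# Variable importance for the random forest
importance(StevensForest)
varImpPlot(StevensForest)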

Cross-Validation

# Load cross-validation packages
library(caret)
library(e1071)

# Define cross-validation experiment
numFolds = trainControl( method = "cv", number = 10 )
cpGrid = expand.grid( .cp = seq(0.01,0.5,0.01)) 

# Perform the cross validation
train(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method = "rpart", trControl = numFolds, tuneGrid = cpGrid )
## CART 
## 
## 396 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 356, 356, 357, 356, 356, 357, ... 
## Resampling results across tuning parameters:
## 
##   cp    Accuracy   Kappa       
##   0.01  0.6087821   0.189707219
##   0.02  0.6216667   0.223453071
##   0.03  0.6267949   0.239192228
##   0.04  0.6368590   0.266178297
##   0.05  0.6443590   0.283030759
##   0.06  0.6443590   0.283030759
##   0.07  0.6443590   0.283030759
##   0.08  0.6443590   0.283030759
##   0.09  0.6443590   0.283030759
##   0.10  0.6443590   0.283030759
##   0.11  0.6443590   0.283030759
##   0.12  0.6443590   0.283030759
##   0.13  0.6443590   0.283030759
##   0.14  0.6443590   0.283030759
##   0.15  0.6443590   0.283030759
##   0.16  0.6443590   0.283030759
##   0.17  0.6443590   0.283030759
##   0.18  0.6443590   0.283030759
##   0.19  0.6443590   0.283030759
##   0.20  0.6038462   0.185123111
##   0.21  0.5631410   0.078289037
##   0.22  0.5528846   0.051089037
##   0.23  0.5403846   0.004897294
##   0.24  0.5378846  -0.008808290
##   0.25  0.5378846  -0.008808290
##   0.26  0.5453846   0.000000000
##   0.27  0.5453846   0.000000000
##   0.28  0.5453846   0.000000000
##   0.29  0.5453846   0.000000000
##   0.30  0.5453846   0.000000000
##   0.31  0.5453846   0.000000000
##   0.32  0.5453846   0.000000000
##   0.33  0.5453846   0.000000000
##   0.34  0.5453846   0.000000000
##   0.35  0.5453846   0.000000000
##   0.36  0.5453846   0.000000000
##   0.37  0.5453846   0.000000000
##   0.38  0.5453846   0.000000000
##   0.39  0.5453846   0.000000000
##   0.40  0.5453846   0.000000000
##   0.41  0.5453846   0.000000000
##   0.42  0.5453846   0.000000000
##   0.43  0.5453846   0.000000000
##   0.44  0.5453846   0.000000000
##   0.45  0.5453846   0.000000000
##   0.46  0.5453846   0.000000000
##   0.47  0.5453846   0.000000000
##   0.48  0.5453846   0.000000000
##   0.49  0.5453846   0.000000000
##   0.50  0.5453846   0.000000000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.19.

# Create a new CART model, using a cp value inside the cross-validated
# accuracy plateau (0.05 to 0.19 in the table above)
StevensTreeCV = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method="class", cp = 0.18)

# Make predictions
PredictCV = predict(StevensTreeCV, newdata = Test, type = "class")
z = table(Test$Reverse, PredictCV)
kable(z)
|   |  0|  1|
|:--|--:|--:|
|0  | 59| 18|
|1  | 29| 64|
# Compute Accuracy
sum(diag(z))/sum(z)
## [1] 0.7235294
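
Finally, it is worth plotting the cross-validated tree (rpart.plot is already loaded); with cp in the accuracy plateau, the tree is very simple:

# Plot the cross-validated CART tree
prp(StevensTreeCV)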