The aim of this post is to analyze bench opinions written by the Supreme Court to answer the following questions:
* Is it possible to determine which Supreme Court justice wrote the court’s majority opinion for a particular case?
* Is it possible to determine whether the majority opinion was written by a Supreme Court Justice with a liberal or conservative affiliation?

Adapted from Chapter 10 of “Automated Data Collection with R” by Munzert, Rubba, Meißner, and Nyhuis.

I want to use our textbook as a guide for my first foray into statistical text processing.

Gathering the Opinions

All opinions came from the Cornell University Law School’s Legal Information Institute.

#packages used throughout this post
library(RCurl)
library(XML)
library(stringr)
library(tm)
library(RTextTools)
library(wordcloud)
library(scales)

base_url <- "https://www.law.cornell.edu"

#create a handle so I can leave my information when I extract the urls
#cacert.pem is the certificate bundle that ships with RCurl
signatures = system.file("CurlSSL", "cacert.pem", 
                         package = "RCurl")
handle = getCurlHandle(cainfo = signatures, 
                        httpheader = list(from="ncapofari@yahoo.com",
                        'user-agent' = str_c(R.version$version.string,
                        ", ", R.version$platform)))

#function to extract the html files of court issued opinions
get_opinions <- function(x, directory){
  #pass the function a judge's last name 
  #function will create html files on the local drive
  #each html file corresponds to a supreme court case 
  #where the judge wrote the majority opinion
  justice_url <- sprintf("/supct/author.php?%s#OPIN", x)
  #download the justice's index page, identifying myself through the handle
  page <- getURL(str_c(base_url, justice_url), curl = handle)
  #parse html doc
  parsed <- htmlParse(page)
  #find the useful links
  links_list <- xpathApply(parsed, "//ul")
  links <- getHTMLLinks(links_list[[10]])
  #don't need every decision ever so set limit at 100
  #justices with fewer than 100 opinions keep all of theirs
  len <- min(100, length(links))
  #for each link
  for(i in seq_len(len)){
    url <- str_c(base_url, links[i])
    #follow link and copy the html file, reusing the handle created above
    tmp <- getURL(url, curl = handle)
    #create html file with justice name and a number
    new_file <- str_c(directory, "/", x, i, ".html")
    write(tmp, new_file)
  }
}

#list of current supreme court justices
sc_justices <- c("Alito", "Breyer", "Ginsburg", "Kagan", "Kennedy", "Roberts", "Scalia", "Sotomayor", "Thomas")
retired_justices <- c("O'Connor", "Stevens", "Rehnquist", "Powell", "Blackmun", "Burger", "Marshall", "Fortas", "Goldberg", "White", "Stewart")

#create a directory
dir.create("SC_Opinions", showWarnings = FALSE)
if(length(list.files("SC_Opinions")) == 0){
  #get opinions for each justice
  for(i in 1:length(sc_justices)){
    get_opinions(sc_justices[i], "SC_Opinions")
  }
}
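
The script above only downloads when the SC_Opinions directory is empty, so an interrupted run would not pick up where it left off. A minimal sketch of a per-file check in base R follows. The helper name `pending_files` and the demo directory are my own assumptions for illustration and are not part of the original script.

```r
# Hypothetical helper: return the target paths that are not on disk yet
pending_files <- function(justice, n, directory) {
  paths <- file.path(directory, paste0(justice, seq_len(n), ".html"))
  paths[!file.exists(paths)]
}

# Demonstration in a temporary directory (not the real SC_Opinions folder)
demo_dir <- file.path(tempdir(), "SC_Opinions_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("<html></html>", file.path(demo_dir, "Alito1.html"))

# Alito1.html already exists, so only Alito2.html and Alito3.html remain
basename(pending_files("Alito", 3, demo_dir))
```

Inside a download loop, the same `file.exists()` test could guard each call to `getURL()` so that only the missing files are requested.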

Creating a Corpus

Now that we have these opinions as .html files on our local drive, we can create a corpus that contains them. I chose to save the .html files because I do not have full access to the internet where I work (and not just because this is how the authors of the textbook do it).

When I ran the program for the first time, I realized that some of the links did not contain the court’s opinion. After following a few of those links, I found that quite a few opinions are simply missing from the pages linked on the web site. I was still able to scrape enough files to construct a viable corpus. Next time I would use a different site so that no cases are skipped.
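
One way to see how many saved pages are missing an opinion is to scan the files before building the corpus. The sketch below is a rough, text-based proxy for the XPath query used later (`//p[@class='bodytext']`). The helper `has_opinion_text` and the two toy pages are assumptions made for illustration, and a regular expression like this can misfire on unusual markup.

```r
# Hypothetical helper: TRUE when a saved page contains at least one
# <p class="bodytext"> element (a rough proxy for the XPath query)
has_opinion_text <- function(path) {
  html <- paste(readLines(path, warn = FALSE), collapse = "")
  grepl("<p[^>]*class=['\"]bodytext['\"]", html)
}

# Two toy pages written to a temporary directory for illustration
demo_dir <- file.path(tempdir(), "empty_check_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines('<html><p class="bodytext">Held: ...</p></html>',
           file.path(demo_dir, "Alito1.html"))
writeLines("<html><p>Page not found</p></html>",
           file.path(demo_dir, "Alito2.html"))

# Count the pages with and without opinion text
files <- list.files(demo_dir, full.names = TRUE)
found <- vapply(files, has_opinion_text, logical(1))
table(found)
```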

#helper function to determine a judge's affiliation
#these classifications are from www.insidegov.com
get_affiliation <- function(judge){
  libs <- c("Breyer", "Ginsburg", "Kagan", "Sotomayor")
  #the value of the if/else expression is the function's return value
  if(judge %in% libs) "liberal" else "conservative"
}

#1st document to create the corpus to store all opinions
tmp <- readLines("SC_Opinions/Alito1.html")
tmp <- str_c(tmp, collapse = "")
tmp <- htmlParse(tmp)

#extract the opinion of the court for the case
opinion <- xpathSApply(tmp, "//p[@class='bodytext']", xmlValue)  
opinion <- paste(opinion, collapse = '')

#create corpus
opinion_corpus <- Corpus(VectorSource(opinion))
#create meta data
meta(opinion_corpus[[1]], "author") <- "Alito"
meta(opinion_corpus[[1]], "affiliation") <- get_affiliation("Alito")
meta(opinion_corpus[[1]], "id") <- "Alito1.html"

#function to create a corpus from the html files
create_corpus <- function(corpus, directory){
  #list the files once so the order cannot change mid-loop
  files <- list.files(directory)
  #the first file is already in the corpus, so start at the second
  n <- 2
  for(i in seq_along(files)[-1]){
    #extract judge's name from the file name, e.g. "Alito17.html" -> "Alito"
    justice <- str_extract(files[i], "^[A-Za-z]+")
    #parse the file
    tmp <- readLines(file.path(directory, files[i]))
    tmp <- str_c(tmp, collapse = "")
    tmp <- htmlParse(tmp)
    #extract opinion
    opinion <- xpathSApply(tmp, "//p[@class='bodytext']", xmlValue)
    opinion <- paste(opinion, collapse = '')
    #I did not realize til later but a good amount of the links 
    #do not have any information in them, so skip those
    if(opinion == "") { next }
    #an opinion was found, so create a temp corpus
    tmp_corpus <- Corpus(VectorSource(opinion))
    #combine the temp with the real corpus
    corpus <- c(corpus, tmp_corpus)
    #set meta data
    meta(corpus[[n]], "author") <- justice
    meta(corpus[[n]], "affiliation") <- get_affiliation(justice)
    meta(corpus[[n]], "id") <- files[i]
    n <- n + 1
  }
  return(corpus)
}
  
opinion_corpus <- create_corpus(opinion_corpus, "SC_Opinions") 
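
The metadata set above is derived entirely from the file names. As a small sketch of that step on its own, the base R code below recovers the justice and affiliation for a few made-up file names that follow the same `<Justice><number>.html` pattern. It uses `sub()` and a vectorized `ifelse()` rather than the post's `str_extract()` and `get_affiliation()`, purely for illustration.

```r
# Hypothetical file names following the "<Justice><number>.html" pattern
files <- c("Alito1.html", "Kagan12.html", "Thomas3.html")

# Strip the trailing number and extension to recover the justice's name
justice <- sub("[0-9]+\\.html$", "", files)

# Vectorized lookup with the same liberal list as get_affiliation()
libs <- c("Breyer", "Ginsburg", "Kagan", "Sotomayor")
affiliation <- ifelse(justice %in% libs, "liberal", "conservative")

# One row of metadata per file
data.frame(id = files, author = justice, affiliation = affiliation)
```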

Corpus Summary

These tables show the number of cases in which each justice wrote the majority opinion, first by author and then by affiliation. The document for each case includes the syllabus, the majority opinion, and all dissents and concurrences.

table(as.character(meta(opinion_corpus, "author")))
## 
##     Alito    Breyer  Ginsburg     Kagan   Kennedy   Roberts    Scalia Sotomayor    Thomas 
##        52        53        40         5        41         8        34        18        45
table(as.character(meta(opinion_corpus, "affiliation")))
## 
## conservative      liberal 
##          180          116
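
The classes are far from balanced, which matters when judging a classifier later on. A quick sketch, using only the counts printed above, gives the accuracy of a model that always predicts the largest class. Treating that as the baseline is my framing, not something the original analysis computes.

```r
# Opinion counts copied from the two tables above
author_counts <- c(Alito = 52, Breyer = 53, Ginsburg = 40, Kagan = 5,
                   Kennedy = 41, Roberts = 8, Scalia = 34,
                   Sotomayor = 18, Thomas = 45)
affiliation_counts <- c(conservative = 180, liberal = 116)

# Share of the corpus held by the largest class: the accuracy of a
# classifier that always predicts that class
author_baseline <- max(author_counts) / sum(author_counts)
affiliation_baseline <- max(affiliation_counts) / sum(affiliation_counts)

round(c(author = author_baseline, affiliation = affiliation_baseline), 3)
```

On these counts the baseline is about 0.179 for authorship (always guessing Breyer) and about 0.608 for affiliation (always guessing conservative).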

These are the words that appear most when a liberal justice writes the majority opinion.

#function to create a term document matrix from a corpus
create_tdm <- function(a_corpus){
  tdm <- TermDocumentMatrix(a_corpus, 
                            control = 
                              list(removePunctuation=TRUE, 
                                   removeNumbers=TRUE, 
                                   stopwords=TRUE))
  tdm <- removeSparseTerms(tdm, 0.99)
  return(tdm)
}

#function to create a word cloud from a tdm
my_wordcloud <- function(tdm){
  m <- as.matrix(tdm)
  v <- sort(rowSums(m), decreasing=TRUE)
  df <- data.frame(word = names(v), freq = v)
  df$prop <- df$freq/sum(df$freq)
  #strip any character that is not a lowercase letter or whitespace
  df$word <- str_replace_all(df$word, "[^a-z\\s]", "")
  set.seed(1234)
  wordcloud(words = df$word, freq = df$freq, 
            min.freq = 1000, max.words = 150,
            random.order = FALSE, rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))
  return(df)
}
lib_index <- meta(opinion_corpus, "affiliation") == "liberal"
liberal <- my_wordcloud(create_tdm(opinion_corpus[lib_index]))

liberal$affiliation = "liberal"
head(liberal)
##            word  freq        prop affiliation
## court     court 26241 0.015536292     liberal
## see         see 16135 0.009552916     liberal
## state     state 12964 0.007675489     liberal
## federal federal 11533 0.006828248     liberal
## law         law  8448 0.005001738     liberal
## case       case  8059 0.004771426     liberal

These are the words that appear most when a conservative justice writes the majority opinion.

con_index <- meta(opinion_corpus, "affiliation") == "conservative"
conservative <- my_wordcloud(create_tdm(opinion_corpus[con_index]))

conservative$affiliation = "conservative"
head(conservative)
##          word  freq        prop  affiliation
## court   court 35426 0.012408745 conservative
## see       see 26753 0.009370834 conservative
## states states 13073 0.004579109 conservative
## state   state 12951 0.004536376 conservative
## may       may 12075 0.004229538 conservative
## case     case 11149 0.003905186 conservative

The term frequencies look similar for each subset of the corpus. But what if we compared them to each other in one word cloud?

#this is a way to retrieve the text from a corpus
lib_text <- data.frame(text=unlist(sapply(opinion_corpus[lib_index], `[`, "content")), stringsAsFactors=FALSE)
con_text <- data.frame(text=unlist(sapply(opinion_corpus[con_index], `[`, "content")), stringsAsFactors=FALSE)

#combine every opinion in each group into one string, then strip non-letters
all_lib_ops <- str_replace_all(str_c(lib_text$text, collapse = " "), "[^A-Za-z\\s]", "")
all_con_ops <- str_replace_all(str_c(con_text$text, collapse = " "), "[^A-Za-z\\s]", "")

#create a new 2 document corpus
corp <- Corpus(VectorSource(c(all_lib_ops, all_con_ops)))
new_tdm <- create_tdm(corp)
new_tdm <- as.matrix(new_tdm)
colnames(new_tdm) <- c("liberal", "conservative")

comparison.cloud(new_tdm, random.order=FALSE, 
  colors = c("#00B2FF", "red", "#FF0099", "#6600CC"),
  title.size=1.5, max.words=80)
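
As a numeric complement to the comparison cloud, the proportions from the two `head()` tables can be subtracted directly. The sketch below uses only the words that appear in both tables. Each proportion was computed over its own subset's vocabulary, so the differences are indicative rather than exact.

```r
# Proportions copied from the head() output above (shared words only)
liberal_prop <- c(court = 0.015536292, see = 0.009552916,
                  state = 0.007675489, case = 0.004771426)
conservative_prop <- c(court = 0.012408745, see = 0.009370834,
                       state = 0.004536376, case = 0.003905186)

# Positive values mean the word makes up a larger share of the liberal subset
diff_prop <- liberal_prop - conservative_prop[names(liberal_prop)]
sort(diff_prop, decreasing = TRUE)
```

Among these four words, "state" and "court" show the largest gaps, while "see" is almost identical in both subsets.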

Creating a Document Term Matrix

#function takes a corpus and returns a document term matrix
create_dtm <- function(corpus){
  #use only lower case letters
  #content_transformer lets tm_map apply functions from outside the tm
  #package while keeping each document's structure intact
  opin_corpus <- tm_map(corpus, content_transformer(tolower))
  #remove all punctuation and numbers
  strip_non_letters <- content_transformer(
    function(x) str_replace_all(x, "[^a-z\\s]", ""))
  opin_corpus <- tm_map(opin_corpus, strip_non_letters)
  #remove the names of any justices
  #it would be too easy if we left them in
  #apostrophes are stripped so "o'connor" matches the cleaned text
  justice_names <- str_replace_all(
    tolower(c(sc_justices, retired_justices)), "[^a-z]", "")
  #remove stop words too
  #and types of justices
  opin_corpus <- tm_map(opin_corpus, 
                        removeWords, 
                        words = c(justice_names,
                                  stopwords("english"),
                                  c("chief", "junior", "associate")))
  #use only word stems
  opin_corpus <- tm_map(opin_corpus, stemDocument)
  #create a document term matrix
  dtm <- DocumentTermMatrix(opin_corpus)
  #remove words that appear in 3 documents or less
  dtm <- removeSparseTerms(dtm, 1-(3/length(corpus)))
  return(dtm)
}

dtm <- create_dtm(opinion_corpus)
#save copy of all terms
write(unlist(dtm$dimnames[2]), "dtm.txt")
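
The sparsity threshold passed to `removeSparseTerms()` is easier to reason about on a tiny example. The sketch below computes document frequency and sparsity by hand for a made-up 5-document matrix. It illustrates the idea only: the cutoff of 0.5 is arbitrary, and how `removeSparseTerms()` treats a term sitting exactly on the threshold may differ from the `<=` used here.

```r
# Toy document-term counts: 5 documents (rows) by 3 terms (columns)
toy <- matrix(c(1, 0, 2,
                1, 0, 0,
                3, 0, 0,
                1, 1, 0,
                2, 0, 0),
              nrow = 5, byrow = TRUE,
              dimnames = list(NULL, c("court", "certiorari", "tort")))

# Document frequency: in how many documents does each term occur?
doc_freq <- colSums(toy > 0)

# Sparsity: share of documents in which the term is absent
sparsity <- 1 - doc_freq / nrow(toy)

# Keep terms whose sparsity does not exceed a chosen maximum
# (0.5 is an arbitrary value for the toy data)
keep <- sparsity <= 0.5
colnames(toy)[keep]

# The threshold used in create_dtm() for a 296-document corpus
1 - 3 / 296
```

Only "court" survives in the toy data, because the other two terms are absent from four of the five documents.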

Supervised Learning Techniques

Using the RTextTools package, we will train three models to predict the majority opinion’s author and that author’s affiliation.

n = nrow(dtm)
#random sample to use as training data
train = sort(sample(1:n, n*.8))
#test the rest
test = sort(setdiff(1:n, train))

#see if we can detect the majority opinion author
op_list <- unlist(meta(opinion_corpus, "author"))
#create a container to store our information
container <- create_container(dtm,
                              labels = op_list,
                              trainSize = train,
                              testSize = test,
                              virgin = FALSE)
#test these 3 models
svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")

svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)

#create a data frame to view the results
comparisons <- data.frame(
  id = unlist(meta(opinion_corpus[test], "id")),
  correct_author = op_list[test],
  svm_author = as.character(svm_out[,1]),
  svm_prob = as.character(svm_out[,2]),
  tree_author = as.character(tree_out[,1]),
  tree_prob = as.character(tree_out[,2]),
  maxent_author = as.character(maxent_out[,1]),
  maxent_prob = as.character(maxent_out[,2]),
  stringsAsFactors = FALSE)
head(comparisons)
##                        id correct_author svm_author          svm_prob tree_author         tree_prob maxent_author
## Alito17.html Alito17.html          Alito      Alito 0.199225958066123    Ginsburg                 1        Breyer
## Alito2.html   Alito2.html          Alito      Alito 0.199225958066123      Breyer             0.875        Thomas
## Alito27.html Alito27.html          Alito      Alito 0.199225958066123       Alito 0.333333333333333       Roberts
## Alito36.html Alito36.html          Alito      Alito 0.201691163073076   Sotomayor                 1     Sotomayor
## Alito4.html   Alito4.html          Alito     Thomas 0.232570363324938      Breyer 0.642857142857143        Breyer
## Alito40.html Alito40.html          Alito      Alito 0.199225958066123       Alito 0.666666666666667        Breyer
##                    maxent_prob
## Alito17.html 0.999999991904855
## Alito2.html                  1
## Alito27.html                 1
## Alito36.html  0.99995391225811
## Alito4.html  0.690095054279922
## Alito40.html                 1
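
Six rows only hint at where a model goes wrong. A confusion matrix summarizes every test prediction at once. The sketch below uses toy labels, not the post's actual results; with the real data, `comparisons$correct_author` and `comparisons$svm_author` would take the place of `actual` and `predicted`.

```r
# Toy labels for illustration only
actual    <- c("Alito", "Alito", "Breyer", "Breyer", "Thomas", "Thomas")
predicted <- c("Alito", "Thomas", "Breyer", "Alito", "Thomas", "Thomas")

# Shared factor levels keep the table square even if a class is never predicted
lv <- sort(unique(c(actual, predicted)))
confusion <- table(actual = factor(actual, levels = lv),
                   predicted = factor(predicted, levels = lv))
confusion

# Overall accuracy is the share of predictions on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```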

Now let’s take a look at our affiliation training.

#see if we can detect the party affiliation of the author
op_list_i <- unlist(meta(opinion_corpus, "affiliation"))
#create a container to store our information
container_i <- create_container(dtm,
                              labels = op_list_i,
                              trainSize = train,
                              testSize = test,
                              virgin = FALSE)
#test these 3 models
svm_model_i <- train_model(container_i, "SVM")
tree_model_i <- train_model(container_i, "TREE")
maxent_model_i <- train_model(container_i, "MAXENT")

svm_out_i <- classify_model(container_i, svm_model_i)
tree_out_i <- classify_model(container_i, tree_model_i)
maxent_out_i <- classify_model(container_i, maxent_model_i)

#create a data frame to view the results
comparisons_i <- data.frame(
  id = unlist(meta(opinion_corpus[test], "id")),
  correct_affiliation = op_list_i[test],
  svm_affiliation = as.character(svm_out_i[,1]),
  svm_prob = as.character(svm_out_i[,2]),
  tree_affiliation = as.character(tree_out_i[,1]),
  tree_prob = as.character(tree_out_i[,2]),
  maxent_affiliation = as.character(maxent_out_i[,1]),
  maxent_prob = as.character(maxent_out_i[,2]),
  stringsAsFactors = FALSE)
head(comparisons_i)
##                        id correct_affiliation svm_affiliation          svm_prob tree_affiliation         tree_prob
## Alito17.html Alito17.html        conservative    conservative 0.624443154175027     conservative                 1
## Alito2.html   Alito2.html        conservative    conservative 0.624443154175027     conservative 0.928571428571429
## Alito27.html Alito27.html        conservative    conservative 0.624443154175027     conservative                 1
## Alito36.html Alito36.html        conservative    conservative 0.620395172266471     conservative                 1
## Alito4.html   Alito4.html        conservative    conservative 0.565956134732531     conservative                 1
## Alito40.html Alito40.html        conservative    conservative 0.624443154175027     conservative 0.928571428571429
##              maxent_affiliation maxent_prob
## Alito17.html            liberal           1
## Alito2.html        conservative           1
## Alito27.html       conservative           1
## Alito36.html            liberal           1
## Alito4.html        conservative           1
## Alito40.html            liberal           1
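
Both experiments reuse the same random train/test split, and `sample()` was called without a seed, so the rows shown above will change from run to run. A minimal sketch of a repeatable 80/20 split follows. The seed value is an arbitrary choice of mine, and `n` is taken from the corpus summary rather than from `nrow(dtm)`.

```r
n <- 296  # number of documents in the corpus summary above

# Fixing the seed makes the random split repeatable (2016 is arbitrary)
set.seed(2016)
train <- sort(sample(seq_len(n), floor(n * 0.8)))
test <- setdiff(seq_len(n), train)

length(train)  # 236 training documents
length(test)   # 60 test documents
```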

Summary of Outcomes

final_df <- cbind(comparisons, comparisons_i)
totals <- data.frame(rbind(
  table(final_df$correct_author == final_df$svm_author),
  table(final_df$correct_author == final_df$tree_author),
  table(final_df$correct_author == final_df$maxent_author),
  table(final_df$correct_affiliation == final_df$svm_affiliation),
  table(final_df$correct_affiliation == final_df$tree_affiliation),
  table(final_df$correct_affiliation == final_df$maxent_affiliation)
))
totals$model <- c("svm_author", "tree_author", "maxent_author",
                  "svm_affiliation", "tree_affiliation", "maxent_affiliation")
totals$percent_correct <- percent(totals$TRUE./(totals$FALSE.+totals$TRUE.))

totals
##   FALSE. TRUE.              model percent_correct
## 1     39    21         svm_author           35.0%
## 2     43    17        tree_author           28.3%
## 3     39    21      maxent_author           35.0%
## 4     19    41    svm_affiliation           68.3%
## 5     19    41   tree_affiliation           68.3%
## 6     23    37 maxent_affiliation           61.7%
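
With only 60 test documents, it is worth asking whether 68.3% is distinguishable from simply guessing the larger class. The sketch below runs an exact binomial test with `binom.test()` from base R. The baseline of 180/296 comes from the full corpus and is only an assumption about the test set, whose class mix may differ slightly.

```r
# 41 of 60 test opinions were labelled correctly by the SVM affiliation model
correct <- 41
n_test <- 60

# Assumed baseline: always answer "conservative", which matches 180 of the
# 296 opinions in the full corpus
baseline <- 180 / 296

# One-sided exact binomial test: is the accuracy above the baseline?
result <- binom.test(correct, n_test, p = baseline, alternative = "greater")
unname(result$estimate)  # observed accuracy, 41/60
result$p.value
```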

Possible Applications

I was not able to generate a good prediction model to determine the majority opinion writer of a Supreme Court decision. There are plenty of reasons why I failed. The most obvious reason, and the most glaring error I committed, is that I included the entire slip opinion instead of focusing on just the majority opinion of each case. Even though the models did better than a random guess among nine justices (about 11%), I expect the results would improve if I focused on just the majority opinion.

The affiliation models produced more accurate predictions, although the bar is higher here: always guessing “conservative” would be right for 180 of the 296 opinions in the corpus (about 61%), so the best models (68.3%) beat that baseline only modestly. With substantial improvements, these models could be used to predict the affiliation of future nominees to the Supreme Court. Even though a Justice may be nominated by a conservative or liberal President, that is not necessarily a bellwether of future decision making. John Paul Stevens was a registered Republican when he was nominated to the Court, but by the end of his tenure he was widely considered to be on the liberal side of the Court. If a nominee’s previous writings are evaluated, these models may help establish a clearer understanding of the candidate’s decision making.