Major findings

Future Research

Import and clean data

Three major data sets

Import data

State case variables from spreadsheet data

  • The state variables data includes 667 civil cases from Belknap County, BCDD in Concord, and Hillsborough North.
  • A few cases have more than one case type. The data was reshaped so that each row has only one case type. As a result, a case appears more than once when it has more than one case type.
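The reshaping described above can be sketched as follows. This is a minimal Python illustration, not the code used in the analysis; the case IDs, case types, and the function name `one_row_per_case_type` are hypothetical.

```python
# Hypothetical records: the case IDs and case types are made up for illustration.
cases = [
    {"case_id": "A-001", "case_types": ["Contract"]},
    {"case_id": "A-002", "case_types": ["Contract", "Tort"]},
]


def one_row_per_case_type(records):
    """Expand each case into one row per case type."""
    rows = []
    for record in records:
        for case_type in record["case_types"]:
            rows.append({"case_id": record["case_id"], "case_type": case_type})
    return rows


rows = one_row_per_case_type(cases)
for row in rows:
    print(row)
```

Case A-002 has two case types, so it appears in two rows of the result. Any later count of cases therefore has to count distinct case IDs rather than rows.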

Word data from scanned PDF files

The word data came from scanned PDF files for 682 NH state cases: 575 cases from Hillsborough North, 74 cases from BCDD (Concord), and 33 cases from Belknap County Superior Court.

Several variations of the filter were tried. This analysis used the following filter:

  • idf >= 2.5
  • words must appear in at least 10 cases
  • not in stop_words from the tidytext package

Note that an idf (inverse document frequency) of 0 means that the word appears in every case. Much legal jargon would surely have an idf of 0 or close to it. The higher the idf threshold, the more legal jargon is filtered out (a stricter standard), but at the risk of excluding potentially important words from the analysis.

Another challenge is the large number of proper nouns (e.g., the names of the defendant and the plaintiff) that are specific to each case and not interesting. One way to filter them out is to require a word to appear in a minimum number of cases. If that minimum is set too low, many pairs of words will show perfect correlation, because proper nouns that always occur together (e.g., a first and last name) remain in the data.
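The three filter conditions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the analysis: it assumes idf is the natural log of (number of cases / number of cases containing the word), which is how tidytext's bind_tf_idf defines it, and it uses a one-word stand-in for the stop_words list. The function name `filter_words` and the synthetic corpus are hypothetical.

```python
import math


def filter_words(case_words, stop_words, min_idf=2.5, min_cases=10):
    """Keep words that pass the idf, minimum-case, and stop-word filters.

    case_words maps a case id to the set of distinct words in that case.
    Returns a dict mapping each kept word to its idf.
    """
    n_cases = len(case_words)
    doc_freq = {}
    for words in case_words.values():
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    kept = {}
    for word, n_containing in doc_freq.items():
        idf = math.log(n_cases / n_containing)  # 0 when the word is in every case
        if idf >= min_idf and n_containing >= min_cases and word not in stop_words:
            kept[word] = idf
    return kept


# Synthetic corpus of 200 cases (made up for illustration).
case_words = {}
for i in range(200):
    words = {"plaintiff", "the"}      # in every case, so idf is 0
    if i < 50:
        words.add("contract")         # common: idf is too low
    if i < 12:
        words.add("indefeasible")     # rare, but in at least 10 cases
    if i < 5:
        words.add("fradette")         # a name in too few cases
    case_words[i] = words

kept = filter_words(case_words, stop_words={"the"})
print(kept)
```

Only "indefeasible" survives in this toy corpus. With a natural-log idf, a threshold of 2.5 means a word may appear in at most about 8% of cases (e^-2.5 ≈ 0.082), while the minimum of 10 cases removes names that occur in only a handful of complaints.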

Merge data

The merged data (data_word) has 174,581 rows, over 20,000 more than stateCases_words (152,160 rows). This is because stateCases_spreadsheet has a few cases that appear in multiple rows (each row represents one combination of case and case type).
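The row growth can be reproduced with a small join. The sketch below is a plain Python illustration with made-up rows, not the merge used in the analysis; `inner_join` is a hypothetical helper.

```python
# Hypothetical rows: one spreadsheet row per case and case type,
# one word row per case and word.
spreadsheet_rows = [
    {"case_id": "A-001", "case_type": "Contract"},
    {"case_id": "A-002", "case_type": "Contract"},
    {"case_id": "A-002", "case_type": "Tort"},
]
word_rows = [
    {"case_id": "A-001", "word": "indefeasible"},
    {"case_id": "A-002", "word": "gallant"},
    {"case_id": "A-002", "word": "fradette"},
]


def inner_join(left, right, key):
    """Join two lists of dicts on key, keeping every matching pair of rows."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    merged = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            merged.append({**left_row, **right_row})
    return merged


merged = inner_join(word_rows, spreadsheet_rows, "case_id")
print(len(word_rows), len(merged))  # prints "3 5"
```

Each word row for case A-002 matches two spreadsheet rows, so its words are duplicated once per case type. The same mechanism explains why the merged data has more rows than the word data alone.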

Explore data

After merging the state variables data with the complaint pages, only about half of the cases remain: 333 of the 667 cases in the state variables data. A large number of cases from the Belknap County jurisdiction don’t have complaint pages.

The most common words

The common words plot looks odd in that many words appear in exactly the same number of cases. This happens because the number of cases is relatively small (333), while each case contains a large number of words.

Correlated word pairs

To reduce legal jargon, I filtered out words with an inverse document frequency (IDF) of less than 2.5. An IDF of zero means that the word appears in all documents, so words with a very low IDF are likely to be legal jargon. In addition, only words that appear in at least 10 cases are included in the analysis.

The plot shows that the terms “subordinating” and “indefeasible” always appeared together, and that the plaintiff won a large percentage of the cases whose complaint pages contain those two words. It’s also interesting that a few pairs of words, such as “gallant” and “fradette”, appear together and are associated with the plaintiff winning the case.

There are many pairs of words with perfect correlation. This is because the complaint pages of each case contain a large number of words.

## # A tibble: 150 x 3
##    item1     item2        correlation
##    <fct>     <fct>              <dbl>
##  1 accesses  dissatisfied          1.
##  2 tickets   dissatisfied          1.
##  3 fradette  gallant               1.
##  4 gallant   fradette              1.
##  5 navy      corps                 1.
##  6 mceachern shaines               1.
##  7 benz      shaines               1.
##  8 mercedes  shaines               1.
##  9 motors    shaines               1.
## 10 maccallum shaines               1.
## # ... with 140 more rows
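A correlation of exactly 1 in the listing above means the two words appear in exactly the same set of cases. The sketch below illustrates this in Python, assuming the correlation is the phi coefficient (the Pearson correlation of the two words' presence/absence indicators across cases); the source does not state which measure was used. The case sets and the function name `phi` are hypothetical.

```python
import math


def phi(cases_with_a, cases_with_b, n_cases):
    """Phi coefficient between two words, given the sets of cases containing each.

    Undefined (division by zero) if either word appears in no case or in every case.
    """
    n11 = len(cases_with_a & cases_with_b)   # both words present
    n10 = len(cases_with_a - cases_with_b)   # only the first word
    n01 = len(cases_with_b - cases_with_a)   # only the second word
    n00 = n_cases - n11 - n10 - n01          # neither word
    denominator = math.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    return (n11 * n00 - n10 * n01) / denominator


# Made-up case sets out of 333 cases.
gallant = set(range(1, 11))     # cases 1 to 10
fradette = set(range(1, 11))    # the same 10 cases
other = set(range(6, 16))       # cases 6 to 15, overlapping in 5 cases

print(phi(gallant, fradette, 333))  # identical case sets give 1.0
print(phi(gallant, other, 333))     # partial overlap gives a value below 1
```

Two names that only ever occur in the same complaints, such as the two parts of a person's name, have identical case sets and therefore a correlation of 1.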

Logit model to predict the odds of the plaintiff winning based on ProSe

ProSe appears to be a statistically significant predictor of the judgment. When the defendant is pro se, the odds of the plaintiff winning the case increase.

Predicting the odds of the plaintiff winning the case based on words in complaint pages

A lasso model was built to predict the plaintiff’s odds of winning based on words in the complaint pages. The model also controls for ProSe, because the analysis above shows a significant relationship between ProSe and the plaintiff’s odds of winning.

The plot shows that the optimal log of lambda is about -3. At that value, about 30 predictors give a model that balances simplicity and accuracy. The plot also shows that a model with only about the 12 most influential predictors is the simplest one whose error still lies within one standard error of the error at the optimal lambda. The lasso does this by discarding variables that have little predictive power and selecting those that are most influential.
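The choice between the two models follows the one-standard-error rule, which can be sketched as below. This is a Python illustration, not the code used in the analysis. The cross-validation errors, standard errors, and the log lambda of the 12-predictor model are invented; only the log lambda of about -3 and the predictor counts of about 30 and 12 echo the text. The function name `select_lambdas` is hypothetical.

```python
# Invented cross-validation summary, one entry per lambda on the grid.
cv_results = [
    {"log_lambda": -5.0, "cv_error": 1.30, "std_error": 0.05, "n_predictors": 80},
    {"log_lambda": -4.0, "cv_error": 1.24, "std_error": 0.05, "n_predictors": 55},
    {"log_lambda": -3.0, "cv_error": 1.18, "std_error": 0.04, "n_predictors": 30},
    {"log_lambda": -2.5, "cv_error": 1.21, "std_error": 0.04, "n_predictors": 12},
    {"log_lambda": -2.0, "cv_error": 1.27, "std_error": 0.04, "n_predictors": 4},
]


def select_lambdas(results):
    """Return the minimum-error entry and the one-standard-error entry."""
    best = min(results, key=lambda r: r["cv_error"])
    threshold = best["cv_error"] + best["std_error"]
    # Among entries within one standard error of the minimum, take the
    # largest lambda, which is the most heavily penalized (simplest) model.
    within_one_se = [r for r in results if r["cv_error"] <= threshold]
    one_se = max(within_one_se, key=lambda r: r["log_lambda"])
    return best, one_se


best, one_se = select_lambdas(cv_results)
print(best["log_lambda"], best["n_predictors"])
print(one_se["log_lambda"], one_se["n_predictors"])
```

The minimum-error model keeps more predictors. The one-standard-error model gives up a small amount of accuracy that is within the noise of the cross-validation estimate in exchange for a shorter list of words.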

Log-odds plot

Positive log-odds mean that the probability of the plaintiff winning is greater than 50%, while negative log-odds mean that it is less than 50%. For example, the use of “proved” in the complaint gives a probability of the plaintiff winning greater than 50%.
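The link between log-odds and probability is the logistic function, sketched below in Python. The sketch treats the plotted value as the total log-odds, as the paragraph above does; the input values are illustrative and are not taken from the model.

```python
import math


def log_odds_to_probability(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Illustrative inputs, not model coefficients.
for value in (-0.5, 0.0, 0.5):
    print(value, log_odds_to_probability(value))
```

A log-odds of 0 corresponds to even odds, that is, a probability of exactly 50%. Any positive value maps above 50% and any negative value maps below it.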

Probability plot

Positive values indicate that the use of the word increases the probability of the plaintiff winning by that margin. For example, the use of “proved” in the complaint increases the probability of the plaintiff winning by about 16%.

Appendix