Three major data sets
stateCases_words.csv
: PDF files of 667 civil cases in Belknap County, BCDD in Concord, and Hillsborough North
stateCases.csv
: spreadsheet variables on the same 667 cases
The words data came from scanned PDF files of 682 NH state cases: 575 cases from Hillsborough North, 74 cases from BCDD (Concord), and 33 cases from Belknap County Superior Court.
Several variations of the filtering mechanism were tried. This analysis used the following filtering mechanism:
Note that an idf (inverse document frequency) of 0 means that the word shows up in all documents: all 677 cases. Much legal jargon would surely have an idf of 0 or close to it. The higher the idf threshold used as a filter, the less likely legal jargon is to survive the filter (a stricter standard), but at the risk of excluding potentially important words from the analysis. Another challenge in our analysis is dealing with the many proper nouns (e.g., names of the defendant and the plaintiff) that are specific to each case and not interesting. One way to filter proper nouns is to set a minimum number of cases in which a word must appear. Note that setting this minimum too low may produce many pairs of words with perfect correlation, because proper nouns such as a first and last name always appear together.
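The two filters described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the analysis: the function name `filter_words`, the thresholds `min_idf` and `min_docs`, and the toy corpus are all assumptions made for the example, and idf is assumed to be the natural log of (number of cases / number of cases containing the word).

```python
import math
from collections import Counter

def filter_words(docs, min_idf=0.1, min_docs=2):
    """Return {word: (case_count, idf)} for words that pass both filters.

    docs maps a case id to the list of words in that case.
    min_idf and min_docs are illustrative thresholds (assumptions),
    not the values used in the analysis.
    """
    n_docs = len(docs)
    # Number of cases in which each word appears at least once.
    case_count = Counter(w for words in docs.values() for w in set(words))
    kept = {}
    for word, count in case_count.items():
        idf = math.log(n_docs / count)  # equals 0 when the word is in every case
        if idf >= min_idf and count >= min_docs:
            kept[word] = (count, idf)
    return kept

# Made-up toy corpus: "court" is in every case, "smith" in only one.
docs = {
    "case1": ["court", "plaintiff", "smith", "contract"],
    "case2": ["court", "plaintiff", "contract", "breach"],
    "case3": ["court", "breach", "damages"],
}
kept = filter_words(docs, min_idf=0.1, min_docs=2)
print(sorted(kept))
```

In this toy corpus, "court" is dropped by the idf filter because it appears in every case, while "smith" and "damages" are dropped by the minimum-case filter because each appears in only one case.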
The merged data (data_word) has 174,581 rows, over 20,000 more than stateCases_words (152,160 rows). This is because stateCases_spreadsheet has a few cases that show up in multiple rows (each row represents one case and type), so the word rows for those cases are repeated once per matching spreadsheet row.
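The row inflation can be reproduced with a small, self-contained join. This is only a sketch of the mechanism: the helper `left_join`, the column names `case_id`, `word`, and `type`, and the rows themselves are hypothetical and are not taken from the actual data sets.

```python
def left_join(words_rows, sheet_rows, key="case_id"):
    """Join each words row to every spreadsheet row that shares its key."""
    by_key = {}
    for row in sheet_rows:
        by_key.setdefault(row[key], []).append(row)
    merged = []
    for w in words_rows:
        # A words row with no spreadsheet match is kept once, unchanged.
        for s in by_key.get(w[key], [{}]):
            merged.append({**w, **s})
    return merged

# Hypothetical rows: case "A" has two spreadsheet rows (two case types).
words_rows = [
    {"case_id": "A", "word": "contract"},
    {"case_id": "A", "word": "breach"},
    {"case_id": "B", "word": "damages"},
]
sheet_rows = [
    {"case_id": "A", "type": "contract"},
    {"case_id": "A", "type": "debt"},
    {"case_id": "B", "type": "tort"},
]
merged = left_join(words_rows, sheet_rows)
print(len(words_rows), len(merged))
```

Each word row for case "A" is emitted twice, once per case type, so the merged table has more rows than the words table even though no new words were added.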
After merging the state variables data with the complaint pages, only about half of the cases remain: 333 cases, not the 667 in the state variables data. A large number of cases from the Belknap County jurisdiction don’t have complaint pages.
The common words plot looks odd in that many words show up in exactly the same number of cases. This happens because there are relatively few cases (333) while each case contains a large number of words, so the per-word case counts can take only a limited set of values and many words tie.
ProSe appears to be a statistically significant predictor of judgment. When the defendant is pro se, the odds of the plaintiff winning the case increase.
A lasso model was built to predict the plaintiff’s odds of winning based on the words in the complaint pages. The model also controls for ProSe, because the data exploration above shows a significant relationship between the plaintiff’s odds of winning and ProSe.
The plot shows that the optimal log of lambda is about -3. At that value, roughly the first 30 predictors give a model that balances simplicity and accuracy. The plot also shows that roughly the 12 most influential predictors give the simplest model that still lies within one standard error of the optimal value of lambda. The lasso model achieves this by discarding variables that have little predictive power and selecting those that are most influential.
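The one-standard-error reading of the plot can be sketched as follows. The cross-validation numbers below are made up for illustration and are not the values from this analysis, and the helper name `pick_lambdas` is an assumption. The rule shown is the usual one: among all values of lambda whose cross-validated error is within one standard error of the minimum, take the largest, since a larger penalty keeps fewer predictors.

```python
def pick_lambdas(log_lambdas, cv_error, cv_se):
    """Return (log_lambda_min, log_lambda_1se) from cross-validation results."""
    # Index of the lambda with the lowest cross-validated error.
    i_min = min(range(len(cv_error)), key=lambda i: cv_error[i])
    threshold = cv_error[i_min] + cv_se[i_min]
    # Largest lambda (strongest penalty) whose error is within one SE of the minimum.
    i_1se = max(
        (i for i in range(len(cv_error)) if cv_error[i] <= threshold),
        key=lambda i: log_lambdas[i],
    )
    return log_lambdas[i_min], log_lambdas[i_1se]

# Made-up cross-validation curve (illustrative only).
log_lambdas = [-5.0, -4.0, -3.0, -2.5, -2.0, -1.0]
cv_error = [1.30, 1.22, 1.18, 1.20, 1.27, 1.38]
cv_se = [0.04, 0.04, 0.03, 0.03, 0.04, 0.05]

best, simplest = pick_lambdas(log_lambdas, cv_error, cv_se)
print(best, simplest)
```

With these made-up numbers the minimum sits at a log lambda of -3.0, and the simplest model within one standard error sits at -2.5, which mirrors the trade-off between the roughly 30-predictor and roughly 12-predictor models described above.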
Log-odds plot
Positive log-odds correspond to a probability of the plaintiff winning greater than 50%, while negative log-odds correspond to a probability of the plaintiff winning less than 50%. For example, the use of "proved" in the complaint gives a probability of the plaintiff winning greater than 50%.
Probability plot
Positive values indicate that the use of the word increases the probability of the plaintiff winning by that margin. For example, the use of "proved" in the complaint increases the probability of the plaintiff winning by about 16%.
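The link between the log-odds plot and the probability plot can be sketched with the logistic function. This is an assumption about how the two plots relate, not a description of the code behind them: it treats each word's log-odds in isolation (all other terms set to zero), so that a log-odds of 0 maps to a probability of exactly 50% and the margin is measured relative to that even baseline. The function names are hypothetical.

```python
import math

def log_odds_to_prob(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def margin_over_even(log_odds):
    """Probability change relative to a 50% baseline (an assumed baseline)."""
    return log_odds_to_prob(log_odds) - 0.5

# Illustrative log-odds values, not coefficients from the fitted model.
for b in (-0.5, 0.0, 0.5):
    print(b, round(log_odds_to_prob(b), 3), round(margin_over_even(b), 3))
```

Under this assumption, positive log-odds always give a probability above 50% and a positive margin, and negative log-odds give the mirror image.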