Application of tf-idf in Legal Risk Management

The idea of tf-idf is to find the words that matter most to the content of each document by 1) decreasing the weight of commonly used words and 2) increasing the weight of words that are rarely used across a collection or corpus of documents, in this case Jane Austen’s novels as a whole.
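To make the weighting concrete, here is a minimal Python sketch. It assumes one common definition, which the text above does not spell out: tf is a term's count divided by the document's total term count, and idf is the natural log of the number of documents divided by the number of documents containing the term. The function name tf_idf and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf scores.

    docs: dict mapping a document name to its list of tokens.
    Returns a dict mapping each document name to {term: score}.
    Assumed definition: tf = count / total terms, idf = ln(N / df).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


# Toy corpus, invented for illustration only.
docs = {
    "emma": "emma said the the the and".split(),
    "persuasion": "anne said the the and and".split(),
}
scores = tf_idf(docs)
```

In this toy corpus, "the" appears in both documents, so its idf is ln(2/2) = 0 and its score is zero no matter how often it occurs. The name "emma" appears in only one document, so it receives the highest score there.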

Interpretation

What measuring tf-idf has done here is show us that Jane Austen used similar language across her six novels, and what distinguishes one novel from the rest within the collection of her works are the proper nouns, the names of people and places.
This is the point of tf-idf; it identifies words that are important to one document within a collection of documents.
Legal application: We could generate the same chart for legal cases. Note that each book has its own graph in the current chart. Imagine that, instead of grouping by book, we group by nature of suit. Legal jargon that is common to all legal documents will be ranked low, while terms that appear frequently in one nature of suit but rarely in the others will be ranked high. In other words, we could identify candidate risk factors (distinctive, frequently used words) for each nature of suit. We can also break this down by type of business. What increases the risk of, say, discrimination lawsuits for restaurants?
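The grouping described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the case records, the suit labels and the helper names (tokenize, score_terms, top_terms) are hypothetical, and the tf-idf definition is the common one (term count over total terms, times the natural log of the number of groups over the number of groups containing the term).

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy records: (nature of suit, complaint text).
cases = [
    ("discrimination", "The plaintiff alleges the defendant terminated her after a pregnancy disclosure."),
    ("discrimination", "The plaintiff alleges the defendant denied promotion because of age."),
    ("wage", "The plaintiff alleges the defendant withheld overtime and tips."),
    ("wage", "The plaintiff alleges unpaid overtime by the defendant."),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Pool all tokens per nature of suit, so each suit type acts as one "document".
pooled = defaultdict(list)
for suit, text in cases:
    pooled[suit].extend(tokenize(text))

# Number of suit types in which each term appears.
n_groups = len(pooled)
df = Counter()
for tokens in pooled.values():
    df.update(set(tokens))


def score_terms(suit):
    """Return {term: tf-idf score} for one nature of suit."""
    counts = Counter(pooled[suit])
    total = len(pooled[suit])
    return {t: (c / total) * math.log(n_groups / df[t]) for t, c in counts.items()}


def top_terms(suit, k=3):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    scored = score_terms(suit)
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

Here, words shared by both suit types ("plaintiff", "defendant", "alleges") score zero, while "overtime" ranks first for the wage group. To break the results down by type of business, the grouping key could be a pair such as (nature of suit, business type) instead of the suit label alone.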


Daniel Lee

February 18, 2018