The idea of tf-idf is to find the important words for the content of each document by 1) decreasing the weight of commonly used words and 2) increasing the weight of words that are not used very much in a collection or corpus of documents, in this case, the group of Jane Austen’s novels as a whole.
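For reference, one common way to write this weighting down is sketched below. The symbols are introduced here for illustration and are not notation from the text: n_{t,d} is the number of times term t appears in document d, N is the number of documents in the corpus, and N_t is the number of documents that contain t. Other variants exist (different log bases, smoothing), so treat this as one reasonable definition rather than the only one.

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{N_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this definition, a word that appears in every document has idf = ln 1 = 0, so its tf-idf is zero no matter how often it is used.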
Interpretation
Legal application: We could generate the same chart for legal cases. Note that each book has its own graph in the current chart. Imagine that, instead of grouping by book, we group by nature of suit. Legal jargon that is common to all legal documents will be ranked low, while terms that appear frequently in one nature of suit but rarely in the others will be ranked high. In other words, we could identify candidate risk factors (distinctively frequent words) for each nature of suit. We can also break this down by type of business. What increases the risk of, say, discrimination lawsuits for restaurants?
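To make the grouping idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `cases` list is invented toy data (not real case text), the function name `tf_idf_by_group` is hypothetical, tokenization is a plain whitespace split, and the weighting is term frequency times the natural log of (number of groups / number of groups containing the term). All text for one nature of suit is pooled into a single "document", in the same way each novel is treated as one document.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (nature of suit, case text). Not real cases.
cases = [
    ("discrimination", "plaintiff alleges manager denied promotion because of age"),
    ("discrimination", "plaintiff alleges harassment by manager and wrongful termination"),
    ("contract", "plaintiff alleges defendant breached the supply agreement"),
    ("contract", "defendant failed to deliver goods under the agreement"),
    ("personal injury", "plaintiff alleges injury after slipping on a wet floor"),
]


def tf_idf_by_group(records):
    """Return {group: {term: tf-idf}} with each group pooled into one document."""
    # Count words per nature of suit.
    counts = defaultdict(Counter)
    for group, text in records:
        counts[group].update(text.lower().split())

    n_groups = len(counts)
    # For each term, the number of groups in which it appears at least once.
    group_freq = Counter(term for c in counts.values() for term in c)

    scores = {}
    for group, c in counts.items():
        total = sum(c.values())
        scores[group] = {
            term: (n / total) * math.log(n_groups / group_freq[term])
            for term, n in c.items()
        }
    return scores


scores = tf_idf_by_group(cases)

# Show the three highest-scoring terms for each nature of suit.
for group, terms in scores.items():
    top = sorted(terms.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(group, [term for term, _ in top])
```

In this toy data, "plaintiff" and "alleges" appear in every nature of suit, so they score zero, while a word such as "manager" that is confined to the discrimination cases scores high there. Splitting further by type of business would only require changing the grouping key, for example to a (nature of suit, business type) pair.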