Arranging words into a network is a common way of visualizing all the relationships among words simultaneously, rather than just the top few at a time. Pairs of consecutive words might capture structure that isn’t present when one is just counting single words, and may provide context that makes tokens more understandable (for example, “pulteney street”, in Northanger Abbey, is more informative than “pulteney”).
The graph below shows a network of bigrams (one word immediately followed by another) in six major books of Jane Austen. Each arrow points from the first word of a bigram to the second, and its thickness reflects how often that bigram occurs. Only bigrams that occurred more than 20 times, and in which neither word was a stop word, are shown. The same analysis can be extended to longer n-grams (e.g., trigrams).
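The filtering behind such a graph can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to produce the figure: the function name `bigram_edges`, the tiny stop-word list, and the sample text are all assumptions made for the example, and the threshold is lowered from 20 to 2 so that the short sample yields a result.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real analysis
# would use a much fuller list).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "was", "were"}

def bigram_edges(text, min_count=20, stop_words=STOP_WORDS):
    """Return {(word1, word2): count} for bigrams seen more than min_count times.

    Each key is a directed edge (word1 -> word2); the count is the edge weight.
    Bigrams containing a stop word are dropped.
    """
    # Lowercase and split into word tokens; punctuation is discarded,
    # so bigrams may span sentence boundaries in this simple sketch.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(zip(words, words[1:]))
    return {
        pair: n
        for pair, n in counts.items()
        if n > min_count
        and pair[0] not in stop_words
        and pair[1] not in stop_words
    }

# Hypothetical sample text, not a quotation from the novels.
sample = (
    "She walked to Pulteney Street. Pulteney Street was crowded. "
    "The rooms in Pulteney Street were small."
)
edges = bigram_edges(sample, min_count=2)
print(edges)
```

Each surviving key can be drawn as an arrow from the first word to the second, with the count mapped to line thickness by whatever graph library is used.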
Interpretation
Legal Risk Management Application
: This technique may be used to extract specific information from court documents. For example, if court documents usually show “Plaintiff” immediately followed by the person’s name, say “Plaintiff Daniel Lee”, then the plaintiff’s name can be extracted by 1) tokenizing the document into trigrams; 2) separating each trigram into three columns of words (word1, word2, word3); 3) filtering for trigrams in which word1 is “plaintiff”; and 4) taking word2 and word3, which gives Daniel (word2) and Lee (word3).
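The four steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `plaintiff_names` and the one-sentence sample document are hypothetical, and the code assumes the name is always exactly two words directly after “Plaintiff”.

```python
import re

def plaintiff_names(text):
    """Return (word2, word3) for every trigram whose first word is 'plaintiff'."""
    # Step 1: tokenize into words, then form overlapping trigrams.
    words = re.findall(r"[A-Za-z']+", text)
    trigrams = zip(words, words[1:], words[2:])
    # Steps 2-4: unpack each trigram into three words, keep those whose
    # first word is "plaintiff" (case-insensitive), and return the other two.
    return [(w2, w3) for w1, w2, w3 in trigrams if w1.lower() == "plaintiff"]

# Hypothetical sample document.
doc = "Plaintiff Daniel Lee filed the complaint on Monday."
names = plaintiff_names(doc)
print(names)
```

Note that a phrase such as “the plaintiff argued that” would also match and return two words that are not a name, so a real pipeline would likely need an additional check (for instance, on capitalization) before trusting the result.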