The following report explores the contents of various unstructured text documents to be used in the development of a predictive text algorithm. The corpus used to train this predictive model includes verbatim text from three sources: blogs, news articles, and Twitter feeds. Specifically, three source files will be used:
3 Data Files:
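As a rough sketch, the files can be loaded as plain text. The en_US.* file names below are an assumption, since the exact names are not listed here:

```r
# Minimal sketch of loading the three raw sources. The en_US.* file
# names are assumed, not confirmed; adjust the paths to your data.
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```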
Analyzing the frequency of words in each of the three sources (using a sample of 500 observations from each data file) reveals similar word frequencies across sources. The ten most frequent words include “the,” “and,” and “you,” among others. Combined, these common words represent approximately 17% of all words in the combined corpus.
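A minimal base-R sketch of how such a frequency count can be produced; the simple tokenizer (splitting on non-letter characters) and the seed are assumptions, not necessarily the exact steps behind the figures below:

```r
set.seed(1234)  # assumed seed, for reproducible sampling
sample_lines <- function(x, n = 500) sample(x, n)
all_lines <- c(sample_lines(blogs), sample_lines(news), sample_lines(twitter))

count_words <- function(lines) {
  # Lower-case, then split on anything that is not a letter or apostrophe
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  sort(table(words[words != ""]), decreasing = TRUE)
}

freq <- count_words(all_lines)
head(freq, 10)                    # the ten most frequent words
sum(head(freq, 10)) / sum(freq)   # their share of all words (~0.17 per above)
```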
Figure 1 - Histograms
Figure 2 - Histograms Excluding “stop words”
So-called “stop words” are the words used most frequently in the English language. To get a better understanding of the three corpora used for the model, the histograms above were re-run with “stop words” excluded (Fig. 2).
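One common way to exclude them is the English stop-word list shipped with the tm package; a sketch, assuming the frequency table `freq` from the earlier snippet:

```r
library(tm)  # provides stopwords("en"), one common English stop-word list

# Drop stop words from the frequency table before re-plotting
freq_no_stop <- freq[!names(freq) %in% stopwords("en")]
head(freq_no_stop, 10)
```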
An additional method of displaying the content of text corpora is the “word cloud.” Word clouds graphically display text by placing the words themselves at random, with higher-frequency words drawn in larger font sizes. The following word clouds reflect the same sources as the above histograms (laid out in the order of the histograms in Fig. 2 above). The graphics give a better sense of the differences between the three data sources.
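A panel like those below could be drawn with the wordcloud package; the size and color parameters here are illustrative only:

```r
library(wordcloud)
library(RColorBrewer)

# One panel: word positions are random, font size scales with frequency
wordcloud(words = names(freq), freq = as.numeric(freq),
          max.words = 100, colors = brewer.pal(8, "Dark2"))
```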
Figure 3 - Word Clouds (blogs, news, twitter, combined sources)
Figure 4 - Word Clouds Excluding “stop words” (blogs, news, twitter, combined sources)
And for an even clearer picture, the word clouds in Fig. 4 are similar to those in Fig. 3, but with “stop words” excluded.
Figure 5 - Word Clouds of “bigrams” including/excluding “stop words”
Additionally, the word clouds in Fig. 5 reflect “bigrams,” or two-word combinations, revealing the most common two-word sequences in the corpora. The left cloud combines all three data sources and includes every bigram with more than 10 counts, stop words included. The right cloud is the same, but excludes stop words (showing bigrams with 2 or more counts).
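Bigram counts like these can be built by pairing adjacent words within each line; a sketch reusing the simple tokenizer assumed earlier:

```r
# Count adjacent word pairs within each line (bigrams do not span lines)
bigrams_in_line <- function(line) {
  w <- unlist(strsplit(tolower(line), "[^a-z']+"))
  w <- w[w != ""]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}

bigram_freq <- sort(table(unlist(lapply(all_lines, bigrams_in_line))),
                    decreasing = TRUE)
bigram_freq[bigram_freq > 10]   # bigrams seen more than 10 times
```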
Plans for a Text Prediction Model
My plan for a text prediction model is to calculate Markov chains (probability matrices) for varying n-gram lengths over the three combined data sources. These probability matrices will be sourced in reactive input/output methods in Shiny.
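As a rough sketch of the bigram case, the conditional probabilities P(next word | current word) can be estimated from the bigram counts above; `predict_next` is a hypothetical helper, and in the Shiny app a lookup like this would sit inside a reactive expression:

```r
# Estimate P(w2 | w1) from the bigram counts (a first-order Markov chain)
bg <- strsplit(names(bigram_freq), " ", fixed = TRUE)
model <- data.frame(w1    = vapply(bg, `[`, "", 1),
                    w2    = vapply(bg, `[`, "", 2),
                    count = as.numeric(bigram_freq))
model$prob <- model$count / ave(model$count, model$w1, FUN = sum)

# Hypothetical helper: the most probable next word after `word`
predict_next <- function(word) {
  cand <- model[model$w1 == tolower(word), ]
  if (nrow(cand) == 0) return(NA_character_)
  cand$w2[which.max(cand$prob)]
}
predict_next("of")   # in the Shiny app, this call would live in a reactive()
```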