This document shares results from an exploratory analysis of sample text data pulled from US News, various blogs and Twitter. This text has been selected as training input for a learning algorithm that will suggest next-word choices as one keys text into a device. The results describe the nature of these data and lay out the options for trading prediction quality against performance in the design.
| Source | DiskSpace (bytes) | LineCnt | AvgLineSize (bytes/line) |
|---|---|---|---|
| US News | 205,811,889 | 1,010,242 | 203.7253 |
| Blogs | 210,160,014 | 899,288 | 233.696 |
| Twitter | 167,105,338 | 2,360,148 | 70.80291 |
| TOTALS | 583,077,241 | 4,269,678 | 136.5623 |
With the three sources combined, the data consists of 214,386,000 words distributed among 4.3M lines and occupying 0.58 GB of disk space. Because this is very large for our computing power, the strategy for working within our resources is given below.
We will work with a 0.1% sample of the data when building the corpus and its dictionaries of unique words, bigrams and trigrams.
N-gram dictionaries will be prebuilt and written to disk, so that the heavy processing is needed only for the initial load or when the logic changes.
The prediction logic will rely on these prebuilt dictionaries, which must be readable by the machine where the text is being typed.
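
The strategy above can be sketched in a few lines. This is a minimal illustration in Python, not the tm-based pipeline used for the analysis: the function names, the whitespace tokenizer, the fixed seed and the JSON file format are all assumptions made for the example.

```python
import json
import random
from collections import Counter


def sample_lines(lines, rate=0.001, seed=42):
    """Keep each line independently with probability `rate` (0.1% by default)."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sample is reproducible
    return [line for line in lines if rng.random() < rate]


def tokenize(line):
    """Lowercase and split on whitespace; a simple stand-in for real cleaning."""
    return line.lower().split()


def build_ngram_counts(lines, n):
    """Count every run of n consecutive tokens within each line."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def save_counts(counts, path):
    """Write a dictionary to disk so it can be reloaded without recomputation."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(counts, fh)


def load_counts(path):
    """Read a dictionary previously written by save_counts."""
    with open(path, encoding="utf-8") as fh:
        return Counter(json.load(fh))


# Toy corpus standing in for the sampled lines.
corpus = ["the cat sat on the mat", "the cat ate"]
bigrams = build_ngram_counts(corpus, 2)  # "the cat" is counted twice
```

The expensive step (`build_ngram_counts`) runs once; the device that serves predictions would only need something like `load_counts`.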
Although the tm package includes functions for stemming and removing stop words, applying them would not provide a natural base for predicting how one speaks. Suggestions will flow more naturally without this cleansing step.
For comparison, the histograms in the left column are “dirty”, meaning that stop-word removal and stemming have not been applied. The “clean” plots on the right show the data after both steps have been completed.
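
To make the “dirty” versus “clean” distinction concrete, here is a small Python sketch. It is illustrative only: the stop-word list is a tiny hypothetical one, stemming is left out entirely, and none of the names come from the tm package.

```python
from collections import Counter

# Hypothetical, tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "is"}


def tokens_dirty(text):
    """Lowercase and split only; stop words are kept."""
    return text.lower().split()


def tokens_clean(text):
    """Same tokens with stop words removed (stemming is omitted in this sketch)."""
    return [t for t in tokens_dirty(text) if t not in STOP_WORDS]


def top_words(texts, tokenizer, k=3):
    """Return the k most frequent tokens, i.e. the data behind a frequency histogram."""
    counts = Counter()
    for text in texts:
        counts.update(tokenizer(text))
    return counts.most_common(k)


sample = ["The cat sat on the mat", "The dog and the cat"]
dirty_top = top_words(sample, tokens_dirty)  # dominated by "the"
clean_top = top_words(sample, tokens_clean)  # content words only
```

On real text the “dirty” ranking is typically dominated by stop words, which is exactly what a next-word predictor needs to be able to suggest.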
Intuitively, natural word progressions suggest that “cleansed” unigrams together with “dirty” bigrams and trigrams would make the best data features for predicting next words. For first words, one could build a dictionary of the words that most frequently begin sentences.
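
A first-word dictionary of this kind could be built roughly as follows. This is a hedged Python sketch: the function name is hypothetical, and splitting sentences on `.`, `!` or `?` followed by whitespace is a crude heuristic that will mishandle abbreviations and quoted text.

```python
import re
from collections import Counter


def first_words(lines):
    """Count the lowercase first word of every sentence found in the given lines."""
    counts = Counter()
    for line in lines:
        # Split after sentence-ending punctuation followed by whitespace (rough heuristic).
        for sentence in re.split(r"(?<=[.!?])\s+", line.strip()):
            words = sentence.split()
            if words:
                counts[words[0].lower()] += 1
    return counts


starts = first_words(["I went home. The dog barked!", "The end."])
suggestions = [word for word, _ in starts.most_common(3)]
```

The most common entries could then be offered before the user has typed anything.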
To perform well, the prediction algorithm must overcome many obstacles. The two primary ones are:
Making word suggestions when entries don’t match anything in our n-gram dictionaries.
Producing useful results with limited machine resources.
Because the machine resources will not allow an exhaustive search, our logic will need to reach the best conclusions possible while economizing on dictionary disk space and processing power.
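The first step when receiving text input will be to estimate the most likely next word using the Chain Rule of Probability under a Markov assumption, in which only the last one or two words typed are taken into account. If no matching n-gram is found, we will fall back on smoothing.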
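
In standard textbook notation (the symbols $w_i$ for the $i$-th word and $C(\cdot)$ for a count in the n-gram dictionaries are introduced here for illustration), the chain rule factors the probability of a word sequence as

$$
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}).
$$

A second-order Markov assumption, which corresponds to using trigrams, shortens each history to the two preceding words, and the conditional probability can then be estimated from counts:

$$
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1}) \approx \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}.
$$

When the typed context has never been observed, both counts are zero and this estimate gives no usable suggestion, which is the situation that calls for a fallback.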
For smoothing, we will use a back-off method: for example, if a trigram isn’t found, the logic will look in the bigram dictionary, and then in the unigram dictionary. Lastly, if all of these lookups fail, the logic will apply a smoothing method that suggests the best possible choices, possibly by assigning non-zero probabilities to all of the words in the “dirty” unigram dictionary.
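
The back-off lookup can be sketched as below. This is a simplified Python illustration with hypothetical names: it backs off on raw counts without any discounting (so it is not Katz back-off), and its last resort simply returns the most frequent unigrams rather than applying a true smoothing method.

```python
from collections import Counter, defaultdict


def build_index(lines, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    index = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])  # empty tuple when n == 1
            index[context][tokens[i + n - 1]] += 1
    return index


def predict_next(text, trigrams, bigrams, unigrams, k=3):
    """Suggest up to k next words, trying the longest context first."""
    tokens = text.lower().split()
    for index, size in ((trigrams, 2), (bigrams, 1)):
        if len(tokens) >= size:
            context = tuple(tokens[-size:])
            if context in index:
                return [word for word, _ in index[context].most_common(k)]
    # Last resort: the most frequent words overall.
    return [word for word, _ in unigrams[()].most_common(k)]


corpus = ["the cat sat on the mat", "the cat ate the fish", "a dog sat on the rug"]
tri, bi, uni = (build_index(corpus, n) for n in (3, 2, 1))

from_trigram = predict_next("sat on", tri, bi, uni)      # trigram context is known
from_bigram = predict_next("purple cat", tri, bi, uni)   # backs off to the bigram index
from_unigram = predict_next("zzz", tri, bi, uni, k=1)    # nothing matches: unigram fallback
```

Indexing by context keeps each lookup to a single dictionary access, which suits the limited-resource setting described above.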
Since the objective is to predict the next word during text input using limited machine resources, right-sized data samples and precalculated dictionaries are essential. In addition, efficient prediction logic that provides good answers in a reasonable amount of time will be important to the user. A perfect answer would likely take significantly more time to calculate while producing only marginally better results. We will opt for the better user experience.