Summary: The intent of this initial report is to convey findings from initial exploration of the dataset for the JHU/Coursera Capstone. The motive for exploration is to lay the groundwork for subsequent development of a language model that will be reasonably efficient and accurate at predicting the next word in a string, while conserving system resources. Based on this initial exploration, it appears that performance and storage advantages may be optimized using a hybrid predictive language model. That model will draw on patterns matched in frequent n-grams from the training set, part-of-speech tagging for generalized pattern matching in cases of infrequent or unobserved word combinations, and leveraging a system’s installed spelling correction services.
Acquiring and reading the data: A couple of warnings are returned when reading the twitter file. Analysis of the source files found that these lines are truncated, but the handful of errors is quite small, given the size of the corpus, and should have no significant effect on the final language model.
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
At an overview, the corpus break down is shown below. For the purpose of measuring the corpus, word counts exclude punctuation and digits, although the language model’s ultimate filter, if any, will need a more nuanced approach, especially if emoticons and SMS expressions r2b considered as units of meaning worth recommending ;-).
##
##
## |source | items| words| avg words|
## |:-------|-------:|--------:|---------:|
## |twitter | 2360148| 29370686| 12.44|
## |blogs | 899288| 36817646| 40.94|
## |news | 1010242| 1010242| 33.14|
Viewed as word clouds of sampled texts, differences are evident among the sources. Represented below from left to right are entries from blogs, Twitter, and news. The smaller, more uniform font size in the news cloud, suggest a greater variety of words, with less repetition.
## Loading required package: RColorBrewer
## Loading required package: tm
## Loading required package: NLP
Before thinking further about which language models will be helpful in predicting the next word in an input string, I should summarize the states the input may be in at various points. For the purpose of this project, the following states are understood.
| State | Description | Explanation |
|---|---|---|
| Zero state | The context-free state | This state exists at the beginning of text entry, or after only non-alphabetic input has been made. A user entering only an emoticon or phone number would remain in this state. |
| First partial word | A minimal context state | The user has begun to enter text, but has not reached the first word boundary. In this state, results from a spell-checker with recommendation engine may be combined with pattern matching against uni-grams (unique words from the training set) for next word prediction. It can be assumed that any recent smartphone or other platform will have an existing spelling function that should be accessible to the application. This should allow us to get some additional predictive performance, without adding to the storage requirement. |
| Single full word | A minimal context state | The user has entered one word followed by a word boundary (space or certain punctuation), but has not begun a second word. In this state, my only guide to prediction is observed next-word frequency from the set of two-word combinations in the training data (bi-grams). |
| Subsequent full word | n-gram context state | n full words have been entered and the user is at a word boundary. Multiple models may be combined at this point. If a high-frequency next word is suggested by 3- or 4-grams in the training model, it is an obvious choice. However, there is enough context to begin to suggest next words that were not in the training model. The choice of models may be governed by the active language. For example, the English language allows a predictive model to suggest articles after prepositions with fairly high probability. It could also apply stemming or possibly an implementation Levenshtein Distance or other text similarity metrics. The choice will need to balance performance against speed and resource requirements. As discussed below part-of-speech tagging appears to offer a way to significantly compress the model. |
| Subsequent partial word | n-gram context state | n full words have been entered before beginning the current word. This is the most complex state, affording all of the opportunities of the Subsequent full word state, plus the option to consider recommended spellings for the incomplete word. |
For this project, I will set aside the no-context state. For zero context, the application will provide no recommendations.
Once text entry has started, I will provide recommendations for word completion as suggested by the R Aspell package combined with pattern matching against the set of observed words from the training corpus. It remains to be seen how the speller performs in a hosted Shiny app, but in R-Studio it appears promising.
With more context, the prediction should improve.
Using a random sample of one in one thousand blog entries from the corpus and a function that uses simple pattern matching, the model suggests {thing, foods, places, things, tours} as completions for the input ‘the same thin’. The aspell function returns ‘think’ among its top recommendations, so an algorithm that considers the intersection of these sets is promising.
From a quick comparison of the n-grams for n={2,3,4}, it is clear that my eventual language model will be able to benefit from compressing high-frequency bi-grams. Looking at some of the most frequent bi-grams I find that combinations of English prepositions and articles top the list. To demonstrate, note that the most frequent context words for ‘of the’ are {one of the, end of the, part of the, of the year, some of the}, compared to (in the first, in the fourth, in the region, in the US, being in the) for the bi-gram ‘in the,’ and (go to the, back to the, going to the, compared to the, to the bank) for ‘to the.’
Given the size of the training corpus, I would clearly run into high repetition and needless storage costs. A compression strategy that should work reasonably well is part of speech (POS) tagging. Efficient POS tagging is available from the openNLP package for R. An example of the storage gains is illustrated below.
Say I have the training string, “Tom is a senior vice president and managing partner.” That raw object is 152 bytes.
The set of 3-grams is {“a senior vice”,“president and managing”,“and managing partner.”,“vice president and”,“senior vice president”,“is a senior”,“Tom is a”}, costing 560 bytes of storage. Storing high-frequency n-grams may be worth that cost, but rare combinations will not provide much predictive power for the price of storage.
A better solution is to compress low frequency n-grams using part of speech tagging. A list structure storing a word as key, the two preceding parts of speech, and the frequency of that pattern should be quick to store and access, at a storage cost of 256 bytes. A variant could encode the POS codes as integers, which reduces the size to 160 bytes. However, the extra processing may not be worth that savings, so some performance monitoring should be done early on to measure that.
Regardless of whether the character or integer encoding of POS tags is used, the savings becomes significant. For example, were I to encounter forty variations describing someone as something (“tom is a…”, “alice is a…”, etc.), I would quickly approaches 10KB just to store the resultant n-grams. By comparison, the compressed POS tag entry remains fixed at 256 or 160. This part of the model should scale well.
An attainable and interesting goal for this project appears to be one that focuses on building an n-gram model with POS tagging. The stretch goal will be tack on the use of the system spell-checker, represented by the aspell function in R, to interact with the model.
Anaysis of the model should yield statistical measures of the predictive accuracy against test data, as I vary the thresholds for POS model compression, size of the training set, and inclusion of 3-grams vs. 4-grams in the model.