Executive Summary

Our task is to build a predictive model that speeds text input by suggesting the words a user is most likely to type next. To that end, we will design a model that can efficiently calculate and retrieve likely next words without consuming excessive memory or requiring onerous computation.

Ultimately, we plan to create a proof-of-concept web app in R that takes a typed word as input and uses our model to present three options for the next word. The user will then be able to accept one of those suggestions or type a different word, and the app will track the fraction of words that were supplied by the model.
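As a rough illustration only, a skeleton of such an app might look like the following, assuming we use the Shiny framework; predict_next() here is a hypothetical placeholder for the model we have yet to build.

library(shiny)

# Hypothetical placeholder for the eventual model: returns the three
# most likely next words for the text typed so far.
predict_next <- function(text) head(c("the", "to", "and"), 3)

ui <- fluidPage(
  textInput("typed", "Type your text:"),
  uiOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderUI({
    words <- predict_next(input$typed)
    # Present the three candidate words as buttons the user can accept.
    lapply(words, function(w) actionButton(paste0("btn_", w), w))
  })
}

shinyApp(ui, server)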

Our purpose in this document is to explore the data, identify early challenges, and gain a feel for what the final project will look like.

Data Exploration

Right off the bat we can see that choosing a sample size and storing the model without consuming excessive memory and processing power will be a challenge. Loading the full English-language data creates an 800 MB object, and calculating a term-document matrix on it took the better part of a day on a not-too-old computer. Compute times only became reasonable for exploratory analysis once we took a 1% sample of the total. A larger sample will ultimately be desirable, however: even this 1% sample, totaling ‘only’ 793,515 words, produced over 57,000 unique terms, each of which has relationships with many of the others that the model must capture.
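A sampling step along these lines would produce the 1% exploratory sample described above (a sketch; the file names are illustrative rather than the exact corpus paths):

set.seed(1234)  # make the sample reproducible

# Keep roughly 1% of the lines in each source file.
sample_lines <- function(path, frac = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, frac) == 1]
}

blogs    <- sample_lines("en_US.blogs.txt")
articles <- sample_lines("en_US.news.txt")
tweets   <- sample_lines("en_US.twitter.txt")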

Taking a look at the trimmed data (with ‘trivial’ stopwords removed and the remaining words reduced to their stems), we can get a handle on term frequencies: a handful of terms are far and away more common than the rest, as the graphs below show, and the top terms were even more dominant before those common words were stripped out. Stripping out the trivial words gives us a clearer view of how the remaining words are related, but we will still need to predict around them, which will be highly context-dependent and will require n-gram models.
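The trimming described above can be done with the tm package roughly as follows (a sketch, assuming the sampled character vectors blogs, articles, and tweets from the sketch earlier; the exact cleaning steps used for this report may differ):

library(tm)
library(SnowballC)  # provides the stemmer used by stemDocument()

# Build a corpus from the sampled lines and apply the trimming steps:
# lower-case, strip punctuation and numbers, drop English stopwords, stem.
corpus <- VCorpus(VectorSource(c(blogs, articles, tweets)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

# Term-document matrix from which term frequencies can be tallied.
tdm <- TermDocumentMatrix(corpus)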

## [1] "Overall sample statistics:"
##               Blogs Articles Tweets
## Line Count     8992    10102  23601
## Word Count   291071   275380 227064
## Unique Terms  29468    30859  25857
## [1] "Trimmed (stemmed and stopword-removed) sample statistics:"
##               Blogs Articles Tweets
## Line Count     8992    10102  23601
## Word Count   190963   193876 163084
## Unique Terms  20278    21758  19789
## [1] "Most common words in the trimmed data:"
##  one will  get said just like time  can  day year make love  new  now know 
## 3209 3194 3107 3039 3009 2935 2552 2502 2258 2146 1987 1976 1929 1826 1824

Some simple associations in the pre-stemmed data look reasonable, but others are less sensible, so even for non-trivial words we will need more context. We also note that calculating these associations took a non-negligible amount of processing, and these methods would struggle to keep up with input in real time. A lightweight model will be required.
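Associations of this sort can be computed with tm's findAssocs(); a minimal sketch, assuming a term-document matrix tdm_raw built as above but from the un-stemmed corpus, with an illustrative correlation cutoff:

# Terms whose document-level occurrence correlates with the given words;
# the 0.2 cutoff is illustrative. The scan is itself fairly expensive on a
# large matrix, which is part of the real-time concern noted above.
findAssocs(tdm_raw, terms = c("love", "time"), corlimit = 0.2)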

Conclusion

This brief look at the data is only the beginning. We now need to find models that compress the language down to its relations without requiring costly computation, a demand that only grows as we move to higher-order n-grams. We will need to explore efficient structures for storing these relations and retrieving predictions quickly. Part-of-speech tagging will also likely be used to help suggest words in combinations that cannot be found in our source samples.
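As a first step toward such a lightweight model, a plain bigram frequency table with a top-three lookup could serve as a baseline; the sketch below assumes the sampled vectors from the exploration above, and the object and function names are ours:

# Tokenize the sample and build a bigram frequency table.
tokens  <- unlist(strsplit(tolower(c(blogs, articles, tweets)), "[^a-z']+"))
tokens  <- tokens[tokens != ""]
bigrams <- data.frame(first = head(tokens, -1), second = tail(tokens, -1))
counts  <- aggregate(cnt ~ first + second, cbind(bigrams, cnt = 1), FUN = sum)

# Return the three most frequent followers of a given word.
predict_next <- function(word, n = 3) {
  candidates <- counts[counts$first == tolower(word), ]
  head(candidates[order(-candidates$cnt), "second"], n)
}

predict_next("new")  # illustrative call; output depends on the sample

In practice the table would live in a faster structure (for example a data.table keyed on the first word, or a hashed environment) so that lookups keep pace with typing; that is exactly the kind of efficiency question the next phase will need to address.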