Our task here is to build a predictive model that speeds text input by suggesting the words a user is most likely to type next. To that end, we need a model that can efficiently calculate and retrieve likely next words without consuming excessive memory or requiring onerous computation.
Ultimately, we plan to create a proof-of-concept web app in R that takes a typed word as input and uses our model to present three options for the subsequent word. The user will then be able to accept one of those options or type a new word, and the app will track the fraction of typed words that the model successfully suggested.
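As a rough illustration only, the planned interface might look something like the Shiny sketch below; `predict_next()` is a hypothetical placeholder for the model, which does not exist yet.

```r
# Minimal sketch of the proof-of-concept interface (predict_next() is a
# hypothetical stand-in for the eventual prediction model).
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type here:"),
  uiOutput("suggestions")   # three buttons with predicted next words
)

server <- function(input, output, session) {
  output$suggestions <- renderUI({
    req(nzchar(input$phrase))
    words <- predict_next(input$phrase, n = 3)   # hypothetical model call
    tagList(lapply(words, function(w) actionButton(paste0("pick_", w), label = w)))
  })
}

shinyApp(ui, server)
```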
Our purpose in this document is to explore the data, identify early challenges, and get a feel for what the final product will look like.
Right off the bat we can see that choosing a sample size and storing the model without consuming excessive memory and processing power will be a challenge. The full English-language data set loads into an object of roughly 800 MB, and computing a term-document matrix on it took the better part of a day on a not-too-old computer. Compute times became reasonable for exploratory analysis only when we took a 1% sample of the total. A larger sample will ultimately be desirable, however: even this 1% sample, totalling 'only' 793,515 words, produced over 57,000 unique terms, each of which can relate to many others.
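For reference, the 1% sample can be drawn with something like the sketch below; the file paths follow the usual layout of the source data and are illustrative.

```r
# Sketch of the 1% sampling used for exploration (paths are illustrative).
set.seed(1234)

sample_lines <- function(path, fraction = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # keep each line independently with probability `fraction`
  lines[rbinom(length(lines), size = 1, prob = fraction) == 1]
}

blogs    <- sample_lines("final/en_US/en_US.blogs.txt")
articles <- sample_lines("final/en_US/en_US.news.txt")
tweets   <- sample_lines("final/en_US/en_US.twitter.txt")
```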
Looking at the trimmed data (with 'trivial' stopwords removed and the remaining words reduced to their stems), we can get a handle on term frequencies: a handful of terms are far and away more common than the rest. The graphs below show how much more frequent the top terms are before those common words are stripped out. Stripping out the trivial words gives us a better view of how the remaining words relate to one another, but we will still need to predict the trivial words themselves, which is highly context-dependent and will require n-gram models.
## [1] "Overall sample statistics:"
##               Blogs Articles Tweets
## Line Count     8992    10102  23601
## Word Count   291071   275380 227064
## Unique Terms  29468    30859  25857
## [1] "Trimmed (stemmed and stopword-removed) sample statistics:"
##               Blogs Articles Tweets
## Line Count     8992    10102  23601
## Word Count   190963   193876 163084
## Unique Terms  20278    21758  19789
## [1] "Most common words in the trimmed data:"
##  one will  get said just like time  can  day year make love  new  now know
## 3209 3194 3107 3039 3009 2935 2552 2502 2258 2146 1987 1976 1929 1826 1824
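The trimmed counts above can be produced along the lines of the following sketch, assuming the `tm` and `SnowballC` packages; `blogs`, `articles`, and `tweets` are the sampled character vectors from the earlier sketch.

```r
# Sketch of the trimming (stopword removal and stemming) and frequency counts.
library(tm)
library(SnowballC)

corpus <- VCorpus(VectorSource(c(blogs, articles, tweets)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop 'trivial' words
corpus <- tm_map(corpus, stemDocument)                       # reduce to stems
corpus <- tm_map(corpus, stripWhitespace)

tdm  <- TermDocumentMatrix(corpus)
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(freq, 15)  # most common words in the trimmed data
```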
There are some simple associations in the pre-stemmed data that look reasonable, but others are less sensible, so even for non-trivial words we will need more context. We also note that calculating these associations took a non-negligible amount of processing, and these methods would struggle to keep up with input in real time. A lightweight model will be required.
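Associations of this kind can be computed with `tm`'s `findAssocs()`; a minimal sketch is below (the term and correlation limit are illustrative, and the associations discussed above were computed on pre-stemmed data rather than the trimmed `tdm`).

```r
# Terms whose per-document counts correlate with "love" above the threshold.
findAssocs(tdm, terms = "love", corlimit = 0.1)
```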
This brief look at the data is only the beginning. We now need to find models that compress the language down to its relations without requiring costly computation, a demand that will only grow as we move to higher-order n-grams. We will need efficient structures for storing these relations and retrieving predictions quickly. Part-of-speech tagging will also likely be used to help suggest words in combinations that never appear in our source samples.
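One lightweight option, sketched below, is to precompute an indexed table of bigram counts once and answer each prediction with a simple keyed lookup; `data.table` is an assumption here, and any keyed structure would serve. This is only a sketch of the kind of model we have in mind, not the final design.

```r
# Sketch of a keyed bigram table: build counts once, then suggest the top
# three continuations of a word with a fast indexed lookup.
library(data.table)

tokens  <- unlist(strsplit(tolower(c(blogs, articles, tweets)), "[^a-z']+"))
tokens  <- tokens[tokens != ""]   # drop empty splits (ignores line boundaries)

bigrams <- data.table(w1 = head(tokens, -1), w2 = tail(tokens, -1))
bigrams <- bigrams[, .N, by = .(w1, w2)]   # count each (w1, w2) pair
setkey(bigrams, w1)                        # index on the first word

suggest <- function(word, n = 3) {
  head(bigrams[.(tolower(word)), nomatch = 0L][order(-N), w2], n)
}

suggest("new")   # top three candidate continuations of "new"
```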