The purpose of the Capstone project is to build a “smart” keyboard that makes it easier for people to type on their mobile devices. One cornerstone of a “smart” keyboard is a predictive text model: when someone types “I went to the”, the keyboard presents three options for what the next word might be, for example *gym*, *store*, or *restaurant*. In this Capstone we work on understanding and building predictive text models like those used by our corporate partner SwiftKey.

**The code has been suppressed and was moved to the appendix to maintain readability of this document.**

## 2. Requirements

For the Capstone we need to consider the following requirements:
- The models must be built within the amount of RAM available on the computer used; in our case a MacBook Pro with 8 GB of memory.
- The Shiny app must run within the 1 GB of memory provided by the free tier of shinyapps.io.
- The load time of the application must be kept short; users should not be kept waiting for the app to start.
These requirements are realistic in practice: currently available predictive text models run on mobile phones, which typically have limited memory and processing power compared to desktop computers.
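To keep an eye on the first two constraints, we can check the in-memory footprint of any candidate model before deployment. A minimal sketch in base R; `model` is a placeholder for whatever n-gram tables we end up with:

```r
# Compare the model's in-memory size against the 1 GB shinyapps.io budget.
# `model` is a placeholder for our n-gram frequency tables.
budget_mb <- 1024
model_mb  <- as.numeric(object.size(model)) / 1024^2
cat(sprintf("model uses %.1f MB of the %.0f MB budget\n", model_mb, budget_mb))
```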
## 3. Getting and Cleaning the Data

We load the data from the Coursera website (Capstone Dataset). The goal of this task is to become familiar with the corpora and do the necessary cleaning.
First we get an idea of the size of the corpora. We determine the file size of each text file in the corpora (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt).
Then we load each file and count the number of lines it contains.
Finally we determine the number of words in each file.
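The computation behind the table below can be sketched as follows, assuming the three files sit in the working directory; `stri_count_words()` from the stringi package does the word counting:

```r
library(stringi)

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# For each file: size in MB, line count and word count in millions
stats <- t(sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(size_mb = file.size(f) / 1024^2,
    lines_m = length(lines) / 1e6,
    words_m = sum(stri_count_words(lines)) / 1e6)
}))
stats
```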
| file | size (MB) | lines (millions) | words (millions) |
|---|---|---|---|
| en_US.blogs.txt | 200.4242 | 0.8576279 | 35.80689 |
| en_US.news.txt | 196.2775 | 0.9634418 | 33.15200 |
| en_US.twitter.txt | 159.3641 | 2.2508125 | 28.69931 |
As you can see, each file is enormous. We cannot use all the data for this project, as that would require more computing power and memory than we have available. Therefore we take an initial 5% subset of the data and load it as our training data set. Depending on how our prediction model performs, we may tweak this percentage up or down.
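A minimal sketch of the sampling step, reusing the `files` vector from above; the 5% draw is made per line with `rbinom()`:

```r
set.seed(1234)  # arbitrary seed, for reproducibility only

# Keep each line with probability 0.05 and pool the results
sample_lines <- function(f, p = 0.05) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = p) == 1]
}

train <- unlist(lapply(files, sample_lines))
writeLines(train, "train_sample.txt")
```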
To build a predictive model for text we need to understand the distribution and relationship between the words, tokens, and phrases in the text.
We perform profanity filtering: we remove profanity and other words we do not want to predict. For this purpose we downloaded a “Terms to Block” file from Front Gate Media.
We use quanteda, a text analytics package that provides a rich set of text analysis features coupled with excellent performance relative to Java-based R packages for text analysis.
We remove punctuation, numbers, separators, and English stopwords.
We create unigrams, bigrams and trigrams.
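A sketch of the cleaning and n-gram pipeline in quanteda, assuming `train` holds the sampled lines; the blocklist file name is a placeholder for the “Terms to Block” download:

```r
library(quanteda)

# Blocklist from the "Terms to Block" download (file name is a placeholder)
badwords <- readLines("terms_to_block.txt", skipNul = TRUE)

# Tokenize, stripping punctuation, numbers, and separators
toks <- tokens(corpus(train),
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_separators = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = stopwords("en"))  # English stopwords
toks <- tokens_remove(toks, pattern = badwords)         # profanity filter

# Document-feature matrices for uni-, bi-, and trigrams
unigrams <- dfm(toks)
bigrams  <- dfm(tokens_ngrams(toks, n = 2))
trigrams <- dfm(tokens_ngrams(toks, n = 3))
```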
The following graphs show the top 20 unigrams, bigrams, and trigrams.
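The graphs can be produced from the matrices above with quanteda’s `topfeatures()`; a sketch for the unigram case:

```r
# Plot the 20 most frequent unigrams, largest at the top
top20 <- topfeatures(unigrams, n = 20)
barplot(rev(top20), horiz = TRUE, las = 1,
        main = "Top 20 unigrams", xlab = "frequency")
```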
The table below shows the number of unique n-grams before and after removing singletons (n-grams that occur only once).

| n-gram | unique n-grams (total) | unique n-grams (singletons removed) |
|---|---|---|
| unigram | 144,160 | 64,352 |
| bigram | 1,570,436 | 369,635 |
| trigram | 3,350,828 | 346,532 |
The next table shows how many unique words are needed to cover a given percentage of all word instances in the corpus.

| coverage | unique words |
|---|---|
| 50% | 967 |
| 90% | 13,103 |
| 99% | 51,022 |
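The coverage figures come from the cumulative frequency of the sorted unigram counts; a sketch, reusing `unigrams` from above:

```r
# Sort unigram frequencies, then find how many words reach each coverage level
freq    <- sort(colSums(unigrams), decreasing = TRUE)
cum_cov <- cumsum(freq) / sum(freq)
sapply(c(0.50, 0.90, 0.99), function(p) which(cum_cov >= p)[1])
```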
## 4. Next Steps: Develop a Word Prediction Model

Our next steps are to:

- build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words, using input from the exploratory analysis performed above;
- build a model that is optimized to run in a minimal amount of memory and takes the least amount of time to make a prediction (a first sketch follows this list);
- run the model on the shinyapps.io server.
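As a starting point for the optimized model, a minimal backoff-style lookup could look like the sketch below. It assumes frequency tables `tri`, `bi`, and `uni` (data frames with columns `prefix`, `word`, and count `n`, each pre-sorted by descending `n`) precomputed from the n-gram counts above; all names are placeholders:

```r
# Return the top-k candidate next words for an input phrase, backing off
# from trigram to bigram to unigram matches. Tables are assumed to be
# sorted by descending count, so earlier candidates are more frequent.
predict_next <- function(input, k = 3) {
  w <- tail(strsplit(tolower(input), "\\s+")[[1]], 2)
  cand <- c(
    tri$word[tri$prefix == paste(w, collapse = "_")],  # last two words
    bi$word[bi$prefix == tail(w, 1)],                  # last word only
    uni$word                                           # global fallback
  )
  head(unique(cand), k)
}

predict_next("I went to the")  # could return e.g. "gym" "store" "restaurant"
```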
We anticipate that getting to the desired results will be an iterative process.