The data used for this project were generously provided by SwiftKey as part of the Coursera Data Science Specialization taught by Professors Jeff Leek, Roger Peng, and Brian Caffo. The data are blocks of English text from newspapers, blogs, and Twitter.
| | file | size (MB) | num. of lines | longest line | num. of words |
|---|---|---|---|---|---|
| 1 | en_US.blogs.txt | 200.42 | 899288 | 483415 | 37334441 |
| 2 | en_US.news.txt | 196.28 | 77259 | 14556 | 2643972 |
| 3 | en_US.twitter.txt | 159.36 | 2360148 | 1484357 | 30373792 |
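For reference, statistics like those in the table above can be gathered with base R and the stringi package. The sketch below reflects an assumed approach (word counts via stri_count_words(), longest line measured in characters), not necessarily the exact code used to build the table.

```r
library(stringi)

# Summarize one corpus file: size on disk, line count, longest line, word count
summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    file         = basename(path),
    size.MB      = round(file.size(path) / 1024^2, 2),
    num.of.lines = length(lines),
    longest.line = max(nchar(lines)),           # length in characters (assumed meaning)
    num.of.words = sum(stri_count_words(lines))
  )
}

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
do.call(rbind, lapply(files, summarize_file))
```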
Using a subset of 5,000 news excerpts, 5,000 blog posts, and 5,000 tweets, I explored the dataset with R's quanteda package. I generated frequency distributions of individual words ("unigrams") and of word combinations ("n-grams") to gauge how large a vocabulary the final model needs to cover; a code sketch of this exploration follows the two frequency tables below.
| unigram | frequency |
|---|---|
| said | 1446.00 |
| will | 1324.00 |
| one | 1297.00 |
| just | 1183.00 |
| like | 1073.00 |
| can | 1021.00 |
| time | 899.00 |
| get | 835.00 |
| new | 814.00 |
| now | 699.00 |
| trigram | frequency |
|---|---|
| one_of_the | 154.00 |
| a_lot_of | 150.00 |
| going_to_be | 78.00 |
| to_be_a | 77.00 |
| i_don_t | 74.00 |
| it_was_a | 67.00 |
| i_want_to | 65.00 |
| the_end_of | 63.00 |
| some_of_the | 62.00 |
| out_of_the | 60.00 |
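A minimal sketch of how these frequency tables can be produced with quanteda is shown below. Here `sample_text` is assumed to be a character vector holding the 15,000 sampled lines, and the preprocessing choices (lower-casing, dropping punctuation and numbers, removing stopwords only for the unigram counts) are assumptions rather than the exact settings used above.

```r
library(quanteda)

# sample_text: assumed character vector containing the 15,000 sampled lines
toks <- tokens(sample_text,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)

# unigram frequencies (common English stopwords dropped, as in the first table)
uni_dfm <- dfm(tokens_remove(toks, stopwords("en")))
topfeatures(uni_dfm, 10)

# trigram frequencies ("w1_w2_w3" features, stopwords kept, as in the second table)
tri_dfm <- dfm(tokens_ngrams(toks, n = 3))
topfeatures(tri_dfm, 10)
```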
The final project is to create a Shiny app, deployed to shinyapps.io, that will (a skeleton sketch follows the list below):
* Receive a string of text
* Predict the next word in the string
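A bare-bones skeleton of such an app might look like the sketch below; `predict_next_word()` is a hypothetical name for the back-off lookup sketched after the next paragraph.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    # predict_next_word() is the back-off lookup sketched further below
    paste(predict_next_word(input$phrase), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)
```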
My basic approach will be a 4-gram back-off model. I will create data frames of tokenized 4-grams, 3-grams, and 2-grams, read in the user's input, and identify the top words that follow the longest matching prefix (a sketch appears below). I have experimented with full corpora and document-feature matrices (quanteda and tm packages) as well as with more sophisticated machine-learning methods, but in every case my personal computer cannot handle the processing required to analyze 100,000 lines of text and crashes, despite repeated efforts to remove unneeded objects, run garbage collection (gc()), and other performance-improving measures.
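A minimal sketch of that back-off lookup is below. It assumes the n-gram data frames (here named ngram_2, ngram_3, and ngram_4) each hold a `prefix` column with the leading words joined by "_" and a `word` column with the candidate next word, and that rows are sorted by decreasing frequency; all names are illustrative.

```r
# Back-off lookup sketch: try the 4-gram table first, then 3-grams, then 2-grams
predict_next_word <- function(phrase, top_n = 3) {
  # simple whitespace tokenization of the user's input, lower-cased
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  for (n in 3:1) {
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = "_")
    tbl <- switch(as.character(n), "3" = ngram_4, "2" = ngram_3, "1" = ngram_2)
    hits <- tbl[tbl$prefix == prefix, ]
    if (nrow(hits) > 0) return(head(hits$word, top_n))
  }
  "the"  # last-resort fallback when no prefix matches
}
```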
Regardless of the algorithmic approach, I plan to precompute the R objects used for prediction, save them to the server, and load the model and DFM objects at startup rather than building them on the fly, in order to improve performance for the end user.
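One straightforward way to do this in R is to serialize the precomputed tables with saveRDS() and load them with readRDS() when the app starts; the object and file names below are the illustrative ones from the back-off sketch above.

```r
# Offline, during model building: serialize the n-gram tables once
saveRDS(ngram_4, file = "ngram_4.rds")
saveRDS(ngram_3, file = "ngram_3.rds")
saveRDS(ngram_2, file = "ngram_2.rds")

# In the Shiny app: load the precomputed objects once at startup
ngram_4 <- readRDS("ngram_4.rds")
ngram_3 <- readRDS("ngram_3.rds")
ngram_2 <- readRDS("ngram_2.rds")
```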