This report describes the exploratory analysis of the data provided as part of the Coursera Data Science Capstone Project. The data files have been read, and some characteristics of the data are summarized in tables and plots. The report concludes with a plan for a text prediction algorithm that will be hosted on www.shinyapps.io.
The data consist of three files containing samples of Twitter, blog, and news text, as indicated in the following table:
| File | Size (Bytes) | Lines | Words |
|---|---|---|---|
| en_US.twitter.txt | 167105338 | 2360148 | 30373832 |
| en_US.blogs.txt | 210160014 | 899288 | 37334441 |
| en_US.news.txt | 205811889 | 1010242 | 34372598 |
The files were read into R, and basic text cleaning was done to remove profanity, stop words, punctuation, and numbers. The Text Mining Package was used for most of the text processing. The numbers of lines and words were reduced as indicated in the following table:
| Text | Lines | Words |
|---|---|---|
| twitter | 2360113 | 20602310 |
| blogs | 898479 | 14935514 |
| news | 77223 | 1201591 |
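The cleaning step described above can be sketched as follows. This is a minimal illustration only: the report's processing was done in R with the Text Mining Package, whereas this sketch uses plain Python, and the names `STOP_WORDS`, `PROFANITY`, `clean_line`, and `clean_corpus` as well as the tiny word lists are hypothetical stand-ins, not the lists actually used.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only;
# the report does not say which stop-word or profanity lists it used.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}
PROFANITY = {"darn"}


def clean_line(line):
    """Lowercase a line, strip punctuation and numbers, drop unwanted words."""
    text = line.lower()
    # Replace every character that is not a-z or whitespace with a space.
    # This removes digits and punctuation (and splits words at apostrophes).
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS and t not in PROFANITY]


def clean_corpus(lines):
    """Clean every line and discard lines that end up with no words."""
    cleaned = [clean_line(line) for line in lines]
    return [tokens for tokens in cleaned if tokens]
```

Under these assumptions, a line that contains only stop words or numbers becomes empty and is dropped, which is one way a cleaning pass can reduce the line count as well as the word count.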
It is interesting to note the relative effect of the data cleaning on the different types of text. The Twitter text retains a large share of its words (roughly 68%), while the news text is reduced to a much greater extent (under 4% of its words remain). The counts of the single words left in the three texts are shown in the following graphs. Note that each distribution tails off with a very long tail. This kind of distribution is common for word counts in texts.
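The single-word counts behind such graphs can be computed along the following lines. This is a sketch in Python rather than the R code used for the report; `word_frequencies` and `coverage` are hypothetical helper names, and the input is assumed to be a list of already cleaned, tokenized lines.

```python
from collections import Counter


def word_frequencies(token_lines):
    """Count how often each word occurs across all tokenized lines."""
    counts = Counter()
    for tokens in token_lines:
        counts.update(tokens)
    return counts


def coverage(counts, top_k):
    """Fraction of all word occurrences covered by the top_k most frequent words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(top_k))
    return top / total
```

In a long-tailed distribution, `coverage` rises quickly for small `top_k` and then flattens, because most distinct words occur only a handful of times. That property is what makes it plausible to shrink the vocabulary later without losing much of the text.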
Now that the data is available, the next step is to develop the prediction application in Shiny. The idea is to train a model on a percentage of the text data (commonly called the corpora) and to hold back a smaller portion for testing. The most common word prediction model uses N-Grams. An N-Gram is a sequence of N consecutive words found in the training corpora. A 2-Gram or Bigram model uses the preceding word (the first N-1 = 1 word of the N-Gram) to predict the next word: given a specific word, the model derives a probability for each candidate next word from the training counts. A 3-Gram or Trigram model uses the combination of the first and second words to predict the third word. A 4-Gram or Quadrigram model constrains the word selection even more, which generally gives more accurate predictions when a matching N-Gram exists.
There are several techniques for handling input for which the corpora contain no matching N-Gram. The one I will use is called the Backoff Method. When an N-Gram is not matched, we back off and try to match a shorter (N-1)-Gram, and so on until a word can be predicted.
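The backoff idea can be illustrated with a small sketch. It is written in Python for brevity (the application itself is planned in R and Shiny), the function names `build_ngram_tables` and `predict_next` are hypothetical, and it implements only the simplest form of backoff: pick the most frequent continuation of the longest matching context, with no discounting or probability redistribution as in more elaborate schemes.

```python
from collections import Counter, defaultdict


def build_ngram_tables(token_lines, max_n=3):
    """Map each context (a tuple of 0 to max_n-1 words) to counts of the next word."""
    tables = defaultdict(Counter)
    for tokens in token_lines:
        for i, word in enumerate(tokens):
            for k in range(max_n):       # k is the context length
                if i - k < 0:
                    break                # not enough preceding words on this line
                context = tuple(tokens[i - k:i])
                tables[context][word] += 1
    return tables


def predict_next(tables, history, max_n=3):
    """Try the longest context first, then back off to shorter contexts."""
    for k in range(max_n - 1, -1, -1):
        if k > len(history):
            continue
        context = tuple(history[len(history) - k:])
        candidates = tables.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None                          # empty model: nothing to predict
```

With `max_n=3`, the function first looks for a Trigram match on the last two words, then a Bigram match on the last word, and finally falls back to the most frequent single word. Keeping all context lengths in one table is a convenience for the sketch; a memory-constrained deployment would likely prune rare N-Grams.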
Once the model is developed, I will use the test data set to determine how well it performs. A measure called Perplexity is used to evaluate how well the model predicts text it has not seen; lower perplexity indicates a better fit. There are also challenges in making the application small enough to fit on the hosted Shiny server, so using the host memory effectively will be necessary. A reasonable response time is important, given that this application is meant to simulate the word prediction found on a smartphone or tablet. Trade-offs in the model will therefore likely have to be made, and I will use the perplexity measurement to determine how much impact those trade-offs have on the model. One other interesting idea would be to use the user’s input to add to the corpora and adjust the model on the fly. That will be a bonus once the application framework is in place.
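As a rough illustration of the evaluation step, perplexity is the exponential of the average negative log probability that the model assigns to each word of the test text. The sketch below computes it for a Bigram model in Python. The add-one smoothing, the `<s>` start marker, and the names `train_bigram`, `bigram_prob`, and `perplexity` are assumptions made for the example; they are one simple way to keep every probability non-zero, not necessarily what the final model will use.

```python
import math
from collections import Counter


def train_bigram(token_lines):
    """Collect context counts, Bigram counts, and the vocabulary."""
    context_counts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in token_lines:
        padded = ["<s>"] + tokens        # "<s>" marks the start of a line
        vocab.update(tokens)
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigrams[(prev, word)] += 1
    return context_counts, bigrams, vocab


def bigram_prob(model, prev, word):
    """Add-one smoothed P(word | prev); one extra slot covers unseen words."""
    context_counts, bigrams, vocab = model
    v = len(vocab) + 1
    return (bigrams[(prev, word)] + 1) / (context_counts[prev] + v)


def perplexity(model, token_lines):
    """exp of the average negative log probability per predicted word."""
    log_sum, n = 0.0, 0
    for tokens in token_lines:
        padded = ["<s>"] + tokens
        for prev, word in zip(padded, padded[1:]):
            log_sum += math.log(bigram_prob(model, prev, word))
            n += 1
    if n == 0:
        raise ValueError("no words to evaluate")
    return math.exp(-log_sum / n)
```

Computing this value on the same held-out text before and after a size reduction (for example, pruning rare N-Grams) gives a single number for comparing the trade-off.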