This document summarizes the data that will be used for initial model building to predict the next word from the preceding words. There are three separate data sets (roughly 260 to 320 MB each): blogs, news, and tweets.
The document also briefly explains the approach that will be used to create and implement a word prediction model using this data.
Let’s first look at the summary of all the data:
Metric | Units | Blogs | News | Tweets |
---|---|---|---|---|
Size | MB | 261 | 262 | 316 |
Elements | Millions | 0.899 | 1.01 | 2.36 |
Sentences | Millions | 2.29 | 2.19 | 3.23 |
Words | Millions | 37.3 | 34.4 | 30.4 |
Because these data sets are fairly large, exploratory data analysis (EDA) is done on a random sample rather than on the full data. For this analysis, 1% of each data set is randomly sampled. This avoids memory issues while still being broadly representative of the entire data. It is worth noting that although tweets contain roughly 40 to 50% more sentences than blogs or news (3.23 million versus 2.29 and 2.19 million), they contain fewer words in total. This is reasonable since tweets are limited to 140 characters.
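The 1% sampling step can be sketched as follows. This is a minimal illustration in Python, not the code used for this report; the function name `sample_lines`, the `fraction` parameter, and the fixed `seed` are assumptions introduced here.

```python
import random


def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction`.

    `seed` is fixed only so that the sample is reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]
```

Because each line is kept independently, the sample size is only approximately 1% of the input, and the input can be streamed from a file one line at a time instead of being loaded fully into memory.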
After sampling, the smaller data set is cleaned by removing extra whitespace, stopwords, punctuation, and numbers, and by converting the text to lowercase. Now let’s look at the words using word clouds and bar plots of word frequencies from the blog, news, and Twitter samples respectively.
One interesting thing to note here is that people generally express positive emotions (good, great, thanks, love) when sharing their thoughts on Twitter. One can also see that news mostly reports past events (with “said” as the most common word), which is not surprising. We also observe that blogs mainly talk about present events and are written in a more likeable tone. The bar plots of word frequencies support similar observations.
Using all three candidate data sets to build the n-gram prediction model is a good starting point, since each captures distinct characteristics of the language with respect to emotion and time. Our goal is to build a model with as much information as possible for better accuracy, without letting it grow so large that its memory footprint hurts responsiveness. Our plan is as follows:
First, we will combine all three data sets into one and then split it into a) training (80%), b) development (10%), and c) test (10%) data. The training set will be used for model building, while the development set will be used to improve the model’s performance. The test data will be kept aside and used only once, to assess the final model’s performance.
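A minimal sketch of the 80/10/10 split, written in Python for illustration. The function name `split_corpus` and its parameters are hypothetical, and shuffling before splitting is an assumption made so that blogs, news, and tweets are mixed across all three parts.

```python
import random


def split_corpus(lines, train=0.8, dev=0.1, seed=42):
    """Shuffle the combined corpus and cut it into train/dev/test parts.

    Whatever remains after the train and dev parts becomes the test set.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    train_set = shuffled[:n_train]
    dev_set = shuffled[n_train:n_train + n_dev]
    test_set = shuffled[n_train + n_dev:]
    return train_set, dev_set, test_set
```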
Then we will clean the data much as we did in the exploratory analysis, except that we will keep the stop words, since we want to predict those as well. In addition, any explicit language will be removed before building the prediction model.
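The cleaning step might look like the sketch below. It is an illustration under stated assumptions rather than the report's actual pipeline: `STOPWORDS` and `BLOCKED_WORDS` are tiny hypothetical placeholder lists, the function name `clean_text` is invented here, and explicit language is handled by dropping individual blocked words.

```python
import re

# Hypothetical placeholder lists; a real model would load fuller lists.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}
BLOCKED_WORDS = {"darn", "heck"}


def clean_text(text, remove_stopwords=False, blocked=BLOCKED_WORDS):
    """Lowercase the text and strip numbers, punctuation, and extra whitespace.

    Stop words are kept by default, as planned for the prediction model;
    pass remove_stopwords=True to mimic the exploratory cleaning.
    """
    text = text.lower()
    # Replace everything except lowercase ASCII letters, apostrophes, and
    # whitespace. Simplifying assumption: this also drops accented letters
    # and curly apostrophes.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    tokens = [t for t in text.split() if t not in blocked]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)
```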
Using the training set, we will build 1-grams through 5-grams. Each n-gram will be split at the last word to calculate the maximum likelihood estimate (MLE) of the last word given the first n-1 words. For example, the 3-gram “how are you” will be split into “how are” and “you” to calculate how often “you” follows “how are”.
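As a sketch of this calculation, the MLE of a word given a prefix is the number of times the prefix is followed by that word, divided by the number of times the prefix is followed by any word. The Python below illustrates this under simple assumptions: whitespace tokenization, n-grams that do not cross sentence boundaries, and hypothetical names `build_ngram_table` and `mle`.

```python
from collections import Counter, defaultdict


def build_ngram_table(sentences, n):
    """Map each (n-1)-word prefix to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            last_word = tokens[i + n - 1]
            table[prefix][last_word] += 1
    return table


def mle(table, prefix, word):
    """count(prefix followed by word) / count(prefix followed by any word)."""
    counts = table.get(tuple(prefix))
    if not counts:
        return 0.0
    return counts[word] / sum(counts.values())
```

For instance, if the training data contained “how are you” twice and “how are they” once, `mle(table, ("how", "are"), "you")` would be 2/3.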
Then a search algorithm will be built that matches the first part (n-1 words) of each n-gram against the string entered by the user; the last word of the matching n-gram with the highest MLE will be presented to the user.
To accommodate unseen strings, a back-off approach may be used: if no matching string is found among the n-grams, the search will fall back to the (n-1)-gram data, going all the way down to unigrams if necessary.
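The search and back-off steps above can be sketched together as follows. This is a simple illustration, not a definitive implementation: it uses plain back-off with no discounting or smoothing, and the names `build_tables` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_tables(sentences, max_n=5):
    """For n = 1..max_n, map each (n-1)-word prefix to next-word counts."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                prefix = tuple(tokens[i:i + n - 1])
                tables[n][prefix][tokens[i + n - 1]] += 1
    return tables


def predict_next(tables, text, k=3):
    """Return up to k candidate next words, backing off to shorter n-grams."""
    if not tables:
        return []
    tokens = text.lower().split()
    for n in range(max(tables), 0, -1):
        if len(tokens) < n - 1:
            continue  # not enough typed words to form this prefix
        # For unigrams (n == 1) the prefix is the empty tuple.
        prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
        counts = tables[n].get(prefix)
        if counts:
            # Within one prefix, the highest count is also the highest MLE,
            # because every candidate shares the same denominator.
            return [word for word, _ in counts.most_common(k)]
    return []
```

With tables built from the cleaned training set, `predict_next(tables, "how are")` first looks for a 3-gram match and only falls back to 2-grams and then unigrams when nothing is found.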
The accuracy and performance of the model will be assessed on the development data set, checking how large the model is, how long it takes to present the next word, and how accurate its predictions are.
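One possible way to measure accuracy and response time on the development set is sketched below. The choice of metric (whether the true next word appears among the top k suggestions) and the names `evaluate` and `predict` are assumptions made for illustration; `predict` stands for any function that takes the typed text and returns a list of suggested words. Model size is not measured here.

```python
import time


def evaluate(predict, dev_sentences, k=3):
    """Score a predictor on every word position of the development sentences."""
    hits = 0
    total = 0
    start = time.perf_counter()
    for sentence in dev_sentences:
        tokens = sentence.split()
        for i in range(1, len(tokens)):
            context = " ".join(tokens[:i])
            guesses = predict(context)[:k]
            if tokens[i] in guesses:
                hits += 1
            total += 1
    elapsed = time.perf_counter() - start
    return {
        "predictions": total,
        "accuracy": hits / total if total else 0.0,
        "seconds_per_prediction": elapsed / total if total else 0.0,
    }
```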
The final model will be integrated into a Shiny app in which a user can enter a string of words and be presented with either the most likely next word or the top three candidates.