The capstone project for the Johns Hopkins University Data Science Specialization is focused on building a text predictor based on three text files. One file is a collection of blog entries, one of news entries, and one of Twitter ‘tweets’. There has been no pre-processing of the text files.
The following statistics for each file were generated using base R, the qdap and tau packages, and several custom functions to clean the data for comparison with the raw data:
| | News | Blogs | Twitter |
|---|---|---|---|
| Number of Lines | 1010243 | 899289 | 2360149 |
| Total Words (w/Repeats) | 34111041 | 37014114 | 30258769 |
| Total Words (Unique) | 742050 | 1028170 | 1040151 |
| Total Words (Cleaned and w/Repeats) | 26961723 | 29252659 | 23699092 |
| Total Words (Cleaned and Unique) | 218542 | 237457 | 290150 |
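As a rough illustration (not the actual custom functions used to build the table), the line and raw word counts can be approximated with base R alone; the file name below is an assumption, and splitting on whitespace is a cruder tokenization than the qdap and tau helpers:

```r
# Sketch only: file name assumed, whitespace splitting approximates the tokenization
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)

length(news)                              # number of lines
tokens <- unlist(strsplit(news, "\\s+"))  # word-like tokens
tokens <- tokens[tokens != ""]            # drop empty strings left by leading spaces
length(tokens)                            # total words (w/repeats)
length(unique(tolower(tokens)))           # total unique words (case-insensitive)
```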
Each of the text files contains hundreds of thousands of lines, with the Twitter file containing over 2 million. When broken down into words (or word-like structures), there are over 30 million in each file. Consolidating into unique words drastically reduces the count, by a factor of about 30 or more for each file. Cleaning the text by removing numbers, converting to lower case, and removing excess white space compresses the data further, to between 24 million and 29 million words and only 218,542 to 290,150 unique words. The one caveat with the removal of punctuation was the attempt to preserve contractions: the regular expressions used did preserve most of them, but there are still instances of broken contractions in the text. When the cleaned, unique words from all three files are combined, the entire vocabulary for the training set is 521,541 words. However, analysis has shown that some of these words are over 50 characters long and are most likely non-words that remain in the files.
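The cleaning steps described above might look roughly like the following sketch (an assumption, not the exact custom functions used); the second pattern tries to keep apostrophes that sit inside contractions:

```r
# Sketch of the cleaning steps: numbers, punctuation, and excess white space
# are removed, case is lowered, and in-word apostrophes are preserved
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[0-9]+", " ", x)                               # remove numbers
  x <- gsub("(?<![a-z])'|'(?![a-z])", " ", x, perl = TRUE)  # drop quotes, keep contractions
  x <- gsub("[^a-z' ]", " ", x)                             # remove remaining punctuation
  x <- gsub("\\s+", " ", x)                                 # collapse excess white space
  trimws(x)
}

clean_text("It's 10 o'clock -- isn't it?")  # "it's o'clock isn't it"
```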
Given the observations on the raw files, and the limitations of the hardware being used, the data files were randomly sampled to pull 10% of the content for further analysis. After sampling, the data files were cleaned of numbers and other non-alphabetic text, stripped of excess white space, and converted to all lower-case letters. Non-English words were left in at this point for evaluation. No profanity filter has been implemented yet; the current plan is to leave profanity in the vocabulary and prediction model, but not to display it as a predicted word in the final app.
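The 10% sample can be drawn along these lines (a sketch; the seed and the object names for the three raw files are assumptions), giving each line an independent 10% chance of being kept rather than an exact 10% count:

```r
# Sketch only: news, blogs, and twitter are assumed to hold the raw lines
set.seed(1234)
sample_lines <- function(lines, p = 0.10) {
  keep <- as.logical(rbinom(length(lines), size = 1, prob = p))
  lines[keep]
}

news_sample    <- sample_lines(news)
blogs_sample   <- sample_lines(blogs)
twitter_sample <- sample_lines(twitter)
```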
With the cleaned-up texts, the following summaries of words per entry and word length were created, using the qdap function word_count(), to judge similarities and differences between the three files:
The histograms for News and Blogs were limited in range because the words-per-line data have a long tail, with up to 6,000 words in a single line in one file.
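A sketch of how these histograms could be produced, reusing the (assumed) object names from the sketches above and mirroring the truncation just described:

```r
library(qdap)

news_clean <- clean_text(news_sample)   # cleaned 10% News sample (names assumed)
news_wpl   <- word_count(news_clean)    # words per line

hist(news_wpl[news_wpl <= 100],         # drop the extreme tail for a readable plot
     breaks = 50,
     main   = "News: words per line",
     xlab   = "Words per line")
```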
The News distribution looks the most interesting of the three. It is almost bi-modal, with peaks at 2-3 words/line and at 26-28 words/line. This may be because some lines are news headlines, while others are short blurbs or full articles on the headlines. Blogs shows a peak at 4-5 words/line and is right-skewed, also with a long tail. Twitter shows the densest concentration of words due to the 140-character limit the service imposes; it could almost be modeled with a uniform distribution from 1-20 words/line.
All three text files show unique characteristics in the distribution of the number of words per line, which may provide a means to distinguish between the types of input once the prediction model is in use.
The next step is to look at n-grams for each type of file. N-grams were created using the textcnt() function from the tau package. Non-English words and characters were then removed from the n-grams after generation; this order was chosen so that n-grams are not formed across the gaps left by removed non-English words, which reduces the number of low-occurrence n-grams. The following graphs show the Top 10 Unigrams, Bi-grams, Tri-grams, and Four-grams for each text file:
The first observation from the n-gram charts is that stop words dominate the texts. Less common words start appearing in the tri-grams and four-grams, but stop words still seem to form the majority of the sentence structure. It is also interesting that the Twitter unigram distribution, i.e. its vocabulary, is more even than the other two, similar to the pattern observed in the number of words/line. It also stands out how similar the top 10 lists are for the unigrams and bi-grams across all three files. The richer vocabulary in News and Blogs starts to show itself in the tri-gram and four-gram charts. A final observation is that the frequency of the top 10 n-grams goes down as the number of words per n-gram increases.
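For reference, a minimal sketch of how the n-gram counts behind these charts could be produced with textcnt(); the object names are assumptions, and the post-generation removal of non-English n-grams is omitted here:

```r
library(tau)

# method = "string" counts word n-grams rather than character n-grams
trigrams <- textcnt(news_clean, n = 3L, method = "string")
top10    <- head(sort(unclass(trigrams), decreasing = TRUE), 10)

barplot(top10, las = 2, cex.names = 0.7,
        main = "News: Top 10 Tri-grams", ylab = "Frequency")
```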
The objective of the exercises so far has been to build n-grams for predictive purposes: the user enters some text, and the last n words are matched against the n-gram tables to predict the next word. A common implementation of this is the ‘Stupid Backoff’ model, where the last three words of the input are matched to the first three words of a four-gram and the fourth word is used as the prediction. If no match is found, the model ‘backs off’ to the tri-grams and repeats the process, until it reaches the most common word in the unigram list. This strategy works well if every combination of words is in the tables, but that is resource intensive and not practical. Consequently, the model has to be able to predict words from sequences it has no information on.
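As a sketch of the lookup just described (a plain backoff without Stupid Backoff's score discounting), assume a hypothetical list ngram_tables where ngram_tables[[n]] is a frequency-sorted data frame built from the (n+1)-grams, with a prefix column of n words and a prediction column holding the word that follows, and unigram_top holds the most common unigram:

```r
# Sketch only: ngram_tables and unigram_top are hypothetical data structures
predict_next <- function(input, ngram_tables, unigram_top) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  for (n in rev(seq_along(ngram_tables))) {        # try the longest prefix first
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = " ")
    tbl    <- ngram_tables[[n]]
    hits   <- tbl[tbl$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$prediction[1]) # most frequent continuation
  }
  unigram_top                                      # no match at any order
}
```

A call like predict_next("thanks for the", ngram_tables, "the") would then return the most frequent word observed after that three-word prefix, backing off to shorter prefixes when no match is found.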
Based on the above restrictions, the following tasks need to be completed in order to build a text prediction model: