The capstone project for the Johns Hopkins University Data Science Specialization is focused on building a text predictor based on three text files. One file is a collection of blog entries, one of news entries, and one of Twitter ‘tweets’. There has been no pre-processing of the text files.
The following statistics for each file were generated using base R, the qdap and tau packages, and several custom functions to clean the data for comparison with the raw data:
| | News | Blogs | Twitter |
|---|---|---|---|
| Number of Lines | 1010243 | 899289 | 2360149 |
| Total Words (w/Repeats) | 34111041 | 37014114 | 30258769 |
| Total Words (Unique) | 742050 | 1028170 | 1040151 |
| Total Words (Cleaned and w/Repeats) | 26961723 | 29252659 | 23699092 |
| Total Words (Cleaned and Unique) | 218542 | 237457 | 290150 |
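As a rough illustration (not the actual custom functions used to build the table), the line and raw word counts can be approximated with base R alone; the file name below is an assumption, and splitting on whitespace is a cruder tokenization than the qdap and tau helpers:

```r
# Sketch only: file name assumed, whitespace splitting approximates the tokenization
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)

length(news)                              # number of lines
tokens <- unlist(strsplit(news, "\\s+"))  # word-like tokens
tokens <- tokens[tokens != ""]            # drop empty strings left by leading spaces
length(tokens)                            # total words (w/repeats)
length(unique(tolower(tokens)))           # total unique words (case-insensitive)
```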
Each of the text files contains hundreds of thousands of lines, with the Twitter file containing over 2 million. When broken down into words (or word-like structures), there are over 30 million in each file. Consolidating into unique words drastically reduces the count, by a factor of about 30 or more for each file. Cleaning the text by removing numbers, converting to lower case, and removing excess white space compresses the data further, to between 24 million and 29 million words and only 218,542 to 290,150 unique words. The one caveat with the removal of punctuation was the attempt to preserve contractions: the regular expressions used did preserve most of them, but there are still instances of broken contractions in the text. When the cleaned, unique words from all three files are combined, the entire vocabulary for the training set is 521,541 words. However, analysis has shown that some of these words are over 50 characters long and are most likely non-words that remain in the files.
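The cleaning steps described above might look roughly like the following sketch (an assumption, not the exact custom functions used); the second pattern tries to keep apostrophes that sit inside contractions:

```r
# Sketch of the cleaning steps: numbers, punctuation, and excess white space
# are removed, case is lowered, and in-word apostrophes are preserved
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[0-9]+", " ", x)                               # remove numbers
  x <- gsub("(?<![a-z])'|'(?![a-z])", " ", x, perl = TRUE)  # drop quotes, keep contractions
  x <- gsub("[^a-z' ]", " ", x)                             # remove remaining punctuation
  x <- gsub("\\s+", " ", x)                                 # collapse excess white space
  trimws(x)
}

clean_text("It's 10 o'clock -- isn't it?")  # "it's o'clock isn't it"
```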
Given the observations on the raw files, and the limitations of the hardware being used, the data files were randomly sampled to pull 10% of the content for further analysis. After sampling, the data files were cleaned of numbers and other non-alphabetic text, stripped of excess white space, and converted to all lower-case letters. Non-English words were left in at this point for evaluation. No profanity filter has been implemented yet; the current plan is to leave profanity in the vocabulary and prediction model, but not to display it as a predicted word in the final app.
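The 10% sample can be drawn along these lines (a sketch; the seed and the object names for the three raw files are assumptions), giving each line an independent 10% chance of being kept rather than an exact 10% count:

```r
# Sketch only: news, blogs, and twitter are assumed to hold the raw lines
set.seed(1234)
sample_lines <- function(lines, p = 0.10) {
  keep <- as.logical(rbinom(length(lines), size = 1, prob = p))
  lines[keep]
}

news_sample    <- sample_lines(news)
blogs_sample   <- sample_lines(blogs)
twitter_sample <- sample_lines(twitter)
```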
With the cleaned-up texts, the following summaries of words per entry and word length were created, using the qdap function word_count(), to judge similarities and differences between the three files:
The histograms for News and Blogs were limited in range because the words-per-line data have a long tail, with up to 6,000 words in a single line in one file.
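A sketch of how these histograms could be produced, reusing the (assumed) object names from the sketches above and mirroring the truncation just described:

```r
library(qdap)

news_clean <- clean_text(news_sample)   # cleaned 10% News sample (names assumed)
news_wpl   <- word_count(news_clean)    # words per line

hist(news_wpl[news_wpl <= 100],         # drop the extreme tail for a readable plot
     breaks = 50,
     main   = "News: words per line",
     xlab   = "Words per line")
```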
The News distribution looks the most interesting of the three. It is almost bi-modal, with peaks at 2-3 words/line and at 26-28 words/line. This may be because some lines are news headlines, while others are short blurbs or full articles on the headlines. Blogs shows a peak at 4-5 words/line and is right-skewed, also with a long tail. Twitter shows the densest concentration of words due to the 140-character limit the service imposes; it could almost be modeled with a uniform distribution from 1-20 words/line.
All three text files show unique characteristics in the distribution of the number of words per line, which may provide a means to distinguish between the types of input once the prediction model is in use.
The next step is to look at n-grams for each type of file. N-grams were created using the textcnt() function from the tau package. Non-English words and characters were then removed from the n-grams after generation; this order was chosen so that n-grams are not formed across the gaps left by removed non-English words, which reduces the number of low-occurrence n-grams. The following graphs show the Top 10 Unigrams, Bi-grams, Tri-grams, and Four-grams for each text file:
The first observation from the n-gram charts is that stop words dominate the texts. Less common words start appearing in the tri-grams and four-grams, but stop words still seem to form the majority of the sentence structure. It is also interesting that the Twitter unigram distribution, i.e. its vocabulary, is more even than the other two, similar to the pattern observed in the number of words/line. It also stands out how similar the top 10 lists are for the unigrams and bi-grams across all three files. The richer vocabulary in News and Blogs starts to show itself in the tri-gram and four-gram charts. A final observation is that the frequency of the top 10 n-grams goes down as the number of words per n-gram increases.
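For reference, a minimal sketch of how the n-gram counts behind these charts could be produced with textcnt(); the object names are assumptions, and the post-generation removal of non-English n-grams is omitted here:

```r
library(tau)

# method = "string" counts word n-grams rather than character n-grams
trigrams <- textcnt(news_clean, n = 3L, method = "string")
top10    <- head(sort(unclass(trigrams), decreasing = TRUE), 10)

barplot(top10, las = 2, cex.names = 0.7,
        main = "News: Top 10 Tri-grams", ylab = "Frequency")
```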
The objective of the exercises so far has been to build n-grams for predictive purposes: the user enters some text, and the last n words are matched against the n-gram tables to predict the next word. A common implementation of this is the ‘Stupid Backoff’ model, where the last three words of the input are matched to the first three words of a four-gram and the fourth word is used as the prediction. If no match is found, the model ‘backs off’ to the tri-grams and repeats the process, until it reaches the most common word in the unigram list. This strategy works well if every combination of words is in the tables, but that is resource intensive and not practical. Consequently, the model has to be able to predict words from sequences it has no information on.
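As a sketch of the lookup just described (a plain backoff without Stupid Backoff's score discounting), assume a hypothetical list ngram_tables where ngram_tables[[n]] is a frequency-sorted data frame built from the (n+1)-grams, with a prefix column of n words and a prediction column holding the word that follows, and unigram_top holds the most common unigram:

```r
# Sketch only: ngram_tables and unigram_top are hypothetical data structures
predict_next <- function(input, ngram_tables, unigram_top) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  for (n in rev(seq_along(ngram_tables))) {        # try the longest prefix first
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = " ")
    tbl    <- ngram_tables[[n]]
    hits   <- tbl[tbl$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$prediction[1]) # most frequent continuation
  }
  unigram_top                                      # no match at any order
}
```

A call like predict_next("thanks for the", ngram_tables, "the") would then return the most frequent word observed after that three-word prefix, backing off to shorter prefixes when no match is found.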
Based on the above restrictions, the following tasks need to be completed in order to build a text prediction model: