Introduction

This is a milestone report on the progress of the JHU Data Science Capstone project. The goal of the project is to build a predictive model that, given the previous one or more words, predicts the most likely next word.

This report will give a summary of the features of the training data, as well as an overview of the plan for model fitting.

Data Summary

The training corpus for our model consists of documents from three separate sources: blog entries, news articles, and Twitter status updates (tweets). There are about 900 thousand blog entries, 1 million news articles, and 2.4 million tweets. The number of documents from each of these sources is inversely proportional to their average length – in other words, the overall amount of text from each source is approximately equal. Each source has 30 to 40 million words, and about 150 to 200 million characters.
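A minimal sketch of how these counts can be reproduced in R is shown below. The file names and locations are assumptions (adjust them to wherever your copy of the data lives), and splitting on whitespace gives only a rough word count.

    # Count documents (lines), words, and characters for each source.
    # File names are assumed; adjust to your local paths.
    files <- c(blogs   = "en_US.blogs.txt",
               news    = "en_US.news.txt",
               twitter = "en_US.twitter.txt")

    summarise_source <- function(path) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      c(documents  = length(lines),
        words      = sum(lengths(strsplit(lines, "\\s+"))),
        characters = sum(nchar(lines)))
    }

    sapply(files, summarise_source)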

The plot below shows histograms of the lengths of documents from each data source. The x-axis is on a logarithmic scale, so each bin represents documents that are about twice as long as documents in the bin to its left.

We can clearly see that tweets are shorter than blog posts or news articles, and that there is a very sharp limit on tweet length at 140 characters, exactly as we would expect.
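Continuing with the files vector from the sketch above, a plot along these lines could be produced with ggplot2; the bin count and faceting are illustrative choices, not the exact settings used for the figure.

    library(ggplot2)

    # Gather per-document lengths (in characters) for each source.
    doc_lengths <- do.call(rbind, lapply(names(files), function(src) {
      data.frame(source = src,
                 chars  = nchar(readLines(files[[src]], encoding = "UTF-8", skipNul = TRUE)))
    }))

    # Log-scale x-axis: each bin covers roughly a constant multiplicative range.
    ggplot(doc_lengths, aes(x = chars)) +
      geom_histogram(bins = 40) +
      scale_x_log10() +
      facet_wrap(~ source, ncol = 1) +
      labs(x = "Document length (characters, log scale)", y = "Number of documents")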

Distribution of Words

The plot below shows the cumulative amount of text represented by the top k words. This plot can help us answer questions like, “How many words can account for 50% of text?”, and “How much of the text can the top 100 words account for?”. Again, the x-axis is on a logarithmic scale, so each heavy vertical line represents 10 times more words than the preceding line.

We can see that Twitter has less text explained by the top 10 words, and that news articles have less text explained by the top 100 to 1,000 words. However, all of the text sources have reasonably compact vocabularies. In each case, the top 10 thousand words cover about 90% of the text. This means we can focus our model on the top 10 thousand words without significantly losing functionality.
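A sketch of how the cumulative-coverage curve can be computed for one source, using dplyr and tidytext; the file name and the coverage() helper are illustrative, not the exact code behind the plot.

    library(dplyr)
    library(tidytext)

    # Word frequencies and cumulative share of text covered by the top-ranked words.
    coverage <- function(path) {
      data.frame(text = readLines(path, encoding = "UTF-8", skipNul = TRUE),
                 stringsAsFactors = FALSE) %>%
        unnest_tokens(word, text) %>%
        count(word, sort = TRUE) %>%
        mutate(rank      = row_number(),
               cum_share = cumsum(n) / sum(n))
    }

    cov_twitter <- coverage("en_US.twitter.txt")

    # How many words are needed to cover 50% and 90% of the text?
    min(which(cov_twitter$cum_share >= 0.5))
    min(which(cov_twitter$cum_share >= 0.9))

The head of this same frequency table (for example, head(cov_twitter$word, 10)) is also where the top-word lists in the next section come from.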

Common Words

What do these top words look like? The top 10 words for each source are shown below.

rank   News   Blog   Twitter
   1   the    the    the
   2   to     and    i
   3   a      to     to
   4   and    i      a
   5   of     a      you
   6   in     of     and
   7   s      in     for
   8   that   it     it
   9   for    that   in
  10   it     is     of

The list is not very informative, as it is heavy on unimportant filler words like a, the, and of. If our goal were sentiment analysis, we would remove these stop words from consideration. For predictive text, however, we absolutely want to be able to predict them!

Let’s look a little further down the list, at positions 101 through 110 for each source.

rank   News     Blog     Twitter
 101   how      ve       am
 102   before   way      night
 103   10       after    come
 104   make     little   did
 105   day      love     thank
 106   5        could    only
 107   where    go       here
 108   county   two      well
 109   your     life     why
 110   says     many     them

Here, there’s a little more differentiation. The news includes some numerals, while the blogs include words like life and love, and the tweets include more conversational words like night, thank, and here.

Finally, let’s look at some less common words that are still worth modeling. Here are positions 5,001 through 5,010 for each source.

rank   News          Blog          Twitter
5001   calendar      abroad        lobby
5002   newspapers    rhythm        columbia
5003   secondary     ward          runner
5004   sustainable   stroke        oak
5005   membership    procedure     vagina
5006   vista         slave         feat
5007   dough         dull          49
5008   understood    consistency   hannah
5009   pete          corporation   cents
5010   mrs           ethnic        backup

Here, the sources look more distinct. The news has the most formal vocabulary, the tweets have the least formal, and the blogs fall in the middle. The Twitter list includes a word that is not often used in polite conversation between acquaintances, but it is perhaps the least offensive and most accurate term for a part of the female anatomy, and therefore not worth removing from the corpus.

Analysis Plan

My plan for model building is as follows. First, I will scrub the three corpora of words that are not among the 10 thousand most common words. I will replace these rare words with a generic "NOP" symbol, since removing them entirely would distort grammar and sentence structure, and therefore model performance as a whole. Next, I will perform dimensionality reduction via GloVe to reduce the input to something manageable (say, representing each word with 128 numbers instead of treating it as one of ten thousand unique possibilities). Finally, I will train a GBM or deep neural network on the reduced data and benchmark its performance.
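As a concrete illustration of the first step, here is a minimal sketch of the rare-word replacement. The vocabulary built from the coverage() sketch above, the exact "NOP" placeholder string, and the simple whitespace tokenization are all assumptions; the real pipeline may tokenize differently and will build the vocabulary across all three sources.

    # Keep only the 10,000 most common words; map everything else to a placeholder.
    # `cov_twitter` comes from the coverage() sketch above.
    vocab <- head(cov_twitter$word, 10000)

    replace_rare <- function(line, vocab, placeholder = "NOP") {
      tokens <- unlist(strsplit(tolower(line), "\\s+"))
      tokens[!(tokens %in% vocab)] <- placeholder
      paste(tokens, collapse = " ")
    }

    replace_rare("the quixotic zephyr blew softly", vocab)
    # e.g. "the NOP NOP blew softly", depending on which words make the cut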