This is a milestone report on the progress of the JHU Data Science Capstone project. The goal of the project is to build a predictive model that, given the previous one or more words, predicts the most likely next word.
This report will give a summary of the features of the training data, as well as an overview of the plan for model fitting.
The training corpus for our model consists of documents from three separate sources: blog entries, news articles, and Twitter status updates (tweets). There are about 900 thousand blog entries, 1 million news articles, and 2.4 million tweets. The number of documents from each of these sources is inversely proportional to their average length – in other words, the overall amount of text from each source is approximately equal. Each source has 30 to 40 million words, and about 150 to 200 million characters.
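As a rough sketch, counts like these can be reproduced with base R. The file paths below reflect my local layout and the one-document-per-line format of the raw files, both of which are assumptions rather than part of the report itself:

```r
# Sketch of the corpus summary; assumes one document per line in each file
# and that the files sit at these (local, assumed) paths.
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           news    = "final/en_US/en_US.news.txt",
           twitter = "final/en_US/en_US.twitter.txt")

summarise_source <- function(path) {
  docs <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(documents  = length(docs),
    words      = sum(lengths(strsplit(docs, "\\s+"))),
    characters = sum(nchar(docs)))
}

sapply(files, summarise_source)   # one column per source
```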
The plot below shows histograms of the lengths of documents from each data source. The x-axis is on a logarithmic scale, so each bin represents documents that are about twice as long as documents in the bin to its left.
We can clearly see that tweets are shorter than blog posts or news articles, and that tweet length is sharply cut off at 140 characters, exactly as we would expect.
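A minimal sketch of how such a histogram can be produced for one source, assuming `docs` holds that source's documents as a character vector:

```r
# Log-scale histogram of document lengths; `docs` is assumed to be a
# character vector with one document per element.
doc_len <- nchar(docs)
doc_len <- doc_len[doc_len > 0]                                   # drop empty lines
brks    <- seq(0, max(log10(doc_len)) + log10(2), by = log10(2))  # each bin doubles in length
hist(log10(doc_len), breaks = brks,
     main = "Document lengths", xlab = "log10(characters)")
```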
The plot below shows the cumulative amount of text represented by the top k words. This plot can help us answer questions like, “How many words account for 50% of the text?” and “How much of the text do the top 100 words account for?”. Again, the x-axis is on a logarithmic scale, so each heavy vertical line represents 10 times more words than the preceding line.
We can see that Twitter has less text explained by the top 10 words, and that news articles have less text explained by the top 100 to 1,000 words. However, all of the text sources have reasonably compact vocabularies. In each case, the top 10 thousand words cover about 90% of the text. This means we can focus our model on the top 10 thousand words without significantly losing functionality.
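The coverage numbers above come from a sorted word-frequency table. A sketch of the computation for one source is below; the crude tokenizer, which splits on anything that is not a letter, is an assumption of the sketch (it also explains how a bare “s” from contractions like “it’s” ends up in the top-10 table that follows):

```r
# Word frequencies and cumulative coverage for one source; `docs` is assumed
# to be that source's character vector of documents.
tokens <- unlist(strsplit(tolower(docs), "[^a-z]+"))
tokens <- tokens[tokens != ""]

freq     <- sort(table(tokens), decreasing = TRUE)   # words, most frequent first
coverage <- cumsum(freq) / sum(freq)                 # fraction of text covered by top k words

coverage[c(10, 100, 1000, 10000)]                    # e.g. coverage by the top 10 thousand words
plot(seq_along(coverage), coverage, type = "l", log = "x",
     xlab = "top k words (log scale)", ylab = "fraction of text covered")
```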
What do these top words look like? The top 10 words for each source are shown below.
| rank | News | Blog | Twitter |
|---|---|---|---|
| 1 | the | the | the |
| 2 | to | and | i |
| 3 | a | to | to |
| 4 | and | i | a |
| 5 | of | a | you |
| 6 | in | of | and |
| 7 | s | in | for |
| 8 | that | it | it |
| 9 | for | that | in |
| 10 | it | is | of |
The list is not very informative, as it is heavy on unimportant filler words like a, the, and of. If our goal were text sentiment analysis, we would remove these words from consideration. However, for the purposes of predictive text, we absolutely want to be able to predict them!
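For reference, the ranked slices shown in these tables can be read straight off the sorted frequency table from the sketch above (assuming `freq` has been computed for each source):

```r
# Ranked word slices from the sorted frequency table `freq` (see sketch above).
names(freq)[1:10]        # the top-10 column shown above
names(freq)[101:110]     # positions 101 through 110, shown next
names(freq)[5001:5010]   # positions 5,001 through 5,010, shown later
```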
Let’s look a little further down the list, at positions 101 through 110 for each source.
| rank | News | Blog | Twitter |
|---|---|---|---|
| 101 | how | ve | am |
| 102 | before | way | night |
| 103 | 10 | after | come |
| 104 | make | little | did |
| 105 | day | love | thank |
| 106 | 5 | could | only |
| 107 | where | go | here |
| 108 | county | two | well |
| 109 | your | life | why |
| 110 | says | many | them |
Here, there’s a little more differentiation. The news includes some numerals, while the blogs include words like life and love, and the tweets include more conversational words like night, thank, and here.
Finally, let’s look at some less common words that are still worth modeling. Here are positions 5,001 through 5,010 for each source.
| rank | News | Blog | Twitter |
|---|---|---|---|
| 5001 | calendar | abroad | lobby |
| 5002 | newspapers | rhythm | columbia |
| 5003 | secondary | ward | runner |
| 5004 | sustainable | stroke | oak |
| 5005 | membership | procedure | vagina |
| 5006 | vista | slave | feat |
| 5007 | dough | dull | 49 |
| 5008 | understood | consistency | hannah |
| 5009 | pete | corporation | cents |
| 5010 | mrs | ethnic | backup |
Here, the sources look more distinct. The news has the most formal vocabulary, the tweets have the least formal vocabulary, and the blogs fall in the middle. The Twitter list includes a word that is not often used in polite conversation between acquaintances, but it is perhaps the least offensive and most accurate term for a part of the female anatomy, and therefore not worth removing from the corpus.
My plan for model building is as follows. First, I will scrub the three corpora of words that are not among the 10 thousand most common. I will replace the rare words with a generic “NOP” symbol, since removing them entirely would affect grammar and sentence structure, and therefore model performance as a whole. Next, I will use GloVe to reduce the dimensionality of the input to something manageable (say, representing each word with 128 numbers instead of treating it as one of ten thousand unique possibilities). Finally, I will train a GBM or deep neural network on the reduced data and benchmark its performance.
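As a first sketch of the scrubbing step only (the GloVe and model-fitting stages are not sketched here), assuming `freq` and `docs` from the earlier sketches:

```r
# Replace out-of-vocabulary words with the placeholder "NOP", keeping each
# rare word's slot in the sentence so structure is preserved.
# `freq` (sorted word frequencies) and `docs` are assumed from above.
vocab <- names(freq)[1:10000]

scrub <- function(doc) {
  words <- strsplit(tolower(doc), "\\s+")[[1]]
  words[!(words %in% vocab)] <- "NOP"    # rare word: keep its position, drop its identity
  paste(words, collapse = " ")
}

scrubbed <- vapply(docs, scrub, character(1), USE.NAMES = FALSE)
head(scrubbed, 3)
```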