This is a milestone report on the progress of the JHU Data Science Capstone project. The goal of the project is to build a predictive model that, given the previous one or more words, predicts the most likely next word.
This report will give a summary of the features of the training data, as well as an overview of the plan for model fitting.
The training corpus for our model consists of documents from three separate sources: blog entries, news articles, and Twitter status updates (tweets). There are about 900 thousand blog entries, 1 million news articles, and 2.4 million tweets. The number of documents from each of these sources is inversely proportional to their average length – in other words, the overall amount of text from each source is approximately equal. Each source has 30 to 40 million words, and about 150 to 200 million characters.
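As a rough sketch, counts like these can be reproduced with base R. The file paths below reflect my local layout and the one-document-per-line format of the raw files, both of which are assumptions rather than part of the report itself:

```r
# Sketch of the corpus summary; assumes one document per line in each file
# and that the files sit at these (local, assumed) paths.
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           news    = "final/en_US/en_US.news.txt",
           twitter = "final/en_US/en_US.twitter.txt")

summarise_source <- function(path) {
  docs <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(documents  = length(docs),
    words      = sum(lengths(strsplit(docs, "\\s+"))),
    characters = sum(nchar(docs)))
}

sapply(files, summarise_source)   # one column per source
```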
The plot below shows histograms of the lengths of documents from each data source. The x-axis is on a logarithmic scale, so each bin represents documents that are about twice as long as documents in the bin to its left.
We can clearly see that tweets are shorter than blog posts or news articles, and that tweet length is sharply cut off at 140 characters, exactly as we would expect.
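A minimal sketch of how such a histogram can be produced for one source, assuming `docs` holds that source's documents as a character vector:

```r
# Log-scale histogram of document lengths; `docs` is assumed to be a
# character vector with one document per element.
doc_len <- nchar(docs)
doc_len <- doc_len[doc_len > 0]                                   # drop empty lines
brks    <- seq(0, max(log10(doc_len)) + log10(2), by = log10(2))  # each bin doubles in length
hist(log10(doc_len), breaks = brks,
     main = "Document lengths", xlab = "log10(characters)")
```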
The plot below shows the cumulative amount of text represented by the top k words. This plot can help us answer questions like, “How many words account for 50% of the text?” and “How much of the text do the top 100 words account for?”. Again, the x-axis is on a logarithmic scale, so each heavy vertical line represents 10 times more words than the preceding line.
We can see that Twitter has less text explained by the top 10 words, and that news articles have less text explained by the top 100 to 1,000 words. However, all of the text sources have reasonably compact vocabularies. In each case, the top 10 thousand words cover about 90% of the text. This means we can focus our model on the top 10 thousand words without significantly losing functionality.
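The coverage numbers above come from a sorted word-frequency table. A sketch of the computation for one source is below; the crude tokenizer, which splits on anything that is not a letter, is an assumption of the sketch (it also explains how a bare “s” from contractions like “it’s” ends up in the top-10 table that follows):

```r
# Word frequencies and cumulative coverage for one source; `docs` is assumed
# to be that source's character vector of documents.
tokens <- unlist(strsplit(tolower(docs), "[^a-z]+"))
tokens <- tokens[tokens != ""]

freq     <- sort(table(tokens), decreasing = TRUE)   # words, most frequent first
coverage <- cumsum(freq) / sum(freq)                 # fraction of text covered by top k words

coverage[c(10, 100, 1000, 10000)]                    # e.g. coverage by the top 10 thousand words
plot(seq_along(coverage), coverage, type = "l", log = "x",
     xlab = "top k words (log scale)", ylab = "fraction of text covered")
```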
What do these top words look like? The top 10 words for each source are shown below.
| rank | News | Blog | Twitter |
|---|---|---|---|
| 1 | the | the | the |
| 2 | to | and | i |
| 3 | a | to | to |
| 4 | and | i | a |
| 5 | of | a | you |
| 6 | in | of | and |
| 7 | s | in | for |
| 8 | that | it | it |
| 9 | for | that | in |
| 10 | it | is | of |
The list is not very informative, as it is heavy on unimportant filler words like a, the, and of. If our goal were text sentiment analysis, we would remove these words from consideration. However, for the purposes of predictive text, we absolutely want to be able to predict them!
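For reference, the ranked slices shown in these tables can be read straight off the sorted frequency table from the sketch above (assuming `freq` has been computed for each source):

```r
# Ranked word slices from the sorted frequency table `freq` (see sketch above).
names(freq)[1:10]        # the top-10 column shown above
names(freq)[101:110]     # positions 101 through 110, shown next
names(freq)[5001:5010]   # positions 5,001 through 5,010, shown later
```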
Let’s look a little further down the list, at positions 101 through 110 for each source.
| rank | News | Blog | Twitter |
|---|---|---|---|
| 101 | how | ve | am |
| 102 | before | way | night |
| 103 | 10 | after | come |
| 104 | make | little | did |
| 105 | day | love | thank |
| 106 | 5 | could | only |
| 107 | where | go | here |
| 108 | county | two | well |
| 109 | your | life | why |
| 110 | says | many | them |
Here, there’s a little more differentiation. The news includes some numerals, while the blogs include words like life and love, and the tweets include more conversational words like night, thank, and here.
Finally, let’s look at some less common words that are still worth modeling. Here are positions 5,001 through 5,010 for each source.
| rank | News | Blog | Twitter |
|---|---|---|---|
| 5001 | calendar | abroad | lobby |
| 5002 | newspapers | rhythm | columbia |
| 5003 | secondary | ward | runner |
| 5004 | sustainable | stroke | oak |
| 5005 | membership | procedure | vagina |
| 5006 | vista | slave | feat |
| 5007 | dough | dull | 49 |
| 5008 | understood | consistency | hannah |
| 5009 | pete | corporation | cents |
| 5010 | mrs | ethnic | backup |
Here, the sources look more distinct. The news has the most formal vocabulary, the tweets have the least formal vocabulary, and the blogs fall in the middle. The Twitter list includes a word that is not often used in polite conversation between acquaintances, but it is perhaps the least offensive and most accurate term for a part of the female anatomy, and therefore not worth removing from the corpus.
My plan for model building is as follows. First, I will scrub the three corpora of words that are not among the 10 thousand most common. I will replace the rare words with a generic “NOP” symbol, since removing them entirely would affect grammar and sentence structure, and therefore model performance as a whole. Next, I will use GloVe to reduce the dimensionality of the input to something manageable (say, representing each word with 128 numbers instead of treating it as one of ten thousand unique possibilities). Finally, I will train a GBM or deep neural network on the reduced data and benchmark its performance.
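As a first sketch of the scrubbing step only (the GloVe and model-fitting stages are not sketched here), assuming `freq` and `docs` from the earlier sketches:

```r
# Replace out-of-vocabulary words with the placeholder "NOP", keeping each
# rare word's slot in the sentence so structure is preserved.
# `freq` (sorted word frequencies) and `docs` are assumed from above.
vocab <- names(freq)[1:10000]

scrub <- function(doc) {
  words <- strsplit(tolower(doc), "\\s+")[[1]]
  words[!(words %in% vocab)] <- "NOP"    # rare word: keep its position, drop its identity
  paste(words, collapse = " ")
}

scrubbed <- vapply(docs, scrub, character(1), USE.NAMES = FALSE)
head(scrubbed, 3)
```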