Overview

This article is milestone report for the Coursera Data Science Capstone Project. It is an Exploratory Analysis of data extracted from internet sources of news, blogs and twitter. The report shows some interesting finding on the data and the implications of using the data to predict the next word in a stream of text input

Basic Statistics

The table below shows a brief analysis of the input data. The file sizes and numbers of rows are based on the full data. The number of words are estimates for the total files based on a sample of 10000 lines from each file. To tokenise the files the data was washed to lowercase all characters and split into words based on groups of letters or an apostrophe.

Basic Stats
SOURCE FILE SIZE LINES WORDS UNIQUE
1 BLOG en_US.blogs.txt 210160014 899288 37602739 29432
2 NEWS en_US.news.txt 205811889 77259 2633095 29055
3 TWITTER en_US.twitter.txt 167105338 2360148 29834631 14983

Word Usage Patterns

The aim is to predict the next word of text from the previous n words typed in. The application would offer a limited number of shortcuts with different words to choose from and if the shortcut is not available then the user will just have to type in the whole word. It stands to reason that there is no point in offering rare words as they will never be likely enough to be listed in the shortcuts. So in order to decide how many words will be on the offer list we will need to find a sweet spot which provides a good coverage of the words without being too long a list.

From the graph it appears the sweet spot is around 5000 words, providing more than 80% coverage. The table below shows the word in various ranked positions as an example of the words that would be excluded at various cut-offs.

position blog news twitter TOTAL
the 1 20853 19652 4080 44585
day 50 609 307 406 1322
many 100 444 285 74 803
list 500 76 76 25 177
built 1000 34 52 6 92
missouri 2000 5 39 1 45
software 3000 14 12 3 29
pitcher 4000 2 15 3 20
notion 5000 9 6 0 15
terrific 6000 6 4 2 12
beijing 7000 4 5 0 9
plaza 8000 1 6 1 8
timely 9000 4 1 2 7
sinned 10000 3 1 2 6

n-Grams

The above analysis shows that we will need a vocabulary of about 5000 words that could be predicted. These are 1-grams. Not very useful, as a 1-gram prediction algorithm will just predict the top k (say 3) most common words ie (the and that). We want the predictions to change as the user types in words. So the predicted word will need to take into account the previous word (2-grams) or the 2 previous words (3-grams) and so on. But this space gets huge. Even if the vocabulary is limited to 500 words, that makes a potential 25000000 2-grams, 525000000000 3-grams and so on. In practice the 2-gram list would need to modify the 1-gram popularity list so a word like ‘carpe’ always implies the next word is a ‘diem’. The number of additional 2-grams will need to be quite small and only contain those that significantly change the probabilities associated with the next word, likewise 3 grams and so on. I expect tuning these factors will be the most difficult task of the prediction algorithm.

Multiple predictions

The task here is not to predict the next word excatly, but to predict a small set of words that include the next word. This means we need to score and prioritise the different n-gram predictors in order to learn the best model.

Plan of action

My plan of action is to: - Define the vocabulary and 1-gram predictor number of words (n1)
- find all these words in the training data and build a list of 2-gram predictors of these words
- prune the 2-gram predictors to a reasonable number (n2) and a significant improvement on predictive power over the 1-grams (s2)
- find 3-gram predictors with number n3 and significance s3
- maybe stop at this point, maybe go on to 4-grams
- using a validation sample, try to determine the optimal values for n1,n2,n3,…, s2, s3,… as well as which model to use and how to fall back to the lower model
- once we have the optimal values fit the model on the largest training set my computer can handle.