This article is milestone report for the Coursera Data Science Capstone Project. It is an Exploratory Analysis of data extracted from internet sources of news, blogs and twitter. The report shows some interesting finding on the data and the implications of using the data to predict the next word in a stream of text input
The table below shows a brief analysis of the input data. The file sizes and numbers of rows are based on the full data. The number of words are estimates for the total files based on a sample of 10000 lines from each file. To tokenise the files the data was washed to lowercase all characters and split into words based on groups of letters or an apostrophe.
| SOURCE | FILE | SIZE | LINES | WORDS | UNIQUE | |
|---|---|---|---|---|---|---|
| 1 | BLOG | en_US.blogs.txt | 210160014 | 899288 | 37602739 | 29432 |
| 2 | NEWS | en_US.news.txt | 205811889 | 77259 | 2633095 | 29055 |
| 3 | en_US.twitter.txt | 167105338 | 2360148 | 29834631 | 14983 |
The aim is to predict the next word of text from the previous n words typed in. The application would offer a limited number of shortcuts with different words to choose from and if the shortcut is not available then the user will just have to type in the whole word. It stands to reason that there is no point in offering rare words as they will never be likely enough to be listed in the shortcuts. So in order to decide how many words will be on the offer list we will need to find a sweet spot which provides a good coverage of the words without being too long a list.
From the graph it appears the sweet spot is around 5000 words, providing more than 80% coverage. The table below shows the word in various ranked positions as an example of the words that would be excluded at various cut-offs.
| position | blog | news | TOTAL | ||
|---|---|---|---|---|---|
| the | 1 | 20853 | 19652 | 4080 | 44585 |
| day | 50 | 609 | 307 | 406 | 1322 |
| many | 100 | 444 | 285 | 74 | 803 |
| list | 500 | 76 | 76 | 25 | 177 |
| built | 1000 | 34 | 52 | 6 | 92 |
| missouri | 2000 | 5 | 39 | 1 | 45 |
| software | 3000 | 14 | 12 | 3 | 29 |
| pitcher | 4000 | 2 | 15 | 3 | 20 |
| notion | 5000 | 9 | 6 | 0 | 15 |
| terrific | 6000 | 6 | 4 | 2 | 12 |
| beijing | 7000 | 4 | 5 | 0 | 9 |
| plaza | 8000 | 1 | 6 | 1 | 8 |
| timely | 9000 | 4 | 1 | 2 | 7 |
| sinned | 10000 | 3 | 1 | 2 | 6 |
The above analysis shows that we will need a vocabulary of about 5000 words that could be predicted. These are 1-grams. Not very useful, as a 1-gram prediction algorithm will just predict the top k (say 3) most common words ie (the and that). We want the predictions to change as the user types in words. So the predicted word will need to take into account the previous word (2-grams) or the 2 previous words (3-grams) and so on. But this space gets huge. Even if the vocabulary is limited to 500 words, that makes a potential 25000000 2-grams, 525000000000 3-grams and so on. In practice the 2-gram list would need to modify the 1-gram popularity list so a word like ‘carpe’ always implies the next word is a ‘diem’. The number of additional 2-grams will need to be quite small and only contain those that significantly change the probabilities associated with the next word, likewise 3 grams and so on. I expect tuning these factors will be the most difficult task of the prediction algorithm.
The task here is not to predict the next word excatly, but to predict a small set of words that include the next word. This means we need to score and prioritise the different n-gram predictors in order to learn the best model.
My plan of action is to: - Define the vocabulary and 1-gram predictor number of words (n1)
- find all these words in the training data and build a list of 2-gram predictors of these words
- prune the 2-gram predictors to a reasonable number (n2) and a significant improvement on predictive power over the 1-grams (s2)
- find 3-gram predictors with number n3 and significance s3
- maybe stop at this point, maybe go on to 4-grams
- using a validation sample, try to determine the optimal values for n1,n2,n3,…, s2, s3,… as well as which model to use and how to fall back to the lower model
- once we have the optimal values fit the model on the largest training set my computer can handle.