Milestone Report on Swiftkey Data

Overview

This article is milestone report for the Coursera Data Science Capstone Project. It is an Exploratory Analysis of data extracted from internet sources of news, blogs and twitter. The report shows some interesting finding on the data and the implications of using the data to predict the next word in a stream of text input

Basic Statistics

The table below shows a brief analysis of the input data. The file sizes and numbers of rows are based on the full data. The number of words are estimates for the total files based on a sample of 10000 lines from each file. To tokenise the files the data was washed to lowercase all characters and split into words based on groups of letters or an apostrophe.

Basic Stats
	SOURCE	FILE	SIZE	LINES	WORDS	UNIQUE
1	BLOG	en_US.blogs.txt	210160014	899288	37602739	29432
2	NEWS	en_US.news.txt	205811889	77259	2633095	29055
3	TWITTER	en_US.twitter.txt	167105338	2360148	29834631	14983

Word Usage Patterns

The aim is to predict the next word of text from the previous n words typed in. The application would offer a limited number of shortcuts with different words to choose from and if the shortcut is not available then the user will just have to type in the whole word. It stands to reason that there is no point in offering rare words as they will never be likely enough to be listed in the shortcuts. So in order to decide how many words will be on the offer list we will need to find a sweet spot which provides a good coverage of the words without being too long a list.

From the graph it appears the sweet spot is around 5000 words, providing more than 80% coverage. The table below shows the word in various ranked positions as an example of the words that would be excluded at various cut-offs.

	position	blog	news	twitter	TOTAL
the	1	20853	19652	4080	44585
day	50	609	307	406	1322
many	100	444	285	74	803
list	500	76	76	25	177
built	1000	34	52	6	92
missouri	2000	5	39	1	45
software	3000	14	12	3	29
pitcher	4000	2	15	3	20
notion	5000	9	6	0	15
terrific	6000	6	4	2	12
beijing	7000	4	5	0	9
plaza	8000	1	6	1	8
timely	9000	4	1	2	7
sinned	10000	3	1	2	6

n-Grams

The above analysis shows that we will need a vocabulary of about 5000 words that could be predicted. These are 1-grams. Not very useful, as a 1-gram prediction algorithm will just predict the top k (say 3) most common words ie (the and that). We want the predictions to change as the user types in words. So the predicted word will need to take into account the previous word (2-grams) or the 2 previous words (3-grams) and so on. But this space gets huge. Even if the vocabulary is limited to 500 words, that makes a potential 25000000 2-grams, 525000000000 3-grams and so on. In practice the 2-gram list would need to modify the 1-gram popularity list so a word like ‘carpe’ always implies the next word is a ‘diem’. The number of additional 2-grams will need to be quite small and only contain those that significantly change the probabilities associated with the next word, likewise 3 grams and so on. I expect tuning these factors will be the most difficult task of the prediction algorithm.

Multiple predictions

The task here is not to predict the next word excatly, but to predict a small set of words that include the next word. This means we need to score and prioritise the different n-gram predictors in order to learn the best model.

Plan of action

My plan of action is to: - Define the vocabulary and 1-gram predictor number of words (n1)
- find all these words in the training data and build a list of 2-gram predictors of these words
- prune the 2-gram predictors to a reasonable number (n2) and a significant improvement on predictive power over the 1-grams (s2)
- find 3-gram predictors with number n3 and significance s3
- maybe stop at this point, maybe go on to 4-grams
- using a validation sample, try to determine the optimal values for n1,n2,n3,…, s2, s3,… as well as which model to use and how to fall back to the lower model
- once we have the optimal values fit the model on the largest training set my computer can handle.