The capstone project of the Coursera Data Science Specialization consists of creating a word prediction data product. The basis is a corpus of text in four languages (English, German, Russian and Finnish) taken from blogs, news and Twitter.

I limited my work to the English and German data, since I do not speak Russian or Finnish.

The files each contain between roughly 245k and 2.4m lines of text:

$ wc final/{en_US,de_DE}/*
   899288  37334114 210160014 final/en_US/en_US.blogs.txt
  1010242  34365936 205811889 final/en_US/en_US.news.txt
  2360148  30359852 167105338 final/en_US/en_US.twitter.txt
   371440  12652984  85459666 final/de_DE/de_DE.blogs.txt
   244743  13216346  95591959 final/de_DE/de_DE.news.txt
   947774  11802974  75578341 final/de_DE/de_DE.twitter.txt

Together this is a bit much for experimenting and building a model. Therefore, I extracted a random sample of 100k lines from each file. Each sample file was then split into a training file and a testing file, with 66% of the lines going to training and the remaining 34% to testing.
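
A minimal sketch of how such a sample and split could be produced in R (the seed, output paths and the choice of the en_US blogs file are just examples; the same steps apply to the other files):

```r
# Draw a 100k-line random sample from one corpus file and split it into
# training and testing files (seed and output paths are assumed examples).
set.seed(42)
lines <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8")
smp   <- sample(lines, 100000)

is_train <- runif(length(smp)) < 0.66        # roughly two thirds for training
writeLines(smp[is_train],  "sample/en_US.blogs.train.txt")
writeLines(smp[!is_train], "sample/en_US.blogs.test.txt")
```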

I tokenized all files, which yielded one token file for English and one for German. Each token file contains one token (word) per line and can be read into R using the read.csv function.
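
A minimal sketch of reading one of these token files (the file name is an assumption; the actual token files may be named differently):

```r
# Read the one-token-per-line file into a character vector.
# quote = "" keeps apostrophes and quotation marks inside tokens intact.
tokens <- read.csv("en_US.train.tokens.txt", header = FALSE, quote = "",
                   stringsAsFactors = FALSE, col.names = "token")$token
length(tokens)   # total number of tokens, cf. the table below
```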

Here I concentrate on the training data set for the English language and do not look at the test data set or the German data.

| Language | Tokens  |
| -------- | ------- |
| en_US    | 7162314 |
| de_DE    | 8055306 |

The tokenization process added the token ‘<s>’ at the start of each sentence and the token ‘</s>’ at the end of each sentence. Numbers and obscene words were replaced by the special token ‘<unk/>’.
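
The exact tokenizer is not shown here, but a rough sketch of the processing described above might look as follows (the sentence splitting and the number handling are simplified assumptions, and the obscene-word filter is omitted):

```r
# Simplified sketch of the tokenization: naive sentence split on . ! ?,
# sentence markers <s>/</s>, and numbers mapped to <unk/>.
tokenize_line <- function(line) {
  sentences <- unlist(strsplit(line, "[.!?]+"))
  sentences <- sentences[nzchar(trimws(sentences))]
  unlist(lapply(sentences, function(s) {
    words <- unlist(strsplit(s, "[^A-Za-z0-9']+"))
    words <- words[nzchar(words)]
    words <- ifelse(grepl("^[0-9]+$", words), "<unk/>", words)  # numbers -> <unk/>
    c("<s>", words, "</s>")                                     # sentence markers
  }))
}

# tokenize_line("I bought 3 apples. They were great!")
# yields: <s> I bought <unk/> apples </s> <s> They were great </s>
```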

The distribution of single tokens or words (also called 1-grams) is highly positively skewed: many 1-grams occur only a few times, and very few occur more than, say, 40 times in the training set.

Let's look briefly at the top 10 1-grams:

##        ngram      x
## 141228   the 261059
## 142970    to 160682
## 12622    and 145042
## 8039       a 136924
## 101335    of 125419
## 75218     is 119691
## 71163      I 101767
## 72536     in  93492
## 141177  that  63403
## 57006    for  57783
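
A frequency table like this one can be computed directly from the token vector; a minimal sketch, assuming `tokens` is the vector read in above:

```r
# Count 1-grams (sentence markers excluded) and list the ten most frequent ones.
words  <- tokens[!tokens %in% c("<s>", "</s>")]
freq1  <- sort(table(words), decreasing = TRUE)
ngram1 <- data.frame(ngram = names(freq1), x = as.integer(freq1),
                     stringsAsFactors = FALSE)
head(ngram1, 10)
```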

The distribution of the 2-grams is quite similar to that of the 1-grams.

The top 10 2-grams are:

##           ngram     x
## 1077099  of the 27341
## 780525   in the 23749
## 760930     I am 13687
## 1590056  to the 12775
## 1104355  on the 10934
## 829871    it is 10473
## 608215  for the 10196
## 1592494   to be  9457
## 824286     is a  8989
## 762246   I have  8917

Finally, we get to the distribution of the 3-grams.

The top 10 3-grams are:

##               ngram    x
## 1593840    I do not 3105
## 2350721  one of the 1845
## 242452     a lot of 1713
## 1589229    I am not 1552
## 1598767 I have been 1291
## 1593382   I did not 1248
## 1793956     It is a 1149
## 3309761     to be a 1049
## 1058094 do not know 1035
## 1793017     it is a  965
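
The 2-gram and 3-gram tables can be built from the same token vector by pasting adjacent tokens together; a minimal sketch (n-grams spanning sentence boundaries are not filtered out here):

```r
# Build an n-gram frequency table by pasting n adjacent tokens together.
count_ngrams <- function(tokens, n) {
  m     <- length(tokens) - n + 1
  grams <- tokens[1:m]
  if (n > 1)
    for (k in 2:n) grams <- paste(grams, tokens[k:(m + k - 1)])
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(freq), x = as.integer(freq), stringsAsFactors = FALSE)
}

ngram2 <- count_ngrams(tokens, 2)
ngram3 <- count_ngrams(tokens, 3)
head(ngram3, 10)   # cf. the table above
```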

How many unique words are there in the training data?

## [1] 158512

How many words cover 50% of the training data?

## [1] 156

How many words cover 90% of the training data?

## [1] 10100
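
Such coverage figures can be computed from the cumulative sum of the sorted 1-gram frequencies; a minimal sketch, assuming the `ngram1` table from above:

```r
# Number of distinct words needed to cover a fraction p of all training tokens.
coverage <- function(freq_table, p) {
  cum <- cumsum(freq_table$x) / sum(freq_table$x)
  which(cum >= p)[1]          # first rank at which coverage reaches p
}

coverage(ngram1, 0.5)   # words covering 50% of the tokens
coverage(ngram1, 0.9)   # words covering 90% of the tokens
```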

Since a relatively small subset of words covers most of the data, we can reduce the model size significantly without losing too much accuracy.

The plan for building a predictive model is to use a multi-step backoff approach: first, look up the two-word prefix in the list of 3-grams; if there are matching 3-grams, return the k most probable ones for prediction. If there is no matching 3-gram, back off to the 2-gram list. If there is no match there either, use the 1-gram list to choose the most probable word as the prediction.
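
A minimal sketch of this backoff lookup, assuming the ngram1/ngram2/ngram3 frequency tables built above (already sorted by decreasing count); this is a plain most-frequent-match backoff without smoothing:

```r
# Predict the k most likely next words for the two-word prefix (w1, w2):
# try 3-grams first, back off to 2-grams, then to the most frequent 1-grams.
predict_next <- function(w1, w2, k = 3) {
  hits <- ngram3[startsWith(ngram3$ngram, paste(w1, w2, "")), ]
  if (nrow(hits) == 0)
    hits <- ngram2[startsWith(ngram2$ngram, paste(w2, "")), ]
  if (nrow(hits) > 0)
    return(sub("^.* ", "", head(hits$ngram, k)))    # last word of each match
  head(setdiff(ngram1$ngram, c("<s>", "</s>")), k)  # fallback: top 1-grams
}

predict_next("one", "of")   # should include "the" among its suggestions
```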

Once we have a predictive model, we can use the test data to evaluate its performance (perplexity, accuracy) and tune it to find the best compromise between model performance and the resources required.
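
For the perplexity part of the evaluation, a minimal sketch (the vector p of per-token probabilities is assumed to come from the eventual model applied to the test tokens):

```r
# Perplexity over a test set, given the probability the model assigns
# to each test token (p is an assumed output of the eventual model).
perplexity <- function(p) exp(-mean(log(p)))
```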