The capstone project of the Coursera Data Science Specialization consists of creating a word prediction data product. The basis is a corpus of text in four languages (English, German, Russian and Finnish) taken from blogs, news and Twitter.
I limited my work to the English and German data, since I do not speak Russian or Finnish.
The files each contain between roughly 245,000 and 2.4 million lines of text:
$ wc final/{en_US,de_DE}/*
899288 37334114 210160014 final/en_US/en_US.blogs.txt
1010242 34365936 205811889 final/en_US/en_US.news.txt
2360148 30359852 167105338 final/en_US/en_US.twitter.txt
371440 12652984 85459666 final/de_DE/de_DE.blogs.txt
244743 13216346 95591959 final/de_DE/de_DE.news.txt
947774 11802974 75578341 final/de_DE/de_DE.twitter.txt
Together this is a bit much for experimenting and building a model, so I extracted a random sample of 100k lines from each file. Each sample file was then split into a training file and a testing file, with about 66% of the lines used for training and the remainder for testing.
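A minimal sketch of how such a sample and split can be produced in R (the file names and the exact split mechanism are placeholders for illustration, not the code used in the project):

```r
set.seed(42)
dir.create("sample", showWarnings = FALSE)

# Read one of the original files (file name is only an example).
lines <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8")

# Draw a random sample of 100k lines.
sampled <- sample(lines, 100000)

# Assign roughly two thirds of the sampled lines to training, the rest to testing.
is_train <- rbinom(length(sampled), 1, 0.66) == 1
writeLines(sampled[is_train],  "sample/en_US.blogs.train.txt")
writeLines(sampled[!is_train], "sample/en_US.blogs.test.txt")
```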
Tokenizing all files then yielded one token file for English and one for German. Each token file contains one token (word) per line and can be read into R with the read.csv function.
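For example, loading such a token file could look like this (the file name is an assumption; `header = FALSE` because the file has no header row):

```r
# quote = "" keeps quotation marks inside tokens literal.
tokens <- read.csv("en_US.train.tokens.txt", header = FALSE, quote = "",
                   stringsAsFactors = FALSE, col.names = "token")$token
head(tokens)
```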
Here I concentrate on the training data set for English and look neither at the test data set nor at the German data.
| Language | Tokens |
|---|---|
| en_US | 7162314 |
| de_DE | 8055306 |
The tokenization process added the token ‘<s>’ at the start of each sentence and the token ‘</s>’ at the end of each sentence. Numbers and obscene words were replaced by the special token ‘<unk/>’.
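A rough illustration of these substitutions (the actual regular expressions and the profanity list used in the project are more elaborate than this sketch):

```r
# Wrap a sentence in start/end markers and replace digit runs with <unk/>.
mark_sentence   <- function(s) paste("<s>", s, "</s>")
replace_numbers <- function(s) gsub("[0-9]+", "<unk/>", s)

replace_numbers(mark_sentence("I bought 3 apples"))
## [1] "<s> I bought <unk/> apples </s>"
```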
The distribution of single tokens or words (also called 1-grams) is highly positively skewed: a lot of 1-grams occur only a few times, and very few 1-grams occur more than, say, 40 times in the training set.
Let's look briefly at the top 10 1-grams:
## ngram x
## 141228 the 261059
## 142970 to 160682
## 12622 and 145042
## 8039 a 136924
## 101335 of 125419
## 75218 is 119691
## 71163 I 101767
## 72536 in 93492
## 141177 that 63403
## 57006 for 57783
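Such a frequency listing can be produced with a simple count over the token vector loaded above; this is only a sketch of one way to do it:

```r
# Count how often each token occurs and sort in decreasing order of frequency.
# Sentence markers such as <s> also appear here and may need to be filtered out.
unigrams <- as.data.frame(table(tokens), stringsAsFactors = FALSE)
names(unigrams) <- c("ngram", "x")
unigrams <- unigrams[order(-unigrams$x), ]
head(unigrams, 10)
```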
The distribution of the 2-grams is quite similar to that of the 1-grams:
The top 10 2-grams are:
## ngram x
## 1077099 of the 27341
## 780525 in the 23749
## 760930 I am 13687
## 1590056 to the 12775
## 1104355 on the 10934
## 829871 it is 10473
## 608215 for the 10196
## 1592494 to be 9457
## 824286 is a 8989
## 762246 I have 8917
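The 2-gram and 3-gram counts can be built by pasting shifted copies of the token vector together; the following is a simple base-R sketch (n-grams crossing sentence boundaries are not filtered out here, unlike in a more careful pipeline):

```r
count_ngrams <- function(tokens, n) {
  # Shift the token vector n times and paste the shifted copies element-wise.
  shifted <- lapply(seq_len(n), function(i) tokens[i:(length(tokens) - n + i)])
  grams   <- do.call(paste, shifted)
  counts  <- as.data.frame(table(grams), stringsAsFactors = FALSE)
  names(counts) <- c("ngram", "x")
  counts[order(-counts$x), ]
}

bigrams  <- count_ngrams(tokens, 2)
trigrams <- count_ngrams(tokens, 3)
head(bigrams, 10)
```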
Finally, we get to the distribution of the 3-grams:
The top 10 3-grams are:
## ngram x
## 1593840 I do not 3105
## 2350721 one of the 1845
## 242452 a lot of 1713
## 1589229 I am not 1552
## 1598767 I have been 1291
## 1593382 I did not 1248
## 1793956 It is a 1149
## 3309761 to be a 1049
## 1058094 do not know 1035
## 1793017 it is a 965
How many unique words are there in the training data?
## [1] 158512
How many words cover 50% of the training data?
## [1] 156
How many words cover 90% of the training data?
## [1] 10100
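One way such coverage numbers can be computed is to sort the 1-gram counts in decreasing order and find the smallest number of words whose cumulative frequency reaches the desired fraction of all tokens; a sketch, using the `unigrams` data frame from above:

```r
words_for_coverage <- function(unigrams, coverage) {
  counts <- sort(unigrams$x, decreasing = TRUE)
  # Index of the first position where the cumulative share reaches the target.
  which(cumsum(counts) / sum(counts) >= coverage)[1]
}

words_for_coverage(unigrams, 0.5)   # words needed for 50% coverage
words_for_coverage(unigrams, 0.9)   # words needed for 90% coverage
```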
Since a relatively small subset of words covers most of the data, the model size can be reduced significantly without losing much accuracy.
The plan for building a predictive model is a multi-step back-off approach. First, look up the two-word prefix in the list of 3-grams. If there are matching 3-grams, return the k most probable ones as predictions. If there is no matching 3-gram, back off to the 2-gram list. If there is no match there either, use the 1-gram list and return the most probable word as the prediction.
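A minimal sketch of this back-off lookup, assuming the `unigrams`, `bigrams` and `trigrams` data frames built above (already sorted by decreasing frequency); the real model would need proper probabilities and a faster, prefix-indexed lookup instead of these linear scans:

```r
# Return the last word of an n-gram string.
last_word <- function(ngram) sub(".* ", "", ngram)

predict_next <- function(w1, w2, k = 3) {
  # 1. Try 3-grams starting with "w1 w2 " (trailing space so "in" does not match "into").
  hits <- trigrams[startsWith(trigrams$ngram, paste(w1, w2, "")), ]
  if (nrow(hits) > 0) return(last_word(head(hits$ngram, k)))
  # 2. Back off to 2-grams starting with "w2 ".
  hits <- bigrams[startsWith(bigrams$ngram, paste(w2, "")), ]
  if (nrow(hits) > 0) return(last_word(head(hits$ngram, k)))
  # 3. Fall back to the most frequent 1-grams.
  head(unigrams$ngram, k)
}

predict_next("one", "of")
```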
Once we have a predictive model, we can use the test data to evaluate its performance (perplexity, accuracy) and tweak it to find the best compromise between model performance and the resources required.
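A very rough accuracy check could look like the following sketch, where `test_tokens` is assumed to be the test set loaded the same way as the training tokens: for a sample of positions, check whether the true next word is among the top-k predictions.

```r
evaluate_accuracy <- function(test_tokens, k = 3, n_samples = 1000) {
  idx <- sample(3:length(test_tokens), n_samples)
  hits <- vapply(idx, function(i) {
    test_tokens[i] %in% predict_next(test_tokens[i - 2], test_tokens[i - 1], k)
  }, logical(1))
  mean(hits)
}
```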