Our Word Predictor project will employ a statistical model based on the observation that words don’t follow each other in arbitrary order (at least not when they are meant to express something), so the probability of a word at a given position can be estimated from the words preceding it. For example, the word ‘shuttle’ is more likely to occur after ‘space’ or ‘airport’ than after most other words.
This idea leads to a kind of dictionary that tells us, for a given pre-text, which words are expected to follow it and with what probability.
Naturally, this look-behind doesn’t have to be restricted to a single preceding word; longer pre-texts describe the situation more precisely, but this comes at a price: the dictionary grows very quickly.
Therefore it is essential to decide how long a pre-text is worth considering in each case, and not to extend it when doing so would bring no significant improvement.
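To make this concrete, here is a minimal sketch of such a ‘dictionary’ as an R lookup table; the pre-texts, candidate words and probabilities are made up for illustration only.

```r
# Toy lookup table: pre-text -> candidate next words with estimated probabilities.
# All values here are invented; the real table will be estimated from the corpora.
lookup <- list(
  "space"   = c(shuttle = 0.12, station = 0.09, travel = 0.05),
  "airport" = c(shuttle = 0.08, security = 0.07, lounge = 0.03)
)

# Return the k most probable follower words for a given pre-text.
predict_next <- function(pretext, k = 3) {
  candidates <- lookup[[pretext]]
  if (is.null(candidates)) return(character(0))   # unseen pre-text
  head(names(sort(candidates, decreasing = TRUE)), k)
}

predict_next("space")   # "shuttle" "station" "travel"
```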
The source of this document, including the R code, is available on GitHub.
For training our model we use 3 text corpora: one from blog comments, one from newsfeeds and one from tweets.
The raw data must be pre-processed even for an exploratory analysis. The steps we take here will need further refinement once we have gained more detailed information, but for a start they’ll do.
When considering sequences of words, it only makes sense to process words that semantically belong together: we can’t really predict the start of one sentence from the end of the previous one.
Similarly, we can’t predict one part of a compound sentence from the previous part either, so we must first split each entry into such sub-sequences along the usual punctuation marks.
Then, to reduce the size of the task, we can coalesce all forms of the same word into one stem, so that, for example, ‘types’, ‘typed’ and ‘typing’ all become simply ‘typ’.
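A rough sketch of these two steps is shown below; the punctuation set and the stemmer (SnowballC::wordStem) are illustrative choices, not necessarily the ones the final code will use.

```r
library(SnowballC)   # Porter-style stemmer; one possible choice

# Split an entry into sub-sequences along the usual punctuation marks,
# so that word sequences never cross sentence or clause boundaries.
split_clauses <- function(entry) {
  pieces <- trimws(unlist(strsplit(entry, "[.!?;:,]+")))
  pieces[nchar(pieces) > 0]
}

# Lowercase, tokenize on whitespace, and reduce each word to its stem.
stem_words <- function(clause) {
  words <- unlist(strsplit(tolower(clause), "\\s+"))
  wordStem(words, language = "english")
}

lapply(split_clauses("He types fast, and she typed slowly."), stem_words)
```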
Even after pre-processing, generating all 3-grams or 4-grams would still be overwhelming, so sooner or later we will need some progressive filtering.
If a word occurs only 5 times, then it can occur in at most 10 word-pair occurrences (5 as the first member, 5 as the second).
If a word pair occurs only 5 times, then it can occur in at most 10 word-triplet occurrences (5 at the beginning, 5 at the end).
As a general rule: if an N-gram is insignificantly rare, then so will be all the (N+1)-grams that contain it.
Therefore we count the words (1-grams) first and discard the rarest ones (say, those making up the last 5% of all occurrences); then we only need to collect the 2-grams built from the remaining words, and so on.
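One possible implementation of this progressive filtering with dplyr is sketched below; word_tbl and its columns (doc_id, position, word) are a hypothetical layout used only for illustration.

```r
library(dplyr)

# Hypothetical token table: one row per word occurrence, in reading order.
word_tbl <- data.frame(
  doc_id   = c(1, 1, 1, 1, 2, 2, 2),
  position = c(1, 2, 3, 4, 1, 2, 3),
  word     = c("the", "space", "shuttle", "launch", "the", "space", "station")
)

# 1-gram counts with cumulative coverage; drop the words in the last 5%.
unigrams <- word_tbl %>%
  count(word, sort = TRUE) %>%
  mutate(coverage = cumsum(n) / sum(n))

kept_words <- unigrams %>% filter(coverage <= 0.95) %>% pull(word)

# Collect only those 2-grams whose both members survived the 1-gram filter.
bigrams <- word_tbl %>%
  arrange(doc_id, position) %>%
  group_by(doc_id) %>%
  mutate(next_word = lead(word)) %>%
  ungroup() %>%
  filter(!is.na(next_word), word %in% kept_words, next_word %in% kept_words) %>%
  count(word, next_word, sort = TRUE)
```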
Suppose we have a 3-gram ‘A-B-C’, and we collect statistics on which ‘-D’ words follow it and with what probability.
Then we collect the same statistics for its trailing 2-gram ‘B-C’ alone.
If the probabilities after ‘A-B-C’ are roughly the same as after ‘B-C’ alone, then there is no point in extending the 2-gram into a 3-gram with ‘A-’.
In general: if the distribution of the followers of an N-gram is close to that of the followers of its trailing (N-1)-gram, then this N-gram brings no new information and can be discarded.
As this requires comparing distributions (against some p-value limit, for example) for each N-gram, this step will need the most processing capacity.
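The comparison criterion has not been chosen yet; one possible concrete choice is a chi-squared goodness-of-fit test, sketched below on made-up follower counts.

```r
# Hypothetical follower counts of a 3-gram "A-B-C" and of its trailing 2-gram "B-C".
followers_long  <- c(engine = 40,  program = 35,  station = 5)
followers_short <- c(engine = 400, program = 380, station = 60, taxi = 20)

# Keep the longer N-gram only if its follower distribution differs significantly
# from the follower distribution of its trailing (N-1)-gram.
keep_longer_ngram <- function(long_counts, short_counts, p_limit = 0.05) {
  words    <- names(short_counts)
  observed <- long_counts[words]
  observed[is.na(observed)] <- 0                 # followers never seen after the longer N-gram
  expected <- short_counts / sum(short_counts)   # follower probabilities of the shorter N-gram
  test <- suppressWarnings(chisq.test(observed, p = expected))
  test$p.value < p_limit
}

keep_longer_ngram(followers_long, followers_short)
```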
At this stage the pre-processing steps described above have been completed.
The distribution of the most frequent 90% of the words in the corpora shows that these top words are disproportionately frequent.
The coverage plot seems to confirm the hypothesis that a small fraction of the distinct words covers the bulk of all word occurrences.
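For reference, the coverage figures shown below are simply cumulative shares of all word occurrences; here is a sketch of how such a column can be computed with dplyr (toy counts, not the real ones):

```r
library(dplyr)

# Toy per-word counts; in the report this is computed for each corpus separately.
word_counts <- data.frame(word = c("the", "and", "to", "rare"),
                          n    = c(500, 300, 150, 50))

word_counts %>%
  arrange(desc(n)) %>%
  mutate(coverage = cumsum(n) / sum(n))
```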
The most frequent words seem to be largely the same across the sources:
Blogs
```
## # A tibble: 214,690 x 3
##    word        n coverage
##    <chr>   <int>    <dbl>
##  1 the   1860612   0.0490
##  2 and   1094539   0.0778
##  3 to    1069563   0.106
##  4 a      903190   0.130
##  5 of     876849   0.153
##  6 i      833477   0.175
##  7 in     598924   0.191
##  8 it     518326   0.204
##  9 that   484487   0.217
## 10 is     432735   0.228
## 11 for    363882   0.238
## 12 you    314306   0.246
## 13 with   286774   0.254
## 14 was    278355   0.261
## 15 on     276712   0.268
## # ... with 2.147e+05 more rows
```
News
```
## # A tibble: 184,512 x 3
##    word        n coverage
##    <chr>   <int>    <dbl>
##  1 the   1974498   0.0563
##  2 to     906184   0.0821
##  3 a      893002   0.108
##  4 and    889529   0.133
##  5 of     774527   0.155
##  6 in     679429   0.174
##  7 that   371693   0.185
##  8 for    353916   0.195
##  9 it     343607   0.205
## 10 is     284250   0.213
## 11 on     270000   0.221
## 12 with   254820   0.228
## 13 said   250434   0.235
## 14 he     250432   0.242
## 15 was    228972   0.249
## # ... with 1.845e+05 more rows
```
Tweets
```
## # A tibble: 278,480 x 3
##    word       n coverage
##    <chr>  <int>    <dbl>
##  1 the   937797   0.0311
##  2 to    788900   0.0572
##  3 i     727682   0.0813
##  4 a     614906   0.102
##  5 you   549470   0.120
##  6 and   438723   0.134
##  7 it    418270   0.148
##  8 for   385440   0.161
##  9 in    380832   0.174
## 10 of    359696   0.186
## 11 is    358897   0.197
## 12 my    292117   0.207
## 13 on    278249   0.216
## 14 that  277033   0.225
## 15 be    211839   0.233
## # ... with 2.785e+05 more rows
```
Fortunately the information-gain-based filtering will eliminate most of the rare words (including, for example, most proper nouns), so our final dictionary won’t be impractically large.
Next we will implement the n-gram generation using tokenizers::tokenize_ngrams(...), and the measurement of the information gain.
As it involves manipulating large quantities of data, the data representation may require changes as well.
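A first sketch of that generation step on a toy input (the n and n_min arguments of tokenizers::tokenize_ngrams control the range of n-gram lengths produced):

```r
library(tokenizers)

# Toy input for illustration; the real input will be the pre-processed corpora.
tokenize_ngrams("the space shuttle is ready for launch", n = 3, n_min = 2)
```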