Joao Martins
October 3rd, 2016
1: Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. (2007). Large language models in machine translation. In EMNLP/CoNLL 2007
2: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
In general, n-gram models are the simplest language model that is based on the idea of attributing probabilities to sentences. When predicting words, we pick the most probable (in our case, the 5 most probable) sentences that have one extra word than the given input. The probabilities are calculated as follows:
\[ S(w_i|w^{i-1} _{i-k+1}) = \left\{ \begin{array}{ll} \frac{count(w^{i}_{i-k+1})}{count(w^{i-1}_{i-k+1})} & \text{if}~count(w^{i}_{i-k+1}) > 0 \\ \lambda S(w_{i}|w^{i-1}_{i-k+2}) & \text{otherwise} \end{array} \right. \]
Here \( w^{j}_{i} \) means the words \( i \) through \( j \) of a sentence, and the function \( S(w_{i}|w^{b}_{a}) \) computes the probability of the sentence \( w_{a} ... w_{b} w_{i} \) — so the \( w_{i} \) words that output the 5 largest probabilities are our predicted (or candidate) words.
The final implementation uses tables of up to 5-grams and \( \lambda = 0.4 \) (as suggested in the original paper); higher order n-grams did not increase accuracy and used considerably more RAM. The “raw” n-gram and frequency tables, once complete, occupied ~200 MB of RAM in R. Several steps were taken to reduce this:
These two measures cut the used RAM to ~30MB, and also considerably decreased the execution time from ~2s to ~70 msec, without affecting accuracy. Given these numbers, this implementation could be used as a good starting point for an implementation targeted for mobile devices.
Overall top-3 score: 16.28 %
Overall top-1 precision: 11.77 %
Overall top-3 precision: 20.12 %
Average runtime: 70.16 msec
Number of predictions: 28464
Total memory used: 34.82 MB
Dataset details
Dataset "blogs" (599 lines, 14587 words, hash 14b3c593e543eb8b2932cf00b646ed653e336897a03c82098b725e6e1f9b7aa2)
Score: 16.00 %, Top-1 precision: 11.35 %, Top-3 precision: 19.94 %
Dataset "tweets" (793 lines, 14071 words, hash 7fa3bf921c393fe7009bc60971b2bb8396414e7602bb4f409bed78c7192c30f4)
Score: 16.55 %, Top-1 precision: 12.19 %, Top-3 precision: 20.30 %