- We pre-computed all scores, from 5-grams down to uni-grams, using the previous equation ("Stupid Backoff") on our language model.
- All results were stored in a data table (using the R data.table package).
- We then pruned the results: all entries with a frequency count of less than 3 were removed.
- This pruning is a trade-off between efficiency and the size of the data model (a sketch of the scoring and pruning step is shown below).
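A minimal sketch of this pre-computation and pruning step with data.table; the column names (prefix, word, count), the per-order table layout, and the function name score_and_prune are illustrative assumptions, not the application's actual code:

```r
library(data.table)

# Assumed layout: one data.table per n-gram order (5 down to 1) with columns
#   prefix - the first n-1 words, space separated ("" for uni-grams)
#   word   - the predicted (last) word
#   count  - observed frequency in the training corpus

score_and_prune <- function(ngrams, min_count = 3) {
  # Stupid Backoff score at the highest matching order:
  #   S(word | prefix) = count(prefix word) / count(prefix)
  # Here count(prefix) is approximated by the total count of its continuations;
  # the fixed 0.4 back-off penalty is applied later, at look-up time.
  ngrams[, score := count / sum(count), by = prefix]

  # Prune: drop entries observed fewer than min_count times
  ngrams <- ngrams[count >= min_count]

  # Key by prefix so look-ups use a fast binary search
  setkey(ngrams, prefix)
  ngrams[]
}
```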
- Our prediction application first needs to load the data model into memory (this takes a few seconds).
- Once the model is loaded, and given the previous four words (the context) from the user input, the application reactively starts a look-up in the 5-gram portion of the model.
- If we find enough matching entries, we can return the result: the top 5 entries with their scores, in decreasing order of score, for the user to choose from.
- Otherwise, if no match is found or there are not enough matches (strictly fewer than 5), we do a look-up in the 4-gram portion of the model.
- This process is repeated, backing off as far as the uni-gram portion of the model, until we have our 5 predictions (see the look-up sketch after this list).
- If there is no match at all, we return the top 5 uni-grams (by relative frequency).
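As a rough illustration of this back-off look-up, here is a sketch assuming a list of keyed per-order tables (model[[5]] down to model[[1]]) produced as in the previous sketch, and an already tokenised context; the function and variable names are illustrative, not the application's actual code:

```r
library(data.table)

# model   - list of keyed data.tables, model[[5]] .. model[[1]] (assumed layout)
# context - character vector holding the last words typed by the user
predict_next <- function(model, context, n_predictions = 5) {
  found  <- data.table(word = character(), score = numeric())
  lambda <- 1                    # Stupid Backoff penalty, multiplied by 0.4 per back-off

  for (ng_order in 5:1) {
    key_str <- paste(tail(context, ng_order - 1), collapse = " ")
    hits <- model[[ng_order]][.(key_str), nomatch = 0L]  # keyed prefix look-up
    hits <- hits[!word %in% found$word]                  # keep words not already predicted
    if (nrow(hits) > 0) {
      found <- rbind(found, hits[, .(word, score = score * lambda)])
    }
    if (nrow(found) >= n_predictions) break              # enough candidates: stop backing off
    lambda <- lambda * 0.4                               # back off to the next lower order
  }

  # Return the top predictions in decreasing order of score
  head(found[order(-score)], n_predictions)
}
```

At the uni-gram level the prefix is the empty string, so the look-up matches every remaining uni-gram and the function falls back to the most frequent words, as described above.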
- Using the benchmark provided (cf. the reference section), we obtained:
Overall top-3 score: 18.51 %
Overall top-1 precision: 13.93 %
Overall top-3 precision: 22.49 %
Average runtime: 15.11 msec
Number of predictions: 28464
Total memory used: 152.47 MB