Arshad Ullah
Oct 17th, 2019
Presenting the Predict Word app, an app that predicts the next word in a sentence as you type it in. The app shows a choice of five words; you can click on any word to append it to your sentence. The app is available at this link: https://azullah.shinyapps.io/predictWord/
Some of the unique features of the app are as follows.
Note: the app may initially take about 20-25 seconds to load the data tables into the global environment. Once the data is loaded, you will see the predicted words for the initial test sentence on the screen. The data tables remain available across sessions once loaded, and the response time thereafter is around 20 ms.
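Shiny shares objects created at start-up across all sessions served by the same R process, so the one-time load can live in `global.R`. A minimal sketch of that setup follows; the file and object names are placeholders, not the app's actual ones.

```r
# global.R -- objects created here are loaded once per R process
# and are visible to every user session served by that process.
library(data.table)

# Placeholder file names; each .rds holds one pre-built N-gram table
ngrams <- list(
  bi    = readRDS("data/bigrams.rds"),
  tri   = readRDS("data/trigrams.rds"),
  quad  = readRDS("data/quadgrams.rds"),
  penta = readRDS("data/pentagrams.rds")
)
```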
The basic language model used is the N-gram language model, as explained in Chapter 3 of Jurafsky and Martin's book on Natural Language Processing.
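For reference, the core idea from that chapter: approximate the probability of a word given its full history by conditioning only on the previous N-1 words, estimating the conditional probability from relative counts (maximum likelihood estimation):

$$
P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\,w_n)}{C(w_{n-N+1}^{n-1})}
$$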
The model that was implemented is a 5-gram language model with the “stupid backoff” method for backing off to lower-order N-grams when a given N-gram does not exist in the model corpus. A fixed alpha value of 0.4 was used to calculate the predicted probability.
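The stupid backoff score (Brants et al., 2007) can be written recursively; with the fixed α = 0.4 used here, and noting that S is a relative score rather than a true normalized probability:

$$
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{C(w_{i-k+1}^{i})}{C(w_{i-k+1}^{i-1})} & \text{if } C(w_{i-k+1}^{i}) > 0 \\[1.5ex]
\alpha \cdot S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
$$

The recursion bottoms out at the uni-gram relative frequency $S(w_i) = C(w_i)/N$, where $N$ is the total token count.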
The normalized probability of each N-gram, from uni-grams up to penta-grams, was pre-calculated during the model-building process, using the uni-gram, bi-gram, and tri-gram counts as the basic building blocks. The predicted probability is calculated at runtime, when the prediction function executes.
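How such a pre-calculation might look with `data.table`, shown here for bi-grams normalized by their uni-gram prefix counts (toy data; the column names `prefix`, `word`, and `count` are illustrative, not the app's actual schema):

```r
library(data.table)

# Toy counts standing in for corpus-derived tables
unigrams <- data.table(word = c("the", "a"), count = c(100L, 40L))
bigrams  <- data.table(prefix = c("the", "the"),
                       word   = c("cat", "dog"),
                       count  = c(6L, 4L))

# Attach each bi-gram's prefix (uni-gram) count, then normalize:
# P(word | prefix) = count(prefix word) / count(prefix)
bigrams[unigrams, prefix_count := i.count, on = .(prefix = word)]
bigrams[, prob := count / prefix_count]
```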
To reduce the model size (due to restrictions of the Shiny hosting server and limitations of the development machine), N-grams of order higher than 3 (quad-grams and penta-grams) were limited to terms/features with a minimum frequency of 2.
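One way such frequency pruning can be done with a recent version of `quanteda` (older versions used `min_count` instead of `min_termfreq`); a sketch on toy data:

```r
library(quanteda)

docs <- c("the cat sat on the mat", "the cat sat on the hat")
toks <- tokens(docs)

# Count quad-grams across the corpus
dfm4 <- dfm(tokens_ngrams(toks, n = 4, concatenator = " "))

# Keep only quad-grams seen at least twice
dfm4 <- dfm_trim(dfm4, min_termfreq = 2)
featnames(dfm4)  # "the cat sat on", "cat sat on the"
```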
The training data used to build the final model was a random sample of about 20% of the Blogs and News data and 10% of the Twitter data.
The corpora for the model were obtained from the data provided by SwiftKey as part of the project.
| File | Lines |
|---|---|
| en_US.blogs.txt | 899,288 |
| en_US.news.txt | 1,010,242 |
| en_US.twitter.txt | 2,360,148 |
The above data was split into roughly 80% for training, 15% for evaluation, and 5% for testing.
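A minimal sketch of how the sampling and splitting might be done (the percentages come from the text above; the file name and seed are placeholders):

```r
set.seed(1234)  # placeholder seed for reproducibility

lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

# Keep a random ~20% of the blog lines (10% was used for Twitter)
sampled <- lines[rbinom(length(lines), 1, 0.20) == 1]

# Split the sample into ~80% training, 15% evaluation, 5% test
grp <- sample(c("train", "eval", "test"), length(sampled),
              replace = TRUE, prob = c(0.80, 0.15, 0.05))
train_set <- sampled[grp == "train"]
eval_set  <- sampled[grp == "eval"]
test_set  <- sampled[grp == "test"]
```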
An in-memory corpus (`tm::VCorpus`) was created, and some pre-processing and cleaning of the data was done (using the `tm` and `quanteda` packages and regular expressions), such as removing profanity based on the `lexicon::profanity_banned` word list. The corpus was then tokenized into N-grams using `quanteda::tokens`. The resulting document term/feature matrix was converted to a `data.table`, using a `tidytext` data frame as an intermediate step. The rest of the model-building process uses the `data.table` structure, one table per N-gram order, and the final tables in the app are also data.tables; `data.table` provides fast searches on indexed keys, which is important for prediction performance. The final tables were stored as `.RDS` files, which use gzip to compress the object data. The final disk space occupied by the model files was 124.5 MB.
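A condensed sketch of that pipeline for bi-grams, under the assumptions noted in the comments (the real cleaning steps are omitted, and the column handling is illustrative):

```r
library(tm)
library(quanteda)
library(tidytext)
library(data.table)

docs <- c("the cat sat", "the cat ran", "a dog ran")
corp <- VCorpus(VectorSource(docs))                # in-memory tm corpus
corp <- tm_map(corp, content_transformer(tolower)) # one example cleaning step

# Tokenize into bi-grams with quanteda and count features
toks <- tokens(sapply(corp, as.character))
dfm2 <- dfm(tokens_ngrams(toks, n = 2, concatenator = " "))

# dfm -> tidytext data frame -> data.table of counts per bi-gram
tidy2 <- tidy(dfm2)  # columns: document, term, count
dt2   <- as.data.table(tidy2)[, .(count = sum(count)), by = term]

# Split each bi-gram into prefix and predicted word; index on the prefix
dt2[, c("prefix", "word") := tstrsplit(term, " ", fixed = TRUE)]
setkey(dt2, prefix)  # keyed lookups use binary search, not a full scan

saveRDS(dt2, "bigrams.rds")  # saveRDS compresses with gzip by default
```

At prediction time, `dt2["the"]` (a lookup on the keyed `prefix` column) returns the candidate next words in a single indexed search, which is what keeps the runtime low.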
Extensive testing and benchmarking was done on the model using test data extracted from the original files. Three types of evaluation were performed:
Using 60,000 news, blog, and Twitter samples, a top-5 accuracy of 22% was achieved (i.e., the correct next word appeared among the five suggestions 22% of the time).
Using 60,000 news, blog, and Twitter samples from the test dataset, the following accuracy was achieved per N-gram order:
| N-gram | Accuracy |
|---|---|
| Bi-grams | 19.35% |
| Tri-grams | 49.74% |
| Quad-grams | 68.46% |
| Penta-grams | 69.24% |
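For reference, a minimal sketch of how such top-k accuracy might be computed over held-out text, assuming a `predict_words(prefix, k)` function that returns the model's k candidate words (a hypothetical name, not the app's actual function):

```r
# predict_words(prefix, k) is a hypothetical prediction helper
top_k_accuracy <- function(test_sentences, k = 5) {
  hits <- 0L
  total <- 0L
  for (s in test_sentences) {
    words <- strsplit(tolower(s), "\\s+")[[1]]
    if (length(words) < 2) next
    for (i in 2:length(words)) {
      # Predict the i-th word from the words before it
      preds <- predict_words(paste(words[1:(i - 1)], collapse = " "), k)
      hits  <- hits + (words[i] %in% preds)
      total <- total + 1L
    }
  }
  hits / total
}
```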
The third benchmark produced the following overall results:

| Metric | Result |
|---|---|
| Overall top-3 score | 16.94% |
| Overall top-1 precision | 13.21% |
| Overall top-3 precision | 20.09% |
| Average runtime | 19.99 ms |
| Number of predictions | 63,237 |
| Total memory used | 809.01 MB |
| File size on disk | 124.5 MB |