A fast and accurate next-word prediction (NLP) model built in R and deployed as a Shiny app. The model uses a 5-gram backoff algorithm with two-word chaining for an improved user experience.
Several alternative approaches (e.g. Stupid Backoff smoothing, a content-word-biased ensemble) were evaluated and rejected.
Varying sample sizes and parameter configurations were also tested (e.g. 3-gram, 4-gram, minimal pruning [min_freq=1]), with the final model performing best.
Models were trained on up to 70% of the data, with performance evaluated on a 15% held-out test set. This model was chosen for its speed and accuracy, as well as its small footprint (<15 MB).
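As a rough illustration, the top-3 accuracy reported below can be computed along these lines (a minimal sketch; `top3_accuracy`, `predict_top3`, and the test-set layout are illustrative assumptions, not the project's actual code):

```r
# Sketch: top-3 accuracy on a held-out test set.
# Assumes `test_set` is a data.frame with columns `context` (the preceding
# words as a single string) and `target` (the true next word), and that
# `predict_top3(context)` returns a character vector of up to three
# candidate words. All of these names are hypothetical.
top3_accuracy <- function(test_set, predict_top3) {
  hits <- mapply(
    function(context, target) target %in% predict_top3(context),
    test_set$context, test_set$target
  )
  mean(hits)  # fraction of test cases where the true word is in the top 3
}
```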
The table below shows actual results from a small subset of the 12 model configurations tested.
Model | Sample % | N-gram | Top-3 Accuracy | Size (MB) | Speed (ms) |
---|---|---|---|---|---|
Small | 10% | 4-gram | 23.0% | 2.1 | 5.0 |
Balanced | 50% | 4-gram | 26.7% | 9.0 | 24.7 |
Production | 70% | 5-gram | 28.2% | 14.8 | 32.8 |
The model uses a 5-gram backoff algorithm: the last four words of the input are looked up in the 5-gram table, and if no continuation is found the model backs off to progressively shorter n-grams (4-gram, 3-gram, 2-gram, and finally unigram frequencies), as sketched below.
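A minimal sketch of that backoff lookup, assuming the n-gram frequency tables are stored as data frames with `prefix`, `word`, and `freq` columns (the table layout, `ngram_tables`, and `predict_next` are assumptions for illustration, not the app's actual code):

```r
# Sketch of a 5-gram backoff lookup.
# `ngram_tables` is assumed to be a list of data frames, indexed by n-gram
# order (1 through 5), each with columns: prefix (the n-1 preceding words,
# space-joined), word (candidate next word), and freq (observed count).
predict_next <- function(input, ngram_tables, top_n = 3) {
  words <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  for (n in 5:2) {
    k <- n - 1                              # prefix length for an n-gram
    if (length(words) < k) next             # not enough context; back off
    prefix <- paste(tail(words, k), collapse = " ")
    tbl  <- ngram_tables[[n]]
    hits <- tbl[tbl$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$freq), ]     # most frequent continuations first
      return(head(hits$word, top_n))
    }
  }
  # No prefix matched at any order: fall back to the most frequent unigrams.
  uni <- ngram_tables[[1]]
  head(uni$word[order(-uni$freq)], top_n)
}
```

The two-word chaining mentioned above can presumably be layered on top of this: once a suggestion is chosen, it is appended to the input and the lookup is repeated to offer a likely following word as well.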
This project was created as part of the Johns Hopkins Data Science Capstone using the SwiftKey dataset provided by Coursera.
Code: Free to use and modify (app.R and associated scripts) - see LI link below.
Model: For educational and portfolio demonstration purposes
Please check Coursera’s terms of service regarding commercial use of capstone projects.
Johns Hopkins University - Data Science Specialization
English language corpus (blogs, news, Twitter) provided by SwiftKey for the Johns Hopkins Data Science Capstone Project