Model Overview

A fast, compact, and accurate next-word prediction (NLP) model built in R and published with Shiny. The model uses a 5-gram backoff algorithm with two-word chaining for an improved user experience.

Features

  • 28.2% top-3 accuracy on held-out test data
  • 14.8 MB model size - small enough to fit on a smartphone
  • ~30 ms median prediction time - the wait is effectively imperceptible to users
  • Smart two-word chaining - e.g. predicts “the beach” instead of just “the”
  • 102 million-word training corpus - Twitter (X), news, and blog text in US English

Model Creation and Approach

Several alternative approaches (e.g. Stupid Backoff smoothing, a content-word-biased ensemble) were considered or tested, then rejected.

Various sample sizes and parameter configurations were tried (e.g. 3-gram, 4-gram, minimal pruning [min_freq=1]); the final model performed best.
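To make the pruning parameter concrete, here is a minimal base-R sketch of counting n-grams and dropping rare ones. The helper names (count_ngrams, prune_ngrams) and the toy sentences are hypothetical and only illustrate the idea; they are not taken from the app's scripts, which may tokenize and count differently.

```r
# Hypothetical helper: count all n-grams of order n in a vector of sentences.
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(sentences, function(s) {
    words <- strsplit(tolower(s), "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

# Hypothetical helper: keep only n-grams seen at least min_freq times.
# min_freq = 1 keeps everything (minimal pruning).
prune_ngrams <- function(counts, min_freq) {
  counts[counts >= min_freq]
}

# Toy data, invented for this example.
toy <- c("see you at the beach",
         "see you at the park",
         "meet you at the beach")

trigrams <- count_ngrams(toy, 3)
pruned   <- prune_ngrams(trigrams, min_freq = 2)
print(pruned)
```

With min_freq = 2, trigrams that occur only once in the toy data ("at the park", "meet you at") are dropped. A higher threshold gives a smaller table at the cost of coverage.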

Models were trained on up to 70% of the data, and performance was evaluated on a 15% held-out test set. This model was chosen for its speed and accuracy, as well as its very small size (<15 MB).
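The split and the top-3 accuracy metric can be sketched in base R as follows. This is an illustration under assumptions: the helper names (split_corpus, top3_accuracy), the random line-level split, and the seed are hypothetical, and the toy predictions are invented. The 70% and 15% defaults mirror the figures above; the sketch leaves the remaining lines unused.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: randomly split line indices into train and test parts.
split_corpus <- function(n_lines, train_frac = 0.70, test_frac = 0.15) {
  idx     <- sample.int(n_lines)
  n_train <- round(train_frac * n_lines)
  n_test  <- round(test_frac * n_lines)
  list(train = idx[seq_len(n_train)],
       test  = idx[n_train + seq_len(n_test)])
}

# Top-3 accuracy: share of test cases where the true next word is among
# the (up to) three suggestions returned for that case.
top3_accuracy <- function(predictions, truth) {
  hits <- mapply(function(p, t) t %in% head(p, 3), predictions, truth)
  mean(hits)
}

parts <- split_corpus(100)

# Toy predictions and true next words, invented for this example.
preds <- list(c("the", "a", "my"),
              c("beach", "park", "store"),
              c("you", "me", "it"),
              c("day", "time", "one"))
truth <- c("a", "mall", "you", "one")

print(top3_accuracy(preds, truth))  # 0.75: three of the four true words are in the top 3
```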

Model Performance Comparison

Measured results for a subset of the models tested (12 models in total).

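| Model      | Sample % | N-gram | Top-3 Accuracy | Size (MB) | Speed (ms) |
|------------|----------|--------|----------------|-----------|------------|
| Small      | 10%      | 4-gram | 23.0%          | 2.1       | 5.0        |
| Balanced   | 50%      | 4-gram | 26.7%          | 9.0       | 24.7       |
| Production | 70%      | 5-gram | 28.2%          | 14.8      | 32.8       |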

How It Works

The model uses a 5-gram backoff algorithm:

  1. Analyzes user input and extracts context (using up to 4 previous words)
  2. Attempts to match against 5-gram patterns (sequences of 5 words). This lookup is extremely fast because the n-grams are hashed during the model-building phase.
  3. If no match is found, the model “backs off” to shorter n-grams (5→4→3→2→1)
  4. To improve the user experience, if a prediction is a stopword, the model chains a second prediction and returns a two-word suggestion.
  5. Finally, the model returns the top 3 most likely continuations of the user input
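The steps above can be sketched in a few lines of base R. This is a toy illustration, not the app's actual code: the lookup table, the stopword list, and the function names (predict_next, predict_chained) are all hypothetical, and the real model may store scores or probabilities instead of pre-ranked word lists. An R environment stands in for the hashed lookup.

```r
# Toy lookup table: context string -> next words, most frequent first.
# An environment is R's built-in hashed key-value store.
model <- new.env(hash = TRUE)
assign("see you at the", c("beach", "park", "game"), envir = model)
assign("at the",         c("end", "same", "moment"), envir = model)
assign("going to",       c("the", "be", "get"),      envir = model)
assign("to the",         c("beach", "store", "gym"), envir = model)

unigrams  <- c("the", "to", "and")           # final fallback (1-grams)
stopwords <- c("the", "to", "and", "a", "of")  # invented short list

# Steps 1-3: take up to 4 previous words, try the longest context first,
# and back off to shorter contexts until something matches.
predict_next <- function(text, k = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  for (n in rev(seq_len(min(4, length(words))))) {
    key <- paste(tail(words, n), collapse = " ")
    if (exists(key, envir = model, inherits = FALSE)) {
      return(head(get(key, envir = model, inherits = FALSE), k))
    }
  }
  head(unigrams, k)
}

# Steps 4-5: if a suggestion is a stopword, predict one more word
# and return the two-word phrase instead.
predict_chained <- function(text, k = 3) {
  vapply(predict_next(text, k), function(w) {
    if (w %in% stopwords) {
      paste(w, predict_next(paste(text, w), 1))
    } else {
      w
    }
  }, character(1), USE.NAMES = FALSE)
}

print(predict_next("see you at the"))    # full 4-word context matches
print(predict_next("meet me at the"))    # backs off to "at the"
print(predict_chained("I am going to"))  # "the" is chained to "the beach"
```

In this sketch the backoff stops at the first context that matches, so a longer match always takes priority over a shorter one.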

System Requirements

  • R version: 4.0 or higher
  • RAM: 1 GB minimum (model + Shiny overhead)
  • Storage: ~20 MB (app + model)
  • Platform: Any (Windows, macOS, Linux)
  • Free tier compatible: Yes (shinyapps.io)

License & Usage

This project was created as part of the Johns Hopkins Data Science Capstone using the SwiftKey dataset provided by Coursera.

Code: Free to use and modify (app.R and associated scripts) - see the LinkedIn link below.

Model: For educational and portfolio demonstration purposes

Please check Coursera’s terms of service regarding commercial use of capstone projects.

Author

Piotr (Peter) Cebo

Please contact me on LinkedIn if you’d like to collaborate!

Acknowledgments

  • Johns Hopkins University - Data Science Specialization

  • English language corpus (blogs, news, Twitter) provided by SwiftKey for the Johns Hopkins Data Science Capstone Project