Data Science Specialization Capstone Project - Johns Hopkins University - An NLP Model for next word(s) predictions

Model Overview

A compact, fast, and accurate next-word prediction (NLP) model built in R and published with Shiny. The model uses a 5-gram backoff algorithm with two-word chaining for an improved user experience.

Features

28.2% top-3 accuracy on held-out test data
14.8 MB model size - fits on any smartphone
~30 ms median prediction time - the wait is imperceptible to users
Smart two-word chaining - e.g. predicts “the beach” instead of just “the”
102-million-word training corpus - US English text from Twitter (X), news, and blogs

LIVE DEMO LINK - An interactive web application (R Shiny)

Model Creation and Approach

Several other model approaches (e.g. Stupid Backoff Smoothing, Content-Word Biased Ensemble) were considered or tested, then rejected.

Varying sample sizes and parameter configurations were attempted (e.g. 3-gram, 4-gram, minimal pruning [min_freq=1]), with this final model performing best.
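
As a rough illustration of what n-gram order and min_freq pruning mean, here is a minimal base R sketch. The helper names count_ngrams and prune_ngrams are hypothetical and not taken from the project's scripts. Tokenization is simplified to lowercasing and whitespace splitting, and min_freq is assumed to mean "keep n-grams seen at least this many times", so min_freq = 1 keeps everything.

```r
# Hypothetical helpers for illustration; not the project's actual code.

# Count every n-gram of order n in a character vector of sentences.
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(sentences, function(s) {
    tokens <- strsplit(tolower(trimws(s)), "\\s+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) return(setNames(integer(0), character(0)))
  counts <- table(grams)
  setNames(as.integer(counts), names(counts))
}

# Assumed meaning of min_freq: drop n-grams seen fewer than min_freq times.
prune_ngrams <- function(counts, min_freq = 1) {
  counts[counts >= min_freq]
}

sentences <- c("the cat sat on the mat", "the cat sat on the sofa")
bigrams <- count_ngrams(sentences, 2)
bigrams[["the cat"]]                          # 2
pruned <- prune_ngrams(bigrams, min_freq = 2)
length(pruned)                                # 4 (the two singletons are dropped)
```

Raising min_freq shrinks the model by discarding rare n-grams, at the cost of never being able to predict them.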

Models were trained on up to 70% of the data, and performance was evaluated on a 15% held-out test set. This model was chosen for its speed and accuracy, as well as its very small size (<15 MB).
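
The split and the top-3 accuracy metric could be sketched as follows. This is an assumption-laden illustration: split_corpus and top3_accuracy are hypothetical names, the random line-level split is a guess at the method, and the remaining 15% of lines is simply labeled "unused" because the README does not say how it was used.

```r
# Hypothetical helper: label each corpus line "train", "test", or "unused".
split_corpus <- function(n_lines, train_frac = 0.70, test_frac = 0.15,
                         seed = 42) {
  set.seed(seed)
  idx <- sample.int(n_lines)               # random permutation of line indices
  n_train <- round(train_frac * n_lines)
  n_test <- round(test_frac * n_lines)
  role <- rep("unused", n_lines)
  role[idx[seq_len(n_train)]] <- "train"
  role[idx[n_train + seq_len(n_test)]] <- "test"
  role
}

# Top-3 accuracy: share of test cases whose true next word appears among
# the first three predictions.
top3_accuracy <- function(predictions, truth) {
  hits <- mapply(function(pred, actual) actual %in% head(pred, 3),
                 predictions, truth)
  mean(hits)
}

role <- split_corpus(100)
table(role)

predictions <- list(c("the", "a", "my"),
                    c("beach", "park", "store"),
                    c("is", "was", "will"),
                    c("you", "it", "me"))
truth <- c("a", "mall", "is", "them")
top3_accuracy(predictions, truth)          # 0.5 (2 of 4 hits)
```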

Model Performance Comparison

Measured results for three of the 12 models tested.

Model	Sample %	N-gram	Top-3 Accuracy	Size (MB)	Speed (ms)
Small	10%	4-gram	23.0%	2.1	5.0
Balanced	50%	4-gram	26.7%	9.0	24.7
Production	70%	5-gram	28.2%	14.8	32.8
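
Prediction times like those in the Speed column could be measured roughly as shown below. Whether the project timed predictions this way is an assumption. median_latency_ms is a hypothetical helper, and toy_predict stands in for the real prediction function.

```r
# Hypothetical helper: median wall-clock time per prediction, in milliseconds.
median_latency_ms <- function(predict_fn, inputs) {
  timings <- vapply(inputs, function(x) {
    start <- proc.time()[["elapsed"]]
    predict_fn(x)
    (proc.time()[["elapsed"]] - start) * 1000
  }, numeric(1), USE.NAMES = FALSE)
  median(timings)
}

# Stand-in for the real model's prediction function.
toy_predict <- function(text) toupper(text)

latency <- median_latency_ms(toy_predict,
                             c("i want to", "see you", "thanks for the"))
```

The median is less sensitive than the mean to occasional slow calls, such as the first call after the model is loaded.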

How It Works

The model uses a 5-gram backoff algorithm:

Analyzes user input and extracts context (using up to 4 previous words)
Attempts to match the context against 5-gram patterns (sequences of 5 words). This lookup is extremely fast because the n-grams are hashed during the model-building phase.
If no match is found, the model “backs off” to shorter n-grams (5→4→3→2→1)
To improve the user experience, if a prediction is a stopword, the model chains a second lookup to predict the following word and returns a two-word prediction.
Finally, the model returns the top 3 most likely continuations of the user's input.
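
The steps above can be sketched in base R as follows. This is a toy illustration under stated assumptions, not the project's implementation. The model is a hand-built environment (R's built-in hash table) mapping a context string to candidate next words, most frequent first. The "<unigram>" key, the stopword list, and the function names are all hypothetical. The sketch returns the first match it finds and does no probability scoring.

```r
# Toy model: context string -> candidate next words, most frequent first.
model <- new.env(hash = TRUE, parent = emptyenv())
model[["want to go to"]] <- c("the", "bed", "sleep")    # 5-gram level
model[["to the"]]        <- c("beach", "store", "park") # 3-gram level
model[["the"]]           <- c("same", "first", "best")  # 2-gram level
model[["<unigram>"]]     <- c("the", "to", "and")       # final fallback

stopwords <- c("the", "a", "to", "and", "of")           # illustrative list

# Try the longest context first (up to 4 words), then back off word by word.
lookup_backoff <- function(tokens, model, max_context = 4) {
  n <- min(length(tokens), max_context)
  while (n >= 1) {
    key <- paste(tail(tokens, n), collapse = " ")
    hit <- model[[key]]                 # NULL when the key is absent
    if (!is.null(hit)) return(hit)
    n <- n - 1
  }
  model[["<unigram>"]]
}

# Return up to k predictions; a stopword prediction is chained to a
# second lookup so the user sees a two-word suggestion.
predict_next <- function(text, model, stopwords, k = 3) {
  tokens <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  candidates <- head(lookup_backoff(tokens, model), k)
  vapply(candidates, function(w) {
    if (!(w %in% stopwords)) return(w)
    follow <- lookup_backoff(c(tokens, w), model)
    if (length(follow) == 0) return(w)
    paste(w, follow[1])
  }, character(1), USE.NAMES = FALSE)
}

predict_next("I want to go to", model, stopwords)
# "the beach" "bed" "sleep"
predict_next("we drove to the", model, stopwords)
# "beach" "store" "park"
```

In the first call the 4-word context matches directly and the stopword "the" is chained to "beach". In the second call no 4-word or 3-word context matches, so the lookup backs off to the 2-word context "to the".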

System Requirements

R version: 4.0 or higher
RAM: 1 GB minimum (model + Shiny overhead)
Storage: ~20 MB (app + model)
Platform: Any (Windows, macOS, Linux)
Free tier compatible: Yes (shinyapps.io)

License & Usage

This project was created as part of the Johns Hopkins Data Science Capstone using the SwiftKey dataset provided by Coursera.

Code: Free to use and modify (app.R and associated scripts) - see the LinkedIn link below.

Model: For educational and portfolio demonstration purposes

Please check Coursera’s terms of service regarding commercial use of capstone projects.

Author

Piotr (Peter) Cebo

Please contact me on LinkedIn if you’d like to collaborate!

Acknowledgments

Johns Hopkins University - Data Science Specialization
English language corpus (blogs, news, Twitter) provided by SwiftKey for the Johns Hopkins Data Science Capstone Project