SwiftKey Next Word Prediction Using N-Gram Models

Saurabh Bhatt

PROJECT OVERVIEW

Objective:

Build a predictive text model using the SwiftKey dataset
Predict the next word based on user input text
Deploy the model using a Shiny web application

Key Idea:

Use Natural Language Processing (NLP)
Apply n-gram language modeling for prediction

DATASET

Data Sources:

Blogs dataset, News dataset, and Twitter dataset

Dataset Summary:

Blogs: 899,288 lines
News: 1,010,242 lines
Twitter: 2,360,148 lines

Key Insight:

Common English words dominate the corpus
Frequently occurring bigrams and trigrams reveal language patterns
A relatively small vocabulary covers a large portion of total word occurrences
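
The coverage observation above can be checked with a short calculation. The following is a minimal sketch in Python, used here only for illustration; the function name `words_needed_for_coverage` and the toy corpus are hypothetical and are not taken from the project or from the SwiftKey data.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target=0.5):
    """Return how many of the most frequent distinct words are needed
    to account for at least `target` (a fraction) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the vocabulary from most to least frequent, accumulating counts.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy corpus (hypothetical): 13 tokens, 8 distinct words.
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()
needed = words_needed_for_coverage(toy_tokens, target=0.5)
print(f"{needed} of {len(set(toy_tokens))} distinct words cover 50% of tokens")
```

On a real corpus the same loop would be run over the tokenized blogs, news, and Twitter text, typically for several targets such as 50% and 90%.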

MODEL APPROACH

Modeling Technique:

Trigram model (primary prediction)
Bigram model (backoff strategy)

Workflow:

Input text is tokenized
Trigram model is checked first
If no match, bigram model is used
If still no match, default word is returned

This ensures robust prediction even for unseen inputs.
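
The workflow above can be sketched as a simple backoff lookup. This is an illustrative Python sketch, not the project's implementation: the table contents, the tokenizer regex, the names `TRIGRAMS`, `BIGRAMS`, and `predict_next`, and the choice of `"the"` as the default word are all assumptions.

```python
import re

# Hypothetical lookup tables: context -> most frequent next word.
TRIGRAMS = {("thanks", "for"): "sharing", ("one", "of"): "my"}
BIGRAMS = {"for": "a", "of": "course", "i": "am"}
DEFAULT_WORD = "the"  # assumed fallback; the source does not name the default


def tokenize(text):
    """Lowercase the input and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def predict_next(text):
    """Try the trigram table, back off to the bigram table, then the default."""
    tokens = tokenize(text)
    # Step 1: look up the last two words in the trigram table.
    if len(tokens) >= 2:
        hit = TRIGRAMS.get((tokens[-2], tokens[-1]))
        if hit is not None:
            return hit
    # Step 2: back off to the last word in the bigram table.
    if tokens:
        hit = BIGRAMS.get(tokens[-1])
        if hit is not None:
            return hit
    # Step 3: nothing matched, so return the default word.
    return DEFAULT_WORD


print(predict_next("Thanks for"))    # trigram match
print(predict_next("waiting for"))   # bigram backoff
print(predict_next("zzz"))           # default
```

Checking the longer context first means the most specific evidence wins, while the final default guarantees that every input produces some prediction.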

PERFORMANCE AND SHINY APP

Model Performance:

Prediction is near real-time (< 1 second)
Trigram model size: ~72.6 MB
Bigram model size: ~30.8 MB
Uses precomputed frequency-based lookup tables
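
One way such frequency-based lookup tables could be precomputed is sketched below in Python. It is an assumption about the general technique, not the project's actual code: the function `build_lookup`, the whitespace tokenization, and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict


def build_lookup(sentences, n):
    """Map each (n-1)-word context to its most frequent following word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Slide a window of n words over the sentence.
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            counts[context][next_word] += 1
    # Keep only the top continuation per context to keep the table small.
    return {ctx: nxt.most_common(1)[0][0] for ctx, nxt in counts.items()}


# Toy sentences (hypothetical, not from the SwiftKey data).
toy_sentences = [
    "thanks for the follow",
    "thanks for the update",
    "thanks for sharing",
]
bigram_table = build_lookup(toy_sentences, 2)
trigram_table = build_lookup(toy_sentences, 3)
print(trigram_table[("thanks", "for")])
print(bigram_table[("thanks",)])
```

Because all counting happens ahead of time, prediction at run time reduces to a dictionary lookup, which is consistent with the sub-second response reported above.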

Shiny Application:

User enters a phrase and model processes input
Next word is predicted instantly
Output is displayed in web interface

Conclusion:

Successfully built an NLP-based predictive text system
Efficient and responsive Shiny application deployed