SwiftKey Next Word Prediction Using N-Gram Models

Saurabh Bhatt

PROJECT OVERVIEW

Objective:

  • Build a predictive text model using the SwiftKey dataset
  • Predict the next word based on user input text
  • Deploy the model using a Shiny web application

Key Idea:

  • Use Natural Language Processing (NLP)
  • Apply n-gram language modeling for prediction

DATASET

Data Sources:

  • Blogs dataset, News dataset, and Twitter dataset

Dataset Summary:

  • Blogs: 899,288 lines
  • News: 1,010,242 lines
  • Twitter: 2,360,148 lines

Key Insights:

  • Common English words dominate the corpus
  • Frequently occurring bigrams and trigrams reveal language patterns
  • A relatively small vocabulary covers a large portion of total word occurrences
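
The coverage observation can be checked with a short calculation. The following is a minimal Python sketch, not the project's own code: the function name `words_needed_for_coverage` and the toy sentence are illustrative assumptions, and tokenization is a plain whitespace split.

```python
from collections import Counter

def words_needed_for_coverage(tokens, target):
    """Return how many of the most frequent distinct words are needed
    to cover at least `target` (a fraction in 0..1) of all occurrences."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the vocabulary from most to least frequent, accumulating coverage.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)

# Toy sample for illustration only (not the SwiftKey data).
sample = "the cat sat on the mat and the dog sat on the rug".split()
print(words_needed_for_coverage(sample, 0.5))
```

On a real corpus the same loop would be run over the full token list, and the returned rank compared with the total vocabulary size.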

MODEL APPROACH

Modeling Technique:

  • Trigram model (primary prediction)
  • Bigram model (backoff strategy)

Workflow:

  1. Input text is tokenized
  2. Trigram model is checked first
  3. If no match, bigram model is used
  4. If still no match, default word is returned

This ensures robust prediction even for unseen inputs.
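
The workflow can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the deployed code: the function `predict_next_word`, the tiny hand-written tables, whitespace tokenization, and the default word "the" are all hypothetical choices made for the example.

```python
def predict_next_word(text, trigram_table, bigram_table, default_word="the"):
    """Predict the next word with trigram lookup, backing off to bigram,
    then to a default word. Tables map a tuple of preceding words to one word."""
    # Step 1: tokenize (assumption: lowercase and split on whitespace).
    tokens = text.lower().split()

    # Step 2: try the trigram table, keyed by the last two words.
    if len(tokens) >= 2:
        word = trigram_table.get((tokens[-2], tokens[-1]))
        if word is not None:
            return word

    # Step 3: back off to the bigram table, keyed by the last word.
    if tokens:
        word = bigram_table.get((tokens[-1],))
        if word is not None:
            return word

    # Step 4: nothing matched, so return the default word.
    return default_word

# Hypothetical lookup tables for illustration only.
trigram_table = {("thank", "you"): "for", ("i", "love"): "you"}
bigram_table = {("you",): "are", ("love",): "it", ("good",): "morning"}

print(predict_next_word("Thank you", trigram_table, bigram_table))
print(predict_next_word("very good", trigram_table, bigram_table))
print(predict_next_word("xyzzy", trigram_table, bigram_table))
```

The first call is answered by the trigram table, the second falls back to the bigram table, and the third reaches the default word because neither table has a match.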

PERFORMANCE AND SHINY APP

Model Performance:

  • Prediction is near real-time (< 1 second)
  • Trigram model size: ~72.6 MB
  • Bigram model size: ~30.8 MB
  • Uses precomputed frequency-based lookup tables
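
One way such frequency-based lookup tables could be precomputed is shown below. It is a minimal Python sketch that assumes each table keeps only the single most frequent next word per prefix; the helper name `build_lookup_table` and the toy token list are illustrative, and the actual tables may store more than this.

```python
from collections import Counter, defaultdict

def build_lookup_table(tokens, n):
    """Map each (n-1)-word prefix to its most frequent following word."""
    counts = defaultdict(Counter)
    # Slide a window of n tokens over the text and count continuations.
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        counts[prefix][next_word] += 1
    # Keep only the top continuation so prediction is a single dictionary lookup.
    return {prefix: nxt.most_common(1)[0][0] for prefix, nxt in counts.items()}

# Toy token list for illustration only (not the SwiftKey data).
tokens = "i love you and i love you and i love it".split()
bigram_table = build_lookup_table(tokens, 2)
trigram_table = build_lookup_table(tokens, 3)

print(bigram_table[("love",)])
print(trigram_table[("i", "love")])
```

Because the counting happens once, ahead of time, each prediction at run time costs only a dictionary lookup, which is consistent with the sub-second response time reported above.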

Shiny Application:

  1. User enters a phrase and the model processes the input
  2. Next word is predicted instantly
  3. Output is displayed in web interface

Conclusion:

  • Successfully built an NLP-based predictive text system
  • Efficient and responsive Shiny application deployed