Page 1: Overview and Motivation

  • This project builds a next-word prediction app using the SwiftKey dataset (blogs and news).
  • The goal is to create a lightweight, fast, and reasonably accurate prediction model using NLP tools in R.
  • Due to memory/runtime limitations:
    • Only 50% of the blogs and news data was used (a sampling sketch follows this list).
    • Twitter data was excluded: it is too noisy and informal, and it interfered with clean predictions.
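
  A minimal sketch of what this subsetting might look like in R (not the submitted code; the file paths, seed, and sampling call below are illustrative assumptions):

      # Sample 50% of the blog and news lines; skip the Twitter file entirely.
      set.seed(42)                                   # arbitrary seed, for reproducibility
      sample_half <- function(path) {
        lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
        lines[sample(length(lines), size = floor(0.5 * length(lines)))]
      }
      blogs <- sample_half("final/en_US/en_US.blogs.txt")
      news  <- sample_half("final/en_US/en_US.news.txt")
      corpus_text <- c(blogs, news)                  # en_US.twitter.txt deliberately excluded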

Page 2: Prediction Algorithm

  • The model is based on an n-gram backoff algorithm:
    • Uses trigrams → bigrams → unigrams, backing off when higher-order matches are unavailable (see the sketch after this page's bullets).
    • No smoothing beyond backoff logic.
  • Key Features:
    • Text is cleaned: punctuation, numbers, and stopwords are removed.
    • Tokenization is performed using tidytext.
    • Backoff is fast and interpretable.
  • Despite its simplicity, the model achieves a perplexity of 35, indicating strong predictive performance on clean input.
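
  A rough sketch of the cleaning, tokenization, and backoff steps described above, assuming the corpus_text vector from Page 1's sketch; the dplyr/tidyr pipeline, the top-3 cutoff, and the omission of stopword removal are illustrative choices, not the submitted code:

      library(dplyr)
      library(tidyr)
      library(tidytext)
      library(stringr)

      # Clean: lowercase, strip numbers and punctuation (stopword removal, as described
      # above, could be added with tidytext::stop_words; omitted here for brevity).
      clean <- tibble(text = corpus_text) %>%
        mutate(text = str_to_lower(text),
               text = str_remove_all(text, "[0-9]+"),
               text = str_remove_all(text, "[[:punct:]]"))

      # Count n-grams with tidytext; count() sorts so the most frequent come first.
      count_ngrams <- function(df, n) {
        df %>%
          unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
          filter(!is.na(ngram)) %>%
          count(ngram, sort = TRUE)
      }
      tri <- count_ngrams(clean, 3) %>% separate(ngram, c("w1", "w2", "w3"), sep = " ")
      bi  <- count_ngrams(clean, 2) %>% separate(ngram, c("w1", "w2"), sep = " ")
      uni <- count_ngrams(clean, 1)

      # Backoff: try trigrams, then bigrams, then fall back to the most frequent words.
      predict_next <- function(phrase, k = 3) {
        toks <- str_split(str_to_lower(phrase), "\\s+")[[1]]
        len  <- length(toks)
        if (len >= 2) {
          hit <- tri %>% filter(w1 == toks[len - 1], w2 == toks[len]) %>% head(k)
          if (nrow(hit) > 0) return(list(tier = "trigram", words = hit$w3))
        }
        if (len >= 1) {
          hit <- bi %>% filter(w1 == toks[len]) %>% head(k)
          if (nrow(hit) > 0) return(list(tier = "bigram", words = hit$w2))
        }
        list(tier = "unigram", words = head(uni$ngram, k))
      }

      predict_next("The sun is")   # returns the matched tier plus up to k candidate words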

Page 3: App Overview

  • The Shiny app takes a user-entered phrase and returns the predicted next word(s); a minimal UI sketch follows the instructions below.
  • Features:
    • Real-time predictions with a simple UI.
    • Displays the matched n-gram tier (tri-, bi-, or unigram).
  • Instructions:
    1. Type in a phrase like “The sun is”.
    2. Press “Predict” — the app suggests next word(s).
    3. You can repeat and build phrases interactively.
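
  A minimal Shiny skeleton consistent with this description; predict_next() is the hypothetical backoff helper sketched on Page 2, and the widget labels are illustrative:

      library(shiny)

      ui <- fluidPage(
        titlePanel("Next-Word Prediction"),
        textInput("phrase", "Enter a phrase:", value = "The sun is"),
        actionButton("go", "Predict"),
        verbatimTextOutput("result")
      )

      server <- function(input, output) {
        # Recompute only when the Predict button is pressed.
        prediction <- eventReactive(input$go, {
          predict_next(input$phrase)
        })
        output$result <- renderText({
          p <- prediction()
          paste0("Matched tier: ", p$tier, "\n",
                 "Suggested word(s): ", paste(p$words, collapse = ", "))
        })
      }

      shinyApp(ui, server)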

Page 4: Design Tradeoffs & Limitations

  • Runtime tradeoff: Only half the corpus used for faster loading.
  • Model simplicity: No deep learning; just n-grams and backoff.
  • Introduced noise: some random n-grams were added to test generalization; this slightly hindered top-1 accuracy but improved coverage.
  • Exclusion of Twitter: Intentional, as its informal language hurt precision.
  • Despite these tradeoffs:
    • The app still achieves high accuracy.
    • The lightweight, low-perplexity design is suitable for mobile or embedded use (a perplexity sketch follows this list).
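
  For reference, perplexity is the exponentiated average negative log-probability assigned to held-out words, so lower is better. A sketch of how the figure of 35 reported on Page 2 might be checked, assuming a hypothetical next_word_prob(context, word) helper and a small probability floor (needed because plain backoff without smoothing can assign zero probability):

      # Perplexity over held-out (context, next word) pairs.
      perplexity <- function(contexts, words, floor = 1e-6) {
        probs <- mapply(function(ctx, w) max(next_word_prob(ctx, w), floor),
                        contexts, words)
        exp(-mean(log(probs)))
      }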

Page 5: Reflections and Evaluation

  • The app is simple, fast, and robust.
  • The algorithm, though basic, delivers high predictive power.
  • User experience is smooth — clean interface, responsive predictions.
  • Novelty:
    • Purposefully excluded noisy data and added adversarial n-grams.
    • Achieved balance between speed and accuracy using classic methods.

I would definitely hire this person to build scalable, reliable NLP prototypes at my data science startup.