Swiftkey Next-Word Prediction App Pitch

Emmanuel Benyeogor

2026-02-17

SNxWoP Pitch

Goal: Predict the next word from an input phrase, similar to the suggestion bar on smartphone keyboards.

Deliverables:

- Deployed Shiny app (next-word prediction)
- Lightweight, responsive model suitable for deployment
- Reproducible workflow and documentation

Problem and Motivation

Typing on mobile devices is slow and error-prone.

A smart keyboard improves mobile typing by predicting the next word from the preceding context, for example:

Input: “I went to the”
Predictions: gym, store, restaurant

Objective: Build a predictive text model and deploy it as a Shiny application, as the Capstone project of the Johns Hopkins University Data Science Specialization on Coursera.

Data and Key EDA Findings

Training data (HC Corpora, English):

- Blogs, News, Twitter (millions of lines)

Key findings:

- Sources differ substantially: Twitter lines are short, while blog lines can be extremely long.
- This motivates (see the sketch after this list):
  - Sampling for training efficiency
  - Consistent cleaning and tokenization
  - Compact model storage for fast runtime
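
A minimal sketch of the sampling and cleaning step in base R; the file name and the 5% sampling rate are illustrative assumptions, not the project's actual settings:

```r
set.seed(42)

# Read one raw corpus file and keep a small random sample of lines
# ("en_US.twitter.txt" is an assumed local path to an HC Corpora file)
lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
sampled <- sample(lines, size = ceiling(0.05 * length(lines)))  # assumed 5% sample

# Basic cleaning: lower-case, keep letters/apostrophes, squeeze whitespace
clean <- tolower(sampled)
clean <- gsub("[^a-z' ]", " ", clean)
clean <- trimws(gsub("\\s+", " ", clean))
```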

Example quiz check: in the Twitter file, lines containing “love” outnumber lines containing “hate” by a ratio of ≈ 5.
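
That check reduces to two pattern counts; in base R it looks roughly like this (the exact matching rules used for the quiz may differ):

```r
# Ratio of lines containing "love" to lines containing "hate"
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
sum(grepl("love", twitter)) / sum(grepl("hate", twitter))
```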

Model Approach: N-grams with Backoff

Model: Frequency-based n-gram language model

- 2-grams, 3-grams, and 4-grams built from the cleaned text (a sketch follows)
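
One way such frequency tables can be built with data.table; the function and column names are illustrative, and a toy corpus stands in for the cleaned sample above:

```r
library(data.table)

# Toy cleaned lines; in practice, use the sampled and cleaned corpus
clean <- c("i went to the gym", "i went to the store",
           "thank you for the follow", "going to the movies tonight")

# Count all n-grams of a given order and split off context vs. next word
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
  }))
  dt <- data.table(gram = grams)[, .(count = .N), by = gram]
  dt[, `:=`(context = sub(" \\S+$", "", gram),   # first n-1 words
            word    = sub("^.* ", "", gram))]    # last word
  dt[order(-count)]
}

bigrams   <- count_ngrams(clean, 2)
trigrams  <- count_ngrams(clean, 3)
fourgrams <- count_ngrams(clean, 4)
```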

Backoff prediction strategy (a minimal sketch follows this list):

1. Use the last 3 words → search 4-grams
2. If not found → use the last 2 words → search 3-grams
3. If not found → use the last word → search 2-grams
4. If not found → fall back to a common default (e.g., “the”)
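
A sketch of that backoff lookup, assuming the bigrams/trigrams/fourgrams tables from the previous sketch; the deployed app's ranking and fallback may differ:

```r
predict_next <- function(phrase, k = 3) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  tables <- list(fourgrams, trigrams, bigrams)   # highest order first
  for (i in seq_along(tables)) {
    n_ctx <- 4 - i                               # 3, then 2, then 1 context words
    if (length(words) < n_ctx) next
    ctx <- paste(tail(words, n_ctx), collapse = " ")
    hits <- tables[[i]][context == ctx][order(-count)]
    if (nrow(hits) > 0) return(head(hits$word, k))
  }
  "the"                                          # common-word fallback
}

predict_next("I went to the")
```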

Pruning: Remove rare n-grams to reduce model size and improve latency.
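
Under the same table layout, pruning can be as simple as a frequency threshold (the cutoff of 2 is an assumption, to be tuned against accuracy):

```r
min_count <- 2                       # assumed cutoff
fourgrams <- fourgrams[count >= min_count]
trigrams  <- trigrams[count >= min_count]
bigrams   <- bigrams[count >= min_count]
```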

Product Demo and Results

Shiny app behavior (a minimal UI sketch follows this list):

- User inputs a phrase
- App returns the top 3 predicted next words
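
For illustration only, a skeletal Shiny app wired to the predict_next() sketch above; the deployed app's layout and widget names will differ:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Predictor"),
  textInput("phrase", "Type a phrase:", value = "I went to the"),
  tableOutput("preds")
)

server <- function(input, output) {
  output$preds <- renderTable({
    data.frame(prediction = predict_next(input$phrase, k = 3))
  })
}

shinyApp(ui, server)
```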

Example outputs from the model:

- “i love” → you, the, it
- “thank you for” → the, your, following
- “going to the” → movies, beach, gym

Links:

- Shiny app: https://chidemannie.shinyapps.io/Swiftkey_Next_Word_Predictor/
- GitHub repo: https://github.com/chidemannie/swiftkey-capstone
- RPubs EDA report: https://rpubs.com/chidemannie/1398186

Future improvements:

- Smoothing (e.g., Laplace or Katz; see the formula below)
- Profanity filtering
- Stronger tokenization rules
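
As one concrete direction (not the app's current scoring), add-one (Laplace) smoothing would replace raw relative frequencies with

$$
P_{\text{Laplace}}(w \mid h) = \frac{c(h, w) + 1}{c(h) + V}
$$

where $c(h, w)$ is the count of context $h$ followed by word $w$ and $V$ is the vocabulary size, so unseen continuations no longer score zero.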
