The objective of this capstone is to build a next-word prediction
model similar to the SwiftKey keyboard.
This report summarizes the work completed for Tasks 1–3 and is written
to be understandable by a non-technical manager.
The completed tasks include:

- Downloading the data and drawing a reproducible sample for analysis
- Exploratory analysis of word frequencies and an estimate of foreign-language content
- Generating n-grams and building a basic n-gram language model

The dataset used is the English HC Corpora dataset provided by Coursera. It contains three large text files:

- `en_US.blogs.txt` — text from blog posts
- `en_US.news.txt` — text from news articles
- `en_US.twitter.txt` — text from Twitter messages
Due to the large size of the dataset, a reproducible 0.5% random sample was used to make analysis computationally feasible.
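As an illustration of how such a sample can be drawn reproducibly, the sketch below keeps roughly 0.5% of the lines from each file using a fixed random seed. The seed value, output file names, and error handling are assumptions made for this example, not the exact settings used in the analysis.

```python
import random

# The three HC Corpora files (paths assumed relative to the working directory).
FILES = ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]

SAMPLE_RATE = 0.005  # keep roughly 0.5% of lines
random.seed(42)      # fixed seed so the same sample is drawn on every run (assumed value)

for path in FILES:
    sample_path = path.replace(".txt", ".sample.txt")
    with open(path, encoding="utf-8", errors="ignore") as src, \
         open(sample_path, "w", encoding="utf-8") as dst:
        for line in src:
            # each line is kept independently with probability SAMPLE_RATE
            if random.random() < SAMPLE_RATE:
                dst.write(line)
```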
The five most frequent tokens in each sampled file are shown below.

**Blogs**

| Rank | Token | Count |
|---|---|---|
| 1 | the | 9278 |
| 2 | to | 5347 |
| 3 | and | 5263 |
| 4 | a | 4371 |
| 5 | of | 4326 |
**News**

| Rank | Token | Count |
|---|---|---|
| 1 | the | 9778 |
| 2 | to | 4600 |
| 3 | a | 4490 |
| 4 | and | 4390 |
| 5 | of | 3940 |
**Twitter**

| Rank | Token | Count |
|---|---|---|
| 1 | the | 4546 |
| 2 | to | 3902 |
| 3 | i | 3619 |
| 4 | a | 2990 |
| 5 | you | 2594 |
These results show a Zipf-like distribution where a small number of words account for a large portion of usage.
To estimate the presence of foreign-language text, the proportion of tokens containing non-ASCII characters was calculated.
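A minimal sketch of this check, assuming simple whitespace tokenization, is shown below; the helper name `foreign_word_ratio` is illustrative, but the rule matches the one described above: a token counts as "foreign" if it contains at least one non-ASCII character.

```python
def foreign_word_ratio(tokens):
    """Fraction of tokens containing at least one non-ASCII character."""
    if not tokens:
        return 0.0
    non_ascii = sum(1 for tok in tokens if any(ord(ch) > 127 for ch in tok))
    return non_ascii / len(tokens)

# Tiny usage example with a hand-made token list.
sample_tokens = ["the", "café", "to", "naïve", "and"]
print(f"{foreign_word_ratio(sample_tokens):.2%}")  # -> 40.00%
```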
| Dataset | Estimated Foreign Word Ratio |
|---|---|
| Blogs | 1.10% |
| News | 1.77% |
| Twitter | 3.18% |
The Twitter sample contains noisier and more mixed-language text than the blog and news samples.
To capture relationships between adjacent words, n-grams (contiguous sequences of words, such as word pairs and triples) were generated from the sample.
These recurring word patterns are what the model uses to predict a likely next word; a small sketch of this step follows.
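The sketch below shows one common way to generate and count n-grams. It is a minimal Python example under assumed whitespace tokenization, not the exact code used for the analysis.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-word sequences from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(ngrams(tokens, 2))   # word pairs
trigram_counts = Counter(ngrams(tokens, 3))  # word triples
print(bigram_counts.most_common(2))
# [(('the', 'cat'), 1), (('cat', 'sat'), 1)]
```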
A compact n-gram language model was constructed from the cleaned and tokenized data. The model stores each observed word context together with the words that follow it and their frequencies, so the most frequent continuation can be returned as the predicted next word.
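As a rough illustration of how such a model can be stored and queried, the sketch below maps each one-word context to a frequency count of the words that follow it. The function names and the bigram setting are assumptions for the example, not the final model's design.

```python
from collections import Counter, defaultdict

def build_model(sentences, n=2):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            model[context][next_word] += 1
    return model

def predict(model, context, k=3):
    """Return up to k of the most frequent words seen after the given context."""
    return [word for word, _ in model[tuple(context)].most_common(k)]

model = build_model(["the cat sat on the mat", "the cat ran"], n=2)
print(predict(model, ["the"]))  # ['cat', 'mat']
```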
This milestone confirms readiness to proceed toward building the final predictive text application.