SwiftKey NLP: Exploratory Data Analysis

Overview

This report summarises the exploratory analysis of the HC Corpora English dataset for the Johns Hopkins Data Science Capstone. The goal is to build a next-word prediction app powered by an N-gram language model. Three source files were analysed — blogs, news, and Twitter — all downloaded from the Coursera Capstone page.

1. The Data

The English corpus consists of three plain-text files. The table below shows the key statistics computed from each full file.

Table 1: en_US Corpus File Statistics
File	Size (MB)	Lines	Words	Avg chars / line	Longest line (chars)
en_US.blogs.txt	201	899,288	37,272,578	233.9	40,836
en_US.news.txt	197	1,010,242	34,309,642	202.3	11,385
en_US.twitter.txt	160	2,360,148	30,341,028	70.7	214

Three quick takeaways:

Twitter has the most lines (2.36 million) but the shortest entries, averaging 71 characters per line. Tweets are nominally capped at 140 characters, yet the longest line in the file measures 214 characters, so the cap does not hold strictly in this file.
Blogs have the fewest entries but the longest average length (234 chars/line) and the longest single entry: one blog entry runs 40,836 characters.
News sits between the two: formal prose, moderately long entries.
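
The per-file figures in Table 1 could be computed along the following lines. This is a minimal Python sketch, not the code used for the report: the function name `file_stats` is hypothetical, and counting words by splitting on whitespace is an assumption, since the report does not say how words were tokenised.

```python
import io

def file_stats(lines):
    """Return line count, word count, mean characters per line, and longest line."""
    n_lines = n_words = n_chars = longest = 0
    for line in lines:
        text = line.rstrip("\n")
        n_lines += 1
        n_words += len(text.split())      # assumption: whitespace-delimited words
        n_chars += len(text)
        longest = max(longest, len(text))
    avg = n_chars / n_lines if n_lines else 0.0
    return {"lines": n_lines, "words": n_words,
            "avg_chars": round(avg, 1), "longest": longest}

# Tiny in-memory stand-in for a corpus file
demo = io.StringIO("hello world\nthe quick brown fox\n")
stats = file_stats(demo)
```

For a real file the same function can be fed an open file handle, for example `open("en_US.blogs.txt", encoding="utf-8")`. Note that this counts characters, not bytes, which can change the longest-line figure for non-ASCII text.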

2. Word & Line-Length Distributions

A 5% random sample (~213,000 lines) was used for the plots below.
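
One way to draw such a sample is to keep each line independently with probability 0.05. This is a sketch under assumptions: the report does not state the sampling method or the seed, and `sample_lines` is a hypothetical helper. Sampling this way yields approximately, not exactly, 5% of the lines.

```python
import random

def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line with probability `fraction`; a fixed seed makes it reproducible."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Synthetic stand-in for the corpus lines
population = [f"line {i}" for i in range(100_000)]
sample = sample_lines(population)
```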

2a. Line Length Distribution (histogram)

Twitter’s histogram is tightly bounded by the 140-character limit; blog and news entries follow a right-skewed, log-normal shape typical of natural writing.

2b. Word-Frequency Distribution (Zipf’s Law)
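
Zipf’s Law says that the frequency of the r-th most common word is roughly proportional to 1/r, so a plot of log frequency against log rank is close to a straight line with slope near -1. The sketch below shows how such a check could be run in Python; the function names are hypothetical and the tokens are synthetic, built so that the counts follow 1/rank exactly.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic tokens: the word of rank r appears 1200 // r times
tokens = [f"w{r}" for r in range(1, 7) for _ in range(1200 // r)]
freqs = rank_frequency(tokens)
slope = loglog_slope(freqs)
```

On real text the slope is only approximately -1, and the fit is usually worst at the very top and very bottom of the ranking.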

3. Most Frequent Words

Stop words (“the”, “and”, “to”) dominate in every source. “I” ranks 4th in blogs but 2nd on Twitter, reflecting Twitter’s first-person conversational style.

4. Vocabulary Coverage

How many unique words are needed to account for most of the text?

Table 2: Words Needed to Cover X% of All Tokens
Coverage	Unique words needed	Notes
50%	127	Core function words only
90%	6,694	Good practical vocabulary
95%	15,387	Covers almost all everyday text
99%	~78,000	Includes rare/specialised terms

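This is Zipf’s Law at work: just 127 words cover half of all tokens. Practically, a prediction vocabulary of ~10,000 words falls between the 90% threshold (6,694 words) and the 95% threshold (15,387 words) in Table 2, so it would cover a little over 90% of everyday text; the remaining words can be treated as unknown.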
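
The coverage figures in Table 2 come from a cumulative count over words sorted by frequency. A minimal Python sketch of that calculation follows; `words_for_coverage` is a hypothetical name and the token list is a toy example, not corpus data.

```python
from collections import Counter

def words_for_coverage(tokens, target):
    """Smallest number of most-frequent word types covering `target` share of tokens."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    covered = 0
    for n_types, count in enumerate(counts, start=1):
        covered += count
        if covered >= target * total:
            return n_types
    return len(counts)

# Toy example: 10 tokens, 4 distinct words
tokens = ["the"] * 5 + ["of"] * 3 + ["cat", "dog"]
half = words_for_coverage(tokens, 0.50)
most = words_for_coverage(tokens, 0.95)
```

Here one word ("the") already covers half of the tokens, while 95% coverage needs all four words, the same pattern as in Table 2 at a much smaller scale.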

5. N-gram Snapshots

N-grams are sequences of consecutive words and are the building blocks of the prediction model. The table below shows the most common 2- and 3-word phrases.

Table 3: Most Frequent Bigrams and Trigrams (5% sample, blogs)
Rank	Top Bigrams	Top Trigrams
1	of the	one of the
2	in the	a lot of
3	to the	be able to
4	on the	i want to
5	to be	as well as
6	i have	the end of
7	it was	a couple of
8	a lot	going to be

These recurring phrase patterns suggest that an N-gram model will find strong signals in this corpus.
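
Counts like those in Table 3 can be produced by sliding a window of n words over each line. The following is a minimal Python sketch, assuming lowercasing and whitespace tokenisation (the report does not describe its cleaning steps); the function names and sample lines are illustrative only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over each run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def top_ngrams(lines, n, k):
    """Return the k most frequent n-grams across all lines."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        counts.update(ngrams(tokens, n))   # n-grams never span two lines
    return counts.most_common(k)

lines = ["One of the best", "one of the worst", "a lot of the time"]
top_bi = top_ngrams(lines, 2, 1)
top_tri = top_ngrams(lines, 3, 1)
```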

6. Algorithm & App Plan

Prediction algorithm: A Stupid Back-off N-gram model (Brants et al., 2007) trained on a 30% sample of the corpus:

Take the last 3 words the user typed
Look up matching 4-word phrases (quadgrams) in a frequency table
If there is no match, back off to 3-word phrases, then 2-word phrases, then single words, multiplying the score by 0.4 at each back-off step
Return the top 5 candidates ranked by score
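
The steps above can be sketched in Python as follows. This is an illustration under assumptions, not the planned implementation: `train` and `predict` are hypothetical names, tokenisation is a plain lowercase whitespace split, and each candidate word is scored at the longest context in which it was seen, as in Brants et al. (2007). The linear scan over each table is for clarity only; a real model would index n-grams by context.

```python
from collections import Counter

ALPHA = 0.4      # back-off penalty from the text
MAX_ORDER = 4    # quadgrams down to unigrams

def train(lines, max_order=MAX_ORDER):
    """Count all n-grams of order 1..max_order; counts[n] maps n-word tuples to counts."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for line in lines:
        tokens = line.lower().split()
        for n in range(1, max_order + 1):
            counts[n].update(zip(*(tokens[i:] for i in range(n))))
    return counts

def predict(counts, text, k=5, alpha=ALPHA):
    """Return up to k (word, score) pairs ranked by Stupid Back-off score."""
    max_order = max(counts)
    words = tuple(text.lower().split())
    history = words[-(max_order - 1):] if max_order > 1 else ()
    total = sum(counts[1].values())
    scores = {}
    penalty = 1.0
    # Try the longest context first, then drop its leftmost word each round.
    for ctx_len in range(len(history), -1, -1):
        context = history[len(history) - ctx_len:]
        denom = counts[ctx_len][context] if ctx_len else total
        if denom:
            for gram, count in counts[ctx_len + 1].items():
                word = gram[-1]
                if gram[:-1] == context and word not in scores:
                    scores[word] = penalty * count / denom
        penalty *= alpha   # each back-off step multiplies the score by 0.4
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

corpus = ["i want to go home", "i want to be free",
          "i want to go out", "we want to go home"]
model = train(corpus)
suggestions = predict(model, "I want to")
```

In this toy corpus "go" ranks first for "I want to" because it follows that context in two of its three occurrences. For an unseen context such as "they want to", the quadgram lookup fails and the same words are scored from trigrams with the 0.4 penalty applied once.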

Why this approach? It is fast (< 5 ms per prediction), memory-efficient (the model fits in < 300 MB RAM), and straightforward to deploy.

Shiny App features:

Text input box — predictions update as you type
Top-5 suggested words shown as clickable buttons
Clicking a word appends it and re-predicts
Word frequency explorer tab

Summary

Item	Finding
Corpus size	558 MB, ~102 million words across 3 files
Largest source	Twitter by line count (2.36M lines); blogs by file size (201 MB) and word count (37.3M)
Key insight	127 words = 50% coverage; 6,694 = 90% coverage
Profanity	Present in the data; will be filtered before training
Algorithm	Stupid Back-off over unigram–quadgram frequency tables

Data source: HC Corpora (en_US), downloaded from the Coursera Capstone page.