Overview

This report documents the exploratory analysis of the SwiftKey English text corpus, a collection of blog posts, news articles, and tweets provided for the Johns Hopkins Data Science Capstone. The analysis covers basic corpus statistics, word and n-gram frequency distributions, and vocabulary coverage. It closes with a brief outline of the planned prediction algorithm and Shiny application.


1. Corpus Summary

The corpus consists of three English-language text files covering distinct registers of writing: long-form blogs, formal news, and short-form social media (Twitter).

File                 Size (MB)      Lines        Words
en_US.blogs.txt          200.4    899,288   37,334,131
en_US.news.txt           196.3  1,010,242   34,372,531
en_US.twitter.txt        159.4  2,360,148   30,373,583
Total                    556.1  4,269,678  102,080,245

Twitter has the most lines by far, nearly 2.4 million, but the fewest words per line (about 13 on average), reflecting the short-form nature of tweets. Blogs have the fewest entries but the longest average length, at roughly 42 words per line. News sits in between on both measures, at about 34 words per line.
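The words-per-line comparison follows directly from the table. The short Python sketch below recomputes it from the reported line and word counts; Python is used here purely for illustration, and the names `corpus` and `words_per_line` are hypothetical rather than taken from the project code.

```python
# Figures copied from the corpus summary table: (lines, words) per file.
corpus = {
    "en_US.blogs.txt": (899_288, 37_334_131),
    "en_US.news.txt": (1_010_242, 34_372_531),
    "en_US.twitter.txt": (2_360_148, 30_373_583),
}


def words_per_line(lines: int, words: int) -> float:
    """Average number of words per line for one file."""
    return words / lines


averages = {
    name: round(words_per_line(*counts), 1) for name, counts in corpus.items()
}

for name, avg in averages.items():
    print(f"{name}: {avg} words per line")
```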


2. Frequency Analysis

For this analysis, a 0.5% random sample of each file was used to keep rendering time manageable. The text was lowercased and stripped of numbers, punctuation, and URLs before tokenization.
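A minimal sketch of the sampling and cleaning steps described above, written in Python for illustration. The function names, the fixed seed, and the exact regular expressions are assumptions, as is the choice to keep apostrophes inside words such as “it's”; the report's actual preprocessing may differ in these details.

```python
import random
import re

# Assumed patterns: anything starting with http(s):// or www. counts as a URL.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Everything except ASCII letters, apostrophes, and whitespace is dropped
# (a simplification: accented letters are removed too).
NON_LETTER_RE = re.compile(r"[^a-z'\s]")


def sample_lines(lines, fraction=0.005, seed=42):
    """Keep each line independently with probability `fraction` (0.5% here)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean_and_tokenize(text):
    """Lowercase, strip URLs, numbers, and punctuation, then split on whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # Trim stray apostrophes left at word edges and drop empty tokens.
    return [tok.strip("'") for tok in text.split() if tok.strip("'")]


print(clean_and_tokenize("Check https://example.com NOW, it's 100% free!"))
```

Sampling line by line with a fixed seed keeps the sample reproducible without loading more than one line's worth of decisions at a time.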

Unigrams (Single Words)

The most frequent words are almost entirely function words: “the”, “to”, “and”, “a”, and “of”. This is expected and consistent with Zipf’s Law: a small number of words account for a disproportionately large share of all word usage.

Bigrams (Word Pairs)

Two-word combinations like “of the”, “in the”, and “to the” dominate. These high-frequency pairs form the foundation of next-word predictions when a single preceding word is available.

Trigrams (Word Triplets)

Three-word sequences provide richer context. Phrases like “one of the”, “a lot of”, and “as well as” appear most often. At prediction time, sequences like these will be matched directly against the model’s trigram and quadgram (4-gram) lookup tables.
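The unigram, bigram, and trigram tables above can all be produced by the same counting routine. The following Python sketch shows one way to do it, assuming the text has already been tokenized into lists of words; `ngrams` and `count_ngrams` are hypothetical helper names, not the project's own functions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams across many token lists without crossing line boundaries."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts


sentences = [
    "one of the best and one of the worst".split(),
    "a lot of the time".split(),
]
for n in (1, 2, 3):
    print(n, count_ngrams(sentences, n).most_common(3))
```

Counting per line rather than over the concatenated text avoids creating spurious n-grams that span two unrelated tweets or articles.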


3. Vocabulary & Coverage

As noted above, the word frequency distribution is heavily skewed. Under Zipf’s Law, a word’s frequency is roughly inversely proportional to its frequency rank, so a small number of words account for the majority of all usage, while a long tail of rare words appears very infrequently.
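Coverage can be quantified as the number of distinct words, taken in order of decreasing frequency, needed to account for a given share of all word occurrences. The Python sketch below shows the calculation on toy counts; the function name and the example figures are illustrative assumptions, not results from the corpus.

```python
from collections import Counter


def words_needed_for_coverage(counts, target):
    """Smallest number of top-ranked words whose counts reach `target` of all tokens."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)


# Toy counts for illustration only; real figures come from the unigram table.
toy_counts = Counter({"the": 50, "to": 30, "and": 10, "cat": 5, "dog": 5})
for target in (0.5, 0.85, 1.0):
    needed = words_needed_for_coverage(toy_counts, target)
    print(f"{target:.0%} coverage needs {needed} word(s)")
```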

This has a practical implication: n-grams that appear only once (“singletons”) can be removed with little loss of predictive accuracy while dramatically reducing model size. In our build on a 5% sample of the corpus, pruning singletons reduced the model from approximately 70 MB to 15.7 MB, a reduction of about 78%.
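Singleton pruning amounts to a single filter over each frequency table. The Python sketch below illustrates it on a toy bigram table and rechecks the reported size reduction; `prune_singletons` and its `min_count` parameter are hypothetical names, and the toy counts are not taken from the corpus.

```python
from collections import Counter


def prune_singletons(counts, min_count=2):
    """Keep only n-grams seen at least `min_count` times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


# Toy bigram table for illustration only.
bigrams = Counter({("of", "the"): 3, ("in", "the"): 2, ("purple", "zebra"): 1})
pruned = prune_singletons(bigrams)

# Size reduction reported in the text: roughly 70 MB down to 15.7 MB.
reduction_pct = (1 - 15.7 / 70) * 100

print(f"{len(bigrams)} -> {len(pruned)} n-grams kept")
print(f"reported size reduction: {reduction_pct:.0f}%")
```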


4. Plan: Algorithm & Application

Prediction algorithm: Stupid Backoff (Brants et al., 2007)

The model will use pre-built frequency tables for 1-grams through 4-grams. At query time, given the user’s input phrase:

  1. Extract the last 3 words and look up matching 4-grams
  2. If no match, back off to the last 2 words (3-gram table), penalising the score by λ = 0.4
  3. If still no match, back off to the last word (2-gram table), applying λ again (cumulative factor 0.4² = 0.16)
  4. Final fallback: return the most frequent unigrams

Because the unigram fallback always succeeds, a prediction is always returned. Each step is a hash-table lookup with O(1) average cost, so lookup time is expected to be sub-millisecond.
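The lookup steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the planned implementation: the index layout (context tuple mapped to a counter of next words), the names `build_index` and `predict`, and the toy corpus are all hypothetical. Scores are relative frequencies within the matched context, multiplied by λ once per backoff step, in the spirit of Brants et al. (2007).

```python
from collections import Counter, defaultdict

LAMBDA = 0.4  # backoff penalty named in the plan


def build_index(sentences, max_n=4):
    """Map each context tuple (0 to 3 words) to a Counter of following words."""
    index = defaultdict(Counter)
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                index[gram[:-1]][gram[-1]] += 1
    return index


def predict(tokens, index, k=3, max_context=3):
    """Return up to k (word, score) pairs from the longest matching context."""
    start = min(max_context, len(tokens))
    for size in range(start, -1, -1):
        # size == 0 gives the empty context, i.e. the unigram fallback.
        context = tuple(tokens[len(tokens) - size:])
        followers = index.get(context)
        if followers:
            penalty = LAMBDA ** (start - size)  # one factor of lambda per backoff
            total = sum(followers.values())
            return [
                (word, penalty * count / total)
                for word, count in followers.most_common(k)
            ]
    return []


# Toy corpus for illustration only.
toy = [s.split() for s in [
    "one of the best", "one of the best", "one of the worst", "a lot of the time",
]]
index = build_index(toy)

print(predict("it was one of the".split(), index))  # 4-gram match, no penalty
print(predict("best of the".split(), index))        # backs off once to 3-grams
print(predict("unseen words here".split(), index))  # falls through to unigrams
```

Storing followers under their context means each backoff level costs one dictionary lookup, which is what keeps query time independent of table size on average.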

Application

The prediction engine will be deployed as a Shiny web application with live typing suggestions (no submit button), top-3 predictions ranked by confidence score, and keyboard shortcuts (1/2/3) to append words. It is designed to feel like a real mobile keyboard autocomplete bar.