This report documents the exploratory analysis of the SwiftKey English text corpus, a collection of blog posts, news articles, and tweets provided for the Johns Hopkins Data Science Capstone. The analysis covers basic corpus statistics, word and n-gram frequency distributions, and vocabulary coverage. It closes with a brief outline of the planned prediction algorithm and Shiny application.
The corpus consists of three English-language text files covering distinct registers of writing: long-form blogs, formal news, and short-form social media (Twitter).
| File | Size (MB) | Lines | Words |
|---|---|---|---|
| en_US.blogs.txt | 200.4 | 899,288 | 37,334,131 |
| en_US.news.txt | 196.3 | 1,010,242 | 34,372,531 |
| en_US.twitter.txt | 159.4 | 2,360,148 | 30,373,583 |
| Total | 556.1 | 4,269,678 | 102,080,245 |
Twitter has the most lines by far, nearly 2.4 million, but the fewest words per line, reflecting the short-form nature of tweets. Blogs have far fewer entries but substantially longer average length. News sits in between.
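The per-line averages behind this comparison follow directly from the table. A minimal sketch of the arithmetic, written in Python for illustration (the dictionary layout and the function name `words_per_line` are assumptions, not part of the original analysis):

```python
# Line and word counts copied from the corpus summary table.
corpus = {
    "en_US.blogs.txt": {"lines": 899_288, "words": 37_334_131},
    "en_US.news.txt": {"lines": 1_010_242, "words": 34_372_531},
    "en_US.twitter.txt": {"lines": 2_360_148, "words": 30_373_583},
}


def words_per_line(stats):
    """Return the mean number of words per line for each file."""
    return {name: s["words"] / s["lines"] for name, s in stats.items()}


averages = words_per_line(corpus)
for name, avg in averages.items():
    print(f"{name}: {avg:.1f} words per line")
```

This works out to roughly 41.5 words per line for blogs, 34.0 for news, and 12.9 for Twitter.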
For this analysis, a 0.5% random sample of each file was used to keep rendering time manageable. The text was lowercased and stripped of numbers, punctuation, and URLs before tokenization.
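The sampling and cleaning steps might look roughly like the sketch below. It is written in Python for illustration and is not the report's own pipeline: the function names, the fixed seed, the per-line Bernoulli sampling, and the choice to keep ASCII apostrophes inside words are all assumptions.

```python
import random
import re

# Assumed patterns: anything starting with http(s):// or www. is treated as a URL,
# and every character other than a-z, apostrophe, or whitespace is dropped.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NON_LETTER_RE = re.compile(r"[^a-z'\s]")


def sample_lines(lines, fraction=0.005, seed=42):
    """Keep each line independently with probability `fraction` (0.5% by default)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean_and_tokenize(line):
    """Lowercase, remove URLs, numbers, and punctuation, then split on whitespace."""
    text = line.lower()
    text = URL_RE.sub(" ", text)         # strip URLs before punctuation removal
    text = NON_LETTER_RE.sub(" ", text)  # strip digits and punctuation
    tokens = (tok.strip("'") for tok in text.split())
    return [tok for tok in tokens if tok]


print(clean_and_tokenize("Check THIS out: https://t.co/abc123 it's 100% great!!"))
# ['check', 'this', 'out', "it's", 'great']
```

Removing URLs first matters: stripping punctuation beforehand would break a URL into word-like fragments that then survive as tokens.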
The most frequent words are almost entirely function words such as “the”, “to”, “and”, “a”, and “of”. This is expected and consistent with Zipf’s Law: a small number of words accounts for a disproportionately large share of all word usage.
Two-word combinations like “of the”, “in the”, and “to the” dominate. These high-frequency pairs form the foundation of next-word predictions when a single preceding word is available.
Three-word sequences provide richer context. Phrases like “one of the”, “a lot of”, and “as well as” appear most often. At prediction time, sequences like these will be matched directly against the model’s trigram and quadgram lookup tables.
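The unigram, bigram, and trigram frequencies discussed above can all be produced by one counting routine. The following is a minimal Python sketch under stated assumptions: input is a list of already-tokenized lines, n-grams do not cross line boundaries, and the names `ngrams` and `count_ngrams` are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples in a single tokenized line."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(token_lines, n):
    """Count n-grams across many tokenized lines, never spanning two lines."""
    counts = Counter()
    for tokens in token_lines:
        counts.update(ngrams(tokens, n))
    return counts


# Toy input, not taken from the corpus.
toy_lines = [
    ["one", "of", "the", "best"],
    ["one", "of", "the", "few"],
]
print(count_ngrams(toy_lines, 2).most_common(1))
print(count_ngrams(toy_lines, 3).most_common(1))
```

Calling `count_ngrams` with `n` set to 1, 2, 3, and 4 yields the four frequency tables that the planned model relies on.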
The word frequency distribution is heavily skewed, as Zipf’s Law predicts: a word’s frequency falls off roughly in inverse proportion to its frequency rank. A small number of words accounts for the majority of all usage, while a long tail of rare words appears very infrequently.
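One way to quantify this skew is to ask how many distinct words, taken in descending order of frequency, are needed to cover a given share of all word occurrences. The sketch below shows that calculation in Python on toy counts; the function name `words_needed_for_coverage` and the example numbers are assumptions, not results from the corpus.

```python
from collections import Counter


def words_needed_for_coverage(counts, target):
    """Smallest number of top-ranked words whose counts reach `target` of all tokens."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)


# Toy frequencies for illustration only.
toy_counts = Counter({"the": 50, "to": 30, "and": 10, "cat": 5, "zebra": 5})
print(words_needed_for_coverage(toy_counts, 0.50))  # 1
print(words_needed_for_coverage(toy_counts, 0.75))  # 2
```

On a Zipf-like distribution the rank returned for 50% coverage is typically a small fraction of the vocabulary, while the rank for 90% or more grows quickly as the long tail is reached.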
This has a practical implication: n-grams that appear only once (“singletons”) can be removed with little expected loss of predictive accuracy, and doing so dramatically reduces model size. In our full build on a 5% sample of the corpus, pruning singletons reduced the model from approximately 70 MB to 15.7 MB, a reduction of about 78%.
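Singleton pruning amounts to a single filter over each frequency table. A minimal Python sketch follows, assuming the table is a mapping from n-gram tuples to counts; the names `prune_singletons` and `size_reduction` are illustrative, and the last lines simply re-check the reported percentage.

```python
def prune_singletons(counts, min_count=2):
    """Drop every n-gram seen fewer than `min_count` times (singletons by default)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


def size_reduction(before_mb, after_mb):
    """Fractional size reduction between two model sizes."""
    return (before_mb - after_mb) / before_mb


# Toy table for illustration only.
toy = {("of", "the"): 5, ("in", "the"): 2, ("the", "zebra"): 1}
print(prune_singletons(toy))

# Sizes quoted in the text: approximately 70 MB before, 15.7 MB after.
print(f"{size_reduction(70.0, 15.7):.1%}")  # 77.6%
```

Because the 70 MB starting size is itself approximate, the computed 77.6% is consistent with the reported figure of about 78%.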
Prediction algorithm: Stupid Backoff (Brants et al., 2007)
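The model will use pre-built frequency tables for 1-grams through 4-grams. At query time, the last words of the user’s input phrase are looked up starting with the longest available context; when no match is found, the model backs off to progressively shorter contexts and finally to the most frequent unigrams.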
This guarantees a prediction is always returned. Each step is an average-case O(1) dictionary lookup, which keeps lookup time below a millisecond.
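The back-off scheme can be sketched as follows. This is an unoptimized Python illustration of Stupid Backoff scoring, not the planned implementation: the table layout (context tuple mapped to a counter of next words), the function names `build_tables` and `predict`, and the toy data are assumptions. The back-off factor of 0.4 is the value suggested by Brants et al. (2007).

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # back-off factor suggested by Brants et al. (2007)


def build_tables(token_lines, max_order=4):
    """Map each context of 0 to max_order-1 words to a Counter of following words."""
    tables = defaultdict(Counter)
    for tokens in token_lines:
        for i, word in enumerate(tokens):
            for k in range(max_order):  # k = number of context words
                if i - k < 0:
                    break
                tables[tuple(tokens[i - k:i])][word] += 1
    return tables


def predict(tables, phrase, top_n=3, max_order=4):
    """Score candidates with Stupid Backoff and return the top_n (word, score) pairs."""
    tokens = phrase.lower().split()
    scores = {}
    penalty = 1.0
    # Start with the longest usable context and shorten it one word at a time;
    # the empty context () holds the unigram counts and always matches.
    for k in range(min(max_order - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        followers = tables.get(context)
        if followers:
            total = sum(followers.values())
            for word, count in followers.items():
                # Keep the score from the highest order at which the word was seen.
                scores.setdefault(word, penalty * count / total)
        penalty *= ALPHA  # each back-off step multiplies the score by ALPHA
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]


# Toy training data for illustration only.
toy_lines = [
    ["one", "of", "the", "best"],
    ["one", "of", "the", "best"],
    ["one", "of", "the", "few"],
    ["all", "of", "the", "time"],
]
tables = build_tables(toy_lines)
print([word for word, _ in predict(tables, "one of the")])
# ['best', 'few', 'time']
```

In this sketch every candidate is scored at every level, which is simple but scans the whole unigram table on each query. A deployed version would more likely store only the top few continuations per context so that each level is a single keyed lookup.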
Application
The prediction engine will be deployed as a Shiny web application designed to feel like a real mobile keyboard autocomplete bar. It will offer live typing suggestions (no submit button), the top three predictions ranked by confidence score, and keyboard shortcuts (1/2/3) to append the chosen word.