NextWord: Exploratory Data Analysis

Executive Summary
Overview
What We Found: Data Insights
What This Means: Vocabulary & Coverage
What We’ll Build Next
Key Takeaways

Executive Summary

This report analyses English-language text to build a word-prediction tool, much like the suggestions your phone shows as you type. We examined three sources of real writing (blog posts, news articles, and tweets), about 102 million words in total. The key finding is that language is dominated by a small set of common words, while most words are rare. Because of this, we can strip out word combinations that appear only once and shrink the underlying model by 78% with no meaningful loss of accuracy, making the tool faster and cheaper to run. Next, we will build a prediction engine that suggests the most likely next word, delivered through a simple keyboard-style autocomplete bar.

Overview

This report documents the exploratory analysis of the SwiftKey English text corpus, a collection of blog posts, news articles, and tweets provided for the Johns Hopkins Data Science Capstone. It covers the size of the data, which words and phrases appear most often, how much of the language a small vocabulary can cover, and a plain-English outline of what we will build next.

What We Found: Data Insights

The data comes from three English text files representing three styles of writing: long-form blogs, formal news, and short social-media posts (Twitter).

File	Size (MB)	Lines	Words
en_US.blogs.txt	200.4	899,288	37,334,131
en_US.news.txt	196.3	1,010,242	34,372,531
en_US.twitter.txt	159.4	2,360,148	30,373,583
Total	556.1	4,269,678	102,080,245

Twitter has by far the most lines, nearly 2.4 million, but the fewest words per line (about 13 on average), reflecting how short tweets are. Blogs have the fewest entries but the longest ones, at about 42 words per line; news sits in between at about 34. Why this matters: covering all three styles helps the tool predict well whether someone is writing casually or formally.
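The words-per-line comparison can be checked directly from the table. The short sketch below only divides the word counts by the line counts reported above; the dictionary layout and function name are illustrative choices, not part of the original analysis.

```python
# Line and word counts copied from the summary table above.
corpus = {
    "en_US.blogs.txt": {"lines": 899_288, "words": 37_334_131},
    "en_US.news.txt": {"lines": 1_010_242, "words": 34_372_531},
    "en_US.twitter.txt": {"lines": 2_360_148, "words": 30_373_583},
}


def words_per_line(stats):
    """Average number of words per line for one file."""
    return stats["words"] / stats["lines"]


# Round to one decimal place for display.
averages = {name: round(words_per_line(s), 1) for name, s in corpus.items()}

for name, avg in averages.items():
    print(f"{name}: {avg} words per line")
```

Blogs come out longest and tweets shortest, with news in between, which matches the description above.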

For this analysis, a 0.5% random sample of each file was used to keep rendering time manageable. The text was lowercased and stripped of numbers, punctuation, and URLs before being split into words.
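A minimal sketch of the sampling and cleaning steps described above, written in Python for illustration. The function names, the fixed random seed, the URL pattern, and the decision to keep apostrophes inside words (so that "it's" stays one token) are all assumptions; the report does not specify these details.

```python
import random
import re

# Assumed URL pattern: anything starting with http://, https://, or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Everything except lowercase ASCII letters, apostrophes, and whitespace.
NON_LETTER_RE = re.compile(r"[^a-z'\s]")


def sample_lines(lines, fraction=0.005, seed=42):
    """Keep each line independently with probability `fraction` (0.5% by default)."""
    rng = random.Random(seed)  # fixed seed (an assumption) makes the sample repeatable
    return [line for line in lines if rng.random() < fraction]


def tokenize(line):
    """Lowercase a line, drop URLs, numbers, and punctuation, then split into words."""
    text = line.lower().replace("\u2019", "'")  # normalise curly apostrophes
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # Strip stray apostrophes left at word edges and drop empty tokens.
    return [tok.strip("'") for tok in text.split() if tok.strip("'")]


print(tokenize("Visit https://example.com NOW, it's 100% free!"))
```

URLs are removed before punctuation so that their fragments do not survive as stray tokens.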

Most common single words

The chart below shows the 15 most common single words. The longer the bar, the more often that word appears. The top words are almost all small connecting words, such as “the”, “to”, “and”, “a”, and “of”. Why this matters: these words are so common that predicting them correctly is essential for the tool to feel accurate.

Most common word pairs

Two-word combinations, pairs of words that often appear together, are led by “of the”, “in the”, and “to the”. Why this matters: these pairs let the tool guess the next word once it has seen just one word of context.

Most common three-word phrases

Three-word sequences give richer context. Phrases like “one of the”, “a lot of”, and “as well as” appear most often. Why this matters: longer phrases let the tool make smarter, more specific predictions.
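The counts behind these word, pair, and phrase rankings can be sketched as follows. This is an illustrative Python version, not the code used for the report; the helper names and the two sample sentences are made up, and counting each line separately (so that phrases never span two lines) is an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive words in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(token_lines, n):
    """Count n-word sequences across many tokenized lines."""
    counts = Counter()
    for tokens in token_lines:
        counts.update(ngrams(tokens, n))
    return counts


# Two made-up lines, already lowercased and split into words.
lines = [
    "one of the best days of the year".split(),
    "one of the reasons".split(),
]

bigrams = count_ngrams(lines, 2)
trigrams = count_ngrams(lines, 3)

print(bigrams.most_common(1))   # most frequent word pair
print(trigrams.most_common(1))  # most frequent three-word phrase
```

Even in this tiny example the most frequent pair is “of the” and the most frequent phrase is “one of the”, the same kind of connecting-word sequences that lead the real rankings.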

What This Means: Vocabulary & Coverage

Most of the text is made up of a small set of common words, like “the” and “and”, while thousands of unusual words appear very rarely. This pattern is so common it has a name: Zipf’s Law, under which a word’s frequency is roughly inversely proportional to its rank in the frequency list, so a small number of common words account for most of all language use.
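One way to put a number on this pattern is to ask how many distinct words are needed to cover a given share of all the words in the text. The sketch below is an illustration on a made-up toy word list; the function name and the data are assumptions, and no coverage figures for the real corpus are implied.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target):
    """Smallest number of distinct words, taken from most to least frequent,
    whose combined count reaches `target` (a fraction between 0 and 1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy data: 100 tokens, 8 distinct words, heavily skewed toward "the" and "and".
tokens = (
    "the " * 50 + "and " * 30 + "cat " * 10
    + "zebra quokka axolotl okapi ibis " * 2
).split()

for target in (0.5, 0.9, 1.0):
    needed = words_needed_for_coverage(tokens, target)
    print(f"{target:.0%} coverage needs {needed} of {len(set(tokens))} distinct words")
```

In the toy data a single word covers half the text, while reaching full coverage takes every rare word, which is the shape Zipf’s Law describes.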

This has a practical payoff. Word combinations that appear only once (“singletons”) can be removed without any meaningful loss of prediction accuracy, while shrinking the model dramatically. In our full corpus build, removing these one-off combinations cut the model from approximately 70 MB to 15.7 MB, a 78% reduction.
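The pruning step and the size arithmetic can be sketched as below. The function names and the phrase counts are made up for illustration; only the 70 MB and 15.7 MB figures come from the report, and 70 MB is itself approximate.

```python
from collections import Counter


def prune_singletons(counts, min_count=2):
    """Keep only word combinations seen at least `min_count` times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


def percent_reduction(before, after):
    """Size reduction as a percentage of the original size."""
    return 100 * (before - after) / before


# Made-up three-word phrase counts; the last two are singletons.
trigram_counts = Counter({
    ("one", "of", "the"): 12,
    ("a", "lot", "of"): 9,
    ("as", "well", "as"): 7,
    ("purple", "quokka", "parade"): 1,
    ("of", "the", "zeppelin"): 1,
})

pruned = prune_singletons(trigram_counts)
print(len(trigram_counts), "->", len(pruned), "phrases kept")

# Model sizes reported above, in MB.
print(round(percent_reduction(70.0, 15.7), 1), "% smaller")
```

With the reported sizes the reduction works out to about 77.6%, which rounds to the 78% quoted above.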

In practical terms, this means our model can become 78% smaller with no meaningful loss of accuracy, making it faster and cheaper to run.

What We’ll Build Next

Our prediction tool will suggest the next word based on patterns it finds in the training data, always falling back to the most common word if no pattern matches. In plain terms: it looks for words that usually follow the last few words you’ve typed; if it finds no match, it backs up and looks at fewer words, so a suggestion is always available.

The tool will appear as a mobile-style keyboard autocomplete bar that suggests the next word as you type, with live suggestions (no submit button), the top three guesses ranked by confidence, and number-key shortcuts (1/2/3) to accept a word.

Technical appendix: prediction algorithm (Stupid Backoff)

The model uses pre-built frequency tables for 1-grams through 4-grams. “Stupid Backoff” (Brants et al., 2007) is a prediction method that looks for matching word patterns and, if none exist, simplifies the search step by step. At query time, given the user’s input phrase:

1. Extract the last 3 words and look up matching 4-grams.
2. If no match, back off to the last 2 words (3-gram table), penalising the score by λ = 0.4.
3. If still no match, back off to the last word (2-gram table); in Stupid Backoff the λ penalty is applied again at each further backoff step.
4. Final fallback: return the most frequent single words.

This guarantees a prediction is always returned, with sub-millisecond lookup time thanks to dictionary lookups that are O(1) on average. Note: “n-grams” simply means word sequences: a 2-gram is a word pair, a 3-gram is a three-word phrase, and so on.
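The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the model's actual code: the names `build_tables` and `predict`, the single dictionary keyed by context (in place of separate 1-gram to 4-gram tables), and the toy training lines are all hypothetical. Each word is scored at the longest context in which it was seen, and the score is multiplied by λ once per backoff step.

```python
from collections import Counter, defaultdict

LAMBDA = 0.4    # backoff penalty from the text
MAX_ORDER = 4   # longest n-gram: 3 words of context plus the predicted word


def build_tables(token_lines, max_order=MAX_ORDER):
    """Map each context (a tuple of 0 to max_order - 1 words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for tokens in token_lines:
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                tables[tuple(context)][word] += 1
    return tables


def predict(tables, phrase, k=3, lam=LAMBDA, max_order=MAX_ORDER):
    """Return up to k (word, score) pairs, best first."""
    tokens = phrase.lower().split()
    context = tokens[-(max_order - 1):] if max_order > 1 else []
    scores = {}
    penalty = 1.0
    while True:
        followers = tables.get(tuple(context))
        if followers:
            # Relative frequency among the words seen after this context.
            total = sum(followers.values())
            for word, count in followers.items():
                # setdefault keeps the score from the longest matching context.
                scores.setdefault(word, penalty * count / total)
        if not context:
            break  # the empty context is the single-word fallback
        context = context[1:]  # drop the oldest word and back off
        penalty *= lam
    # Sort by score, breaking ties alphabetically so results are repeatable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy training data, already lowercased and split into words.
training = [line.split() for line in [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow",
    "looking for a way",
    "see you at the beach",
]]
tables = build_tables(training)

print(predict(tables, "thanks for the"))
print(predict(tables, "xyzzy plugh"))
```

Unlike a strict first-match rule, this sketch keeps collecting lower-scored candidates from shorter contexts so that three suggestions can be offered even when the longest context yields only one or two. For the unseen phrase in the second call, every context lookup fails and the suggestions come from the most frequent single words.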

Key Takeaways

In plain English, here’s what matters:

Our data has about 102 million words across blogs, news, and tweets, but most of it is made up of just a few hundred common words.
We can remove word combinations that appear only once with no meaningful loss of accuracy, which shrinks the model by 78%.
Our prediction tool will work like your phone’s keyboard suggestions: fast, simple, and always ready with a guess.