This report analyses English-language text to build a word-prediction tool, much like the suggestions your phone shows as you type. We examined three sources of real writing (blog posts, news articles, and tweets), about 102 million words in total. The key finding is that language is dominated by a small set of common words, while most words are rare. Because of this, we can strip out word combinations that appear only once and shrink the underlying model by 78% with no meaningful loss of accuracy, making the tool faster and cheaper to run. Next, we will build a prediction engine that suggests the most likely next word, delivered through a simple keyboard-style autocomplete bar.
This report documents the exploratory analysis of the SwiftKey English text corpus, a collection of blog posts, news articles, and tweets provided for the Johns Hopkins Data Science Capstone. It covers the size of the data, which words and phrases appear most often, how much of the language a small vocabulary can cover, and a plain-English outline of what we will build next.
The data comes from three English text files representing three styles of writing: long-form blogs, formal news, and short social-media posts (Twitter).
| File | Size (MB) | Lines | Words |
|---|---|---|---|
| en_US.blogs.txt | 200.4 | 899,288 | 37,334,131 |
| en_US.news.txt | 196.3 | 1,010,242 | 34,372,531 |
| en_US.twitter.txt | 159.4 | 2,360,148 | 30,373,583 |
| Total | 556.1 | 4,269,678 | 102,080,245 |
Twitter has by far the most lines, nearly 2.4 million, but the fewest words per line (about 13, against roughly 42 for blogs and 34 for news), reflecting how short tweets are. Blogs have the fewest entries but the longest posts; news sits in between. Why this matters: covering all three styles helps the tool predict well whether someone is writing casually or formally.
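The words-per-line comparison follows directly from the table. A minimal sketch of the arithmetic, with the figures copied from the table (the dictionary layout and function name are illustrative, not part of the original analysis):

```python
# Line and word counts copied from the summary table.
corpus = {
    "en_US.blogs.txt": {"lines": 899_288, "words": 37_334_131},
    "en_US.news.txt": {"lines": 1_010_242, "words": 34_372_531},
    "en_US.twitter.txt": {"lines": 2_360_148, "words": 30_373_583},
}


def words_per_line(stats):
    """Return the average number of words per line for each file."""
    return {name: s["words"] / s["lines"] for name, s in stats.items()}


for name, avg in words_per_line(corpus).items():
    print(f"{name}: {avg:.1f} words per line")
```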
For this analysis, a 0.5% random sample of each file was used to keep rendering time manageable. The text was lowercased and stripped of numbers, punctuation, and URLs before being split into words.
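The sampling and cleaning steps could look roughly like the sketch below. It is an illustration under assumptions, not the code used for this report: the seed, the regular expressions, and the decision to keep apostrophes inside words (so that "it's" stays one token) are all choices made here for the example.

```python
import random
import re

# Assumed patterns: anything starting with http(s):// or www. counts as a URL.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Everything except lowercase letters, apostrophes, and whitespace is dropped.
NON_LETTER_RE = re.compile(r"[^a-z'\s]")


def sample_lines(lines, fraction=0.005, seed=42):
    """Keep each line independently with probability `fraction` (about 0.5%)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean_and_tokenize(text):
    """Lowercase, remove URLs, numbers, and punctuation, then split into words."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # Strip apostrophes left dangling at the edges of a token.
    return [tok.strip("'") for tok in text.split() if tok.strip("'")]


print(clean_and_tokenize("Check https://example.com NOW, it's 100% free!"))
# → ['check', 'now', "it's", 'free']
```

Sampling line by line in this way gives approximately, rather than exactly, 0.5% of each file.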
The chart below shows the 15 most common single words. The longer the bar, the more often that word appears. The top words are almost all small connecting words: “the”, “to”, “and”, “a”, “of”. Why this matters: these words are so common that the tool must get them right if its suggestions are to feel accurate.
Two-word combinations, pairs of words that often appear together, are led by “of the”, “in the”, and “to the”. Why this matters: these pairs let the tool guess the next word once it has seen just one word of context.
Three-word sequences give richer context. Phrases like “one of the”, “a lot of”, and “as well as” appear most often. Why this matters: longer phrases let the tool make smarter, more specific predictions.
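Counting single words, pairs, and three-word sequences is the same operation with a different window size. A minimal sketch, assuming each line has already been cleaned and split into a list of words (the function names and the toy data are illustrative):

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive words as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(token_lists, n):
    """Count n-grams line by line, so no n-gram spans two separate lines."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))
    return counts


# Toy data standing in for the sampled corpus.
docs = [
    ["one", "of", "the", "best"],
    ["one", "of", "the", "few"],
    ["a", "lot", "of", "the", "time"],
]

print(count_ngrams(docs, 2).most_common(1))  # → [(('of', 'the'), 3)]
print(count_ngrams(docs, 3).most_common(1))  # → [(('one', 'of', 'the'), 2)]
```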
Most of the text is made up of just a handful of common words, like “the” and “and”, while thousands of unusual words appear very rarely. This pattern is so common it has a name: Zipf’s Law (a word’s frequency is roughly inversely proportional to its rank, so a small number of common words account for most of all language use).
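One way to quantify this pattern is to ask how many of the most frequent words are needed to cover a given share of all word occurrences. The sketch below assumes a word-count table is already available; the function name and the toy counts are made up for illustration and are not results from this corpus.

```python
from collections import Counter


def words_needed_for_coverage(counts, target):
    """Return how many top-ranked words cover at least `target` of all tokens."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy counts only: two common words dominate, the rest are rare.
toy_counts = Counter({"the": 50, "and": 25, "of": 15, "zebra": 5, "quark": 5})

print(words_needed_for_coverage(toy_counts, 0.5))  # → 1
print(words_needed_for_coverage(toy_counts, 0.8))  # → 3
```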
This has a practical payoff. Word combinations that appear only once (“singletons”) can be removed without any meaningful loss of prediction accuracy, while shrinking the model dramatically. In our full corpus build, removing these one-off combinations cut the model from approximately 70 MB to 15.7 MB, a 78% reduction.
In practical terms, the model becomes 78% smaller with no meaningful loss of accuracy, making it faster and cheaper to run.
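Removing singletons amounts to a single filter over each frequency table, and the 78% figure can be checked from the two reported sizes. A minimal sketch (the function names and the `min_count` parameter are illustrative assumptions):

```python
from collections import Counter


def prune_singletons(counts, min_count=2):
    """Drop every n-gram seen fewer than `min_count` times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


def percent_reduction(before, after):
    """Size reduction as a percentage of the original size."""
    return 100 * (before - after) / before


toy = Counter({("of", "the"): 3, ("the", "zebra"): 1, ("a", "quark"): 1})
print(len(prune_singletons(toy)))  # → 1

# Model sizes in MB as reported for the full corpus build.
print(round(percent_reduction(70, 15.7)))  # → 78
```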
Our prediction tool will suggest the next word based on patterns it finds in the training data, always falling back to the most common word if no pattern matches. In plain terms: it looks for words that usually follow the last few words you’ve typed; if it finds no match, it backs up and looks at fewer words, so a suggestion is always available.
The tool will appear as a mobile-style keyboard autocomplete bar that suggests the next word as you type, with live suggestions (no submit button), the top three guesses ranked by confidence, and number-key shortcuts (1/2/3) to accept a word.
The model uses pre-built frequency tables for 1-grams through 4-grams. “Stupid Backoff” (Brants et al., 2007) is a prediction method that looks for matching word patterns and, if none exist, simplifies the search step by step. At query time, given the user’s input phrase, the model tries the longest available context first and backs off to shorter contexts until it finds a match.
This guarantees a prediction is always returned. Each step is a dictionary (hash-table) lookup, which is O(1) on average, so lookup time is sub-millisecond. Note: an “n-gram” is simply a sequence of n words: a 2-gram is a word pair, a 3-gram a three-word phrase, and so on.
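The backoff procedure described above can be sketched as follows. This is an illustration, not the report's implementation: the table layout, the function names, and the toy training data are assumptions, and the backoff factor of 0.4 is the value suggested by Brants et al. (2007). Each candidate word is scored at the longest context where it appears, and the score is multiplied by the factor once for every level the search has backed off.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # backoff factor suggested by Brants et al. (2007)


def build_tables(token_lists, max_n=4):
    """tables[n] maps an (n-1)-word context tuple to a Counter of next words."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                tables[n][tuple(context)][word] += 1
    return tables


def predict(tables, phrase, k=3, max_n=4):
    """Return up to k next-word guesses, best first, using Stupid Backoff."""
    tokens = phrase.lower().split()
    scores = {}
    penalty = 1.0
    for n in range(max_n, 0, -1):
        if len(tokens) < n - 1:
            continue  # not enough typed words for this order; no penalty
        context = tuple(tokens[len(tokens) - (n - 1):])
        followers = tables[n].get(context)  # one hash lookup per order
        if followers:
            total = sum(followers.values())
            for word, count in followers.items():
                # Keep the score from the longest context that saw this word.
                scores.setdefault(word, penalty * count / total)
        penalty *= ALPHA
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy training data standing in for the pruned frequency tables.
docs = [
    "one of the best".split(),
    "one of the few".split(),
    "one of the best".split(),
    "a lot of the time".split(),
]
tables = build_tables(docs)

print(predict(tables, "one of the"))  # → ['best', 'few', 'time']
print(predict(tables, "zzz"))         # unseen word: falls back to common words
```

Because the 1-gram table always matches, the function returns a suggestion for any input, including an empty phrase. This sketch scores every candidate on each call for clarity; a production build would more likely store only the top few followers per context so that each query stays a handful of lookups.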
In plain English, here’s what matters: