This report summarises the exploratory analysis of a large English text corpus provided by SwiftKey and Coursera. The data comes from three sources — blogs, news articles, and Twitter — and will be used to build a predictive text model that suggests the next word as a user types. Below we present corpus-level statistics, word and n-gram frequency distributions, and our plan for the prediction algorithm.
The corpus consists of three plain-text files in the en_US locale. We read all lines and sample 50,000 lines per file for a representative yet fast analysis.
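As a minimal sketch of the sampling step (the file paths and seed below are illustrative assumptions, not taken from the analysis itself):

```r
set.seed(2024)  # illustrative seed for reproducible sampling

# Read a raw file and keep a random sample of its lines.
sample_lines <- function(path, n = 50000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, min(n, length(lines)))
}

blogs   <- sample_lines("final/en_US/en_US.blogs.txt")
news    <- sample_lines("final/en_US/en_US.news.txt")
twitter <- sample_lines("final/en_US/en_US.twitter.txt")
```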
The table below shows the size of each raw file, the total number of lines, and basic token statistics computed from our 50,000-line sample.
| Source | File Size (MB) | Total Lines | Lines Sampled | Word Count | Unique Words | Avg Words/Line |
|---|---|---|---|---|---|---|
| Blogs | 200.4 | 899,288 | 50,000 | 2,045,541 | 76,854 | 41.0 |
| News | 196.3 | 1,010,242 | 50,000 | 1,662,781 | 71,445 | 33.3 |
| Twitter | 159.4 | 2,360,148 | 50,000 | 615,522 | 38,663 | 12.3 |
Key take-away: Blogs tend to have longer entries, news articles cluster around 30-40 words, and tweets are short by design. Twitter contributes the most lines but the fewest words per line.
The histograms below show how many words appear per line in each source. The shapes are strikingly different, reflecting the nature of each medium.
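One way to produce these histograms, reusing the sampled vectors from the sketch above (variable names are assumptions):

```r
library(ggplot2)
library(stringi)

# Count words per line in each sampled source.
wpl <- data.frame(
  source  = rep(c("Blogs", "News", "Twitter"),
                c(length(blogs), length(news), length(twitter))),
  n_words = c(stri_count_words(blogs),
              stri_count_words(news),
              stri_count_words(twitter))
)

ggplot(wpl, aes(n_words)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ source, scales = "free") +
  labs(x = "Words per line", y = "Number of lines")
```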
The top 20 words across all three sources are dominated by common English function words (the, to, and, a, of). This is consistent with Zipf’s Law, under which a word’s frequency is roughly inversely proportional to its rank, so a handful of very common words account for most of the text.
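For reference, a hedged sketch of how such a frequency table can be computed; the tokenisation rules here are deliberately simplified assumptions:

```r
library(stringi)

# Lowercase the text, extract words, and tabulate frequencies.
tokenize <- function(lines) {
  w <- unlist(stri_extract_all_words(stri_trans_tolower(lines)))
  w[!is.na(w)]
}

all_words <- tokenize(c(blogs, news, twitter))
word_freq <- sort(table(all_words), decreasing = TRUE)
head(word_freq, 20)  # dominated by function words: the, to, and, a, of
```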
Bigrams and trigrams reveal common phrases in the data. The top bigrams are largely function-word pairs (of the, in the, to the), while the top trigrams start to show more meaningful expressions that differ by source.
Notice how Twitter trigrams (thanks for the, looking forward to, thank you for) are conversational, while news trigrams (according to the, the united states) are more formal.
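As an illustration, n-grams can be built by pasting shifted copies of the token vector. This sketch reuses the tokenize() helper above and, for simplicity, ignores line boundaries:

```r
# Paste n consecutive tokens into space-separated n-grams.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  m <- length(words) - n + 1
  shifted <- lapply(0:(n - 1), function(k) words[(1 + k):(m + k)])
  do.call(paste, shifted)  # element-wise paste across the n shifts
}

tw_words    <- tokenize(twitter)
tw_bigrams  <- sort(table(make_ngrams(tw_words, 2)), decreasing = TRUE)
tw_trigrams <- sort(table(make_ngrams(tw_words, 3)), decreasing = TRUE)
head(tw_trigrams, 5)  # e.g. "thanks for the", "looking forward to"
```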
Plotting word rank against frequency on a log-log scale produces the characteristic near-linear curve predicted by Zipf’s Law. The most frequent word (the) appears hundreds of thousands of times, while the vast majority of words appear only once or twice.
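A sketch of the rank-frequency plot, assuming the word_freq table computed earlier:

```r
library(ggplot2)

# Rank vs. frequency on log-log axes: Zipf's Law predicts a near-line.
zipf <- data.frame(rank = seq_along(word_freq),
                   freq = as.numeric(word_freq))

ggplot(zipf, aes(rank, freq)) +
  geom_line() +
  scale_x_log10() + scale_y_log10() +
  labs(x = "Rank (log scale)", y = "Frequency (log scale)")
```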
A practical question for any prediction model is: how many unique words do we need to cover most of the text?
Only 139 words are needed to cover 50% of all word occurrences, and 7,993 words cover 90%. The full vocabulary contains 126,375 unique words, so the top 6.3% of the dictionary handles 90% of usage. This has a direct implication for model size: we can aggressively prune rare words with minimal impact on prediction quality.
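The coverage numbers follow directly from the cumulative frequency distribution; a sketch, again assuming word_freq from the earlier block:

```r
# Cumulative coverage: how many of the most frequent words are needed
# to account for a given share of all word occurrences?
cum_share <- cumsum(as.numeric(word_freq)) / sum(word_freq)
coverage  <- function(p) which(cum_share >= p)[1]
coverage(0.5)  # ~139 words in our sample
coverage(0.9)  # ~7,993 words
```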
- The three sources are very different. Blogs are long-form and varied, news is structured and formal, and Twitter is short and conversational. A good model should learn from all three.
- Word frequencies follow Zipf’s Law. A tiny fraction of the vocabulary accounts for most of the text, which means frequency-based pruning is highly effective.
- N-gram patterns differ by source. Twitter favours social phrases (thanks for the, looking forward to), while news uses institutional language (the united states, according to the). The prediction model should capture these context-dependent patterns.
- Coverage is efficient. Roughly 7,993 unique words cover 90% of all word instances, suggesting a compact model is feasible for mobile deployment.
We plan to build a Stupid Backoff n-gram model with the following design:
| Component | Detail |
|---|---|
| Model type | N-gram language model (up to 4-grams) |
| Smoothing | Stupid Backoff (penalty factor 0.4 per backoff level) |
| Storage | Keyed data.table objects with prefix + predicted-word columns |
| Pruning | Drop n-grams seen only once; keep top 5 predictions per context |
| Deployment | Interactive Shiny app on shinyapps.io |
How it works: When the user types a phrase, the app extracts the last 1-3 words and looks up matching contexts in the 4-gram table first. If no match is found, it “backs off” to the 3-gram table (with a score penalty), then the 2-gram table, and finally falls back to the most common words. This approach is fast (pure table lookups), memory-efficient, and well-suited for a Shiny app running on limited server resources.
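A minimal sketch of the backoff lookup, assuming pre-built tables n2, n3, n4 (holding prefixes of length 1, 2, 3) with columns prefix, word, and score, each keyed on prefix via setkey(), plus a unigram fallback top_words; all of these names are illustrative:

```r
library(data.table)

# Stupid Backoff: try the longest context first; multiply the score
# by 0.4 for each level we back off.
predict_next <- function(context, n2, n3, n4, top_words, k = 5) {
  tokens <- tail(strsplit(tolower(context), "\\s+")[[1]], 3)
  if (length(tokens) == 0)
    return(data.table(word = head(top_words, k), score = 1))
  tabs <- list(n2, n3, n4)  # tabs[[i]] holds prefixes of length i
  penalty <- 1
  for (need in seq(length(tokens), 1)) {
    prefix <- paste(tail(tokens, need), collapse = " ")
    hits <- tabs[[need]][.(prefix)]          # keyed prefix lookup
    if (!is.na(hits$word[1])) {
      hits <- head(hits[order(-score)], k)
      return(data.table(word = hits$word, score = penalty * hits$score))
    }
    penalty <- penalty * 0.4  # Stupid Backoff penalty per level
  }
  # No n-gram matched: fall back to the overall most common words.
  data.table(word = head(top_words, k), score = penalty)
}
```

For example, predict_next("thanks for", n2, n3, n4, top_words) should surface the near the top of the list, mirroring the Twitter trigram noted above.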
Evaluation plan: We will hold out 20% of the corpus as a test set and measure top-1 and top-3 prediction accuracy — i.e., how often the correct next word appears among the model’s top suggestions.
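A hedged sketch of this measurement, reusing the tokenize() and predict_next() helpers from the earlier sketches:

```r
# Top-k accuracy over held-out lines: the fraction of positions where
# the true next word appears among the model's top-k suggestions.
top_k_accuracy <- function(test_lines, n2, n3, n4, top_words, k = 3) {
  hits <- 0; total <- 0
  for (line in test_lines) {
    words <- tokenize(line)
    if (length(words) < 4) next
    for (i in 4:length(words)) {
      context <- paste(words[(i - 3):(i - 1)], collapse = " ")
      preds   <- predict_next(context, n2, n3, n4, top_words, k = k)
      hits    <- hits + (words[i] %in% preds$word)
      total   <- total + 1
    }
  }
  hits / total
}
```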