This report explores a large corpus of English text from blogs, news articles, and Twitter to build the foundation of a predictive text application, similar to the autocomplete feature on a smartphone keyboard.
The dataset comes from SwiftKey and contains three sources of English text. A 5% random sample was used for this analysis.
| Source | Lines | Words | Size (MB) |
|---|---|---|---|
| Blogs | 899288 | 37334131 | 267.8 |
| News | 1010206 | 34371031 | 269.8 |
| Twitter | 2360148 | 30373543 | 334.5 |
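The report does not say how the 5% sample was drawn. As a minimal sketch, assuming each line is kept independently with probability 0.05, the sampling step might look like the following Python. The function name `sample_lines`, the seed, and the toy `corpus` list are illustrative assumptions, not part of the original analysis.

```python
import random

def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Toy stand-in for one source file; a real run would iterate over the file's lines.
corpus = [f"line {i}" for i in range(10_000)]
sample = sample_lines(corpus)
print(len(sample))  # roughly 5% of 10,000
```

Sampling line by line keeps memory use low, since the same test can be applied while streaming a large file instead of loading it whole.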
Stopwords (such as "the", "a", and "is") are kept intentionally because they are critical for predicting natural language sequences.
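To illustrate why stopwords matter, here is a small Python sketch that tokenizes text without filtering them and then counts n-grams. The tokenizer rule (lowercase, letters and apostrophes only) and the helper names `tokenize` and `ngram_counts` are assumptions made for this example. The report does not describe its own cleaning steps in this section.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes; stopwords are NOT filtered out.
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    # Count every run of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = tokenize("The cat is on the mat and the dog is on the rug")
bigrams = ngram_counts(tokens, 2)
print(bigrams[("on", "the")])  # 2
```

In this toy sentence the most useful continuation of "is on" is the stopword "the". A model trained on text with stopwords removed could never suggest it.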
The prediction model uses a Stupid Backoff approach with n-grams:
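This handles unseen word combinations gracefully: when a longer n-gram has not been observed, the model backs off to a shorter one, so any context still yields a prediction. The resulting values are relative scores rather than normalized probabilities.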
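The backoff idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the report's implementation: the backoff factor 0.4 is the value suggested in the original Stupid Backoff paper (Brants et al., 2007), the maximum order of 3 is chosen for brevity, and the names `build_counts`, `score`, and `predict` are hypothetical.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor from Brants et al. (2007); assumed, not stated in the report

def build_counts(tokens, max_n=3):
    """Count all n-grams of order 1..max_n in one Counter keyed by tuples."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def score(word, context, counts, total):
    """Stupid Backoff score of `word` following `context` (a tuple of words)."""
    penalty = 1.0
    while context:
        seen = counts[context + (word,)]
        if seen > 0:
            return penalty * seen / counts[context]
        context = context[1:]   # drop the oldest word and back off
        penalty *= ALPHA        # each backoff step is discounted by ALPHA
    return penalty * counts[(word,)] / total  # unigram relative frequency

def predict(context, counts, vocab, total, max_n=3, k=3):
    """Return the k highest-scoring next words (ties broken alphabetically)."""
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    return sorted(vocab, key=lambda w: (-score(w, context, counts, total), w))[:k]

tokens = "the cat sat on the mat the cat ate the fish".split()
counts = build_counts(tokens)
vocab = sorted(set(tokens))
total = len(tokens)

print(predict(["on", "the"], counts, vocab, total))        # ['mat', 'cat', 'fish']
print(predict(["purple", "zebra"], counts, vocab, total))  # falls back to unigram counts
```

For the context "on the", the trigram "on the mat" was observed, so "mat" scores 1.0. "cat" is only found after backing off to the bigram "the cat", giving 0.4 × 2/4 = 0.2. For a context never seen in training, every vocabulary word still receives a positive score from its unigram frequency.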
The final app will: