Overview

This report explores a large corpus of English text from blogs, news articles, and Twitter to build the foundation of a predictive text application, similar to the autocomplete feature on a smartphone keyboard.

The Data

The dataset comes from SwiftKey and contains three sources of English text. A 5% random sample was used for this analysis.
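As a rough illustration of how a 5% random sample of lines might be drawn, here is a minimal Python sketch. The function name `sample_lines`, the fixed seed, and the toy corpus are assumptions made for this example and are not taken from the analysis itself; the report does not state how its sample was produced.

```python
import random

def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line independently with probability `fraction`.

    The seed is fixed only so that the sample is reproducible (an assumption).
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Toy stand-in for a raw text file; the real files are far larger.
corpus = [f"line {i}" for i in range(10_000)]
sample = sample_lines(corpus)
print(len(sample))  # roughly 5% of 10,000 lines
```

Sampling line by line like this means the exact sample size varies slightly from run to run unless the seed is fixed.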

Summary of Raw Text Files
Source     Lines      Words       Size (MB)
Blogs      899288     37334131    267.8
News       1010206    34371031    269.8
Twitter    2360148    30373543    334.5

Most Frequent Words

Stopwords (such as "the", "a", and "is") are kept intentionally because they are critical for predicting natural word sequences.
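A word-frequency count that keeps stopwords could be sketched as follows. This is an illustrative Python example, not the code behind the report; the tokenizer rule (lowercase letters with an optional internal apostrophe) and the names `tokenize` and `top_words` are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep alphabetic words (allowing one internal apostrophe).
    # Stopwords are deliberately NOT filtered out.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def top_words(lines, k=2):
    """Return the k most frequent words across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts.most_common(k)

toy_lines = ["The cat is on the mat", "A dog is in the yard"]
print(top_words(toy_lines))  # [('the', 3), ('is', 2)]
```

Even in this tiny example the top of the list is dominated by stopwords, which is exactly the material a next-word predictor needs to see.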

Most Frequent Bigrams

Most Frequent Trigrams

Prediction Algorithm Plan

The prediction model uses a Stupid Backoff approach with n-grams:

  1. Given the last 3 words, search for a matching 4-gram
  2. If not found, back off to the last 2 words and search trigrams
  3. If not found, back off to the last word and search bigrams
  4. If nothing matches, return the most common words overall

Because every lookup eventually falls back to overall word frequencies, the model still returns a prediction for word combinations it has never seen, rather than assigning them zero probability.
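The four backoff steps can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the planned implementation: the names `build_tables` and `predict`, the whitespace tokenizer, and the toy sentences are all hypothetical. It also returns candidates from the longest matching context only and omits the score discounting that a full Stupid Backoff model applies when comparing candidates across n-gram orders.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """Map each context (a tuple of 0 to max_n-1 words) to next-word counts."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for ctx_len in range(max_n):
                if i - ctx_len < 0:
                    break  # not enough preceding words for this context length
                context = tuple(tokens[i - ctx_len:i])
                tables[context][word] += 1
    return tables

def predict(tables, text, k=3, max_n=4):
    """Try the longest context first, then back off one word at a time."""
    tokens = text.lower().split()
    for ctx_len in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - ctx_len:])
        if context in tables:
            return [w for w, _ in tables[context].most_common(k)]
    return []  # only reached when the tables are empty

# Toy training data (an assumption for the example).
sentences = [
    "i love new york city",
    "i love new shoes",
    "i love new york pizza",
    "we love the city",
]
tables = build_tables(sentences)

print(predict(tables, "i love new", k=1))        # 4-gram match
print(predict(tables, "they really love", k=1))  # backs off to bigrams
print(predict(tables, "zebra", k=1))             # backs off to most common word
```

The empty tuple serves as the context for plain word frequencies, so the final fallback in step 4 uses the same lookup as the other steps.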

Shiny App Plan

The final app will: