Predictive Text - Exploratory Analysis

Overview

This report explores a large corpus of English text from blogs, news articles, and Twitter to lay the foundation for a predictive text application, similar to the autocomplete feature on a smartphone keyboard.

The Data

The dataset comes from SwiftKey and contains three sources of English text. A 5% random sample was used for this analysis.
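As an illustration of the sampling step, here is a minimal Python sketch that draws a reproducible 5% sample of lines. The function name `sample_lines`, the seed value, and the choice to sample a fixed fraction of lines are assumptions made for this example; the report does not say how its sample was drawn or whether each source was sampled separately.

```python
import random


def sample_lines(lines, fraction=0.05, seed=2024):
    """Return a reproducible random sample of `lines`, in original order.

    `fraction` and `seed` are illustrative defaults, not values
    taken from the report.
    """
    rng = random.Random(seed)
    k = round(len(lines) * fraction)
    # Sample indices rather than lines so the original order can be kept.
    keep = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in keep]


if __name__ == "__main__":
    demo = [f"line {i}" for i in range(1000)]
    print(len(sample_lines(demo)))
```

Fixing the seed makes the exploratory counts repeatable between runs, which is useful when the sample feeds every later table and plot.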

Summary of Raw Text Files
Source	Lines	Words	Size (MB)
Blogs	899288	37334131	267.8
News	1010206	34371031	269.8
Twitter	2360148	30373543	334.5

Most Frequent Words

Stopwords (the, a, is) are kept intentionally: they are critical for predicting natural language sequences.
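The word counts here, and the bigram and trigram counts in the next two sections, could be produced along the following lines. This is a sketch in Python under stated assumptions: the tokenizer (lowercasing and keeping only letters and apostrophes) and the helper names `tokenize` and `ngram_counts` are hypothetical and not taken from the report. Stopwords are deliberately left in.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Stopwords are NOT removed. The regular expression is an
    assumption chosen for simplicity.
    """
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(lines, n):
    """Count n-grams of length `n`; n-grams never span two lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    sample = ["The cat is on the mat", "the cat is happy"]
    for n in (1, 2, 3):
        print(n, ngram_counts(sample, n).most_common(3))
```

Calling `ngram_counts(lines, 1)`, `ngram_counts(lines, 2)`, and `ngram_counts(lines, 3)` gives the frequency tables for words, bigrams, and trigrams respectively.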

Most Frequent Bigrams

Most Frequent Trigrams

Prediction Algorithm Plan

The prediction model will use a Stupid Backoff approach over n-grams, trying the longest available context first:

Given the last 3 words, search for a matching 4-gram
If not found, back off to the last 2 words and search trigrams
If not found, back off to the last word and search bigrams
If nothing matches, return the most common words overall

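Because the model always falls back to a shorter context, and finally to the most common words overall, every input yields a prediction, even when its exact word combination never appears in the corpus.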
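The backoff steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the app's implementation: the names `build_model` and `predict` and the dictionary-of-counters layout are hypothetical. Because the plan returns results from the first n-gram order that matches, the sketch ranks candidates by raw count and omits the score discount that the full Stupid Backoff scheme applies when mixing orders.

```python
from collections import Counter, defaultdict


def build_model(token_lists, max_n=4):
    """Map each context (1 to max_n - 1 words) to a Counter of next words."""
    unigrams = Counter()
    contexts = defaultdict(Counter)
    for tokens in token_lists:
        unigrams.update(tokens)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                contexts[context][tokens[i + n - 1]] += 1
    return unigrams, contexts


def predict(words, unigrams, contexts, k=3, max_n=4):
    """Return up to `k` next-word candidates, longest context first."""
    for size in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-size:])
        if context in contexts:
            return [w for w, _ in contexts[context].most_common(k)]
    # No context matched: fall back to the most common words overall.
    return [w for w, _ in unigrams.most_common(k)]


if __name__ == "__main__":
    corpus = [
        ["i", "love", "new", "york"],
        ["i", "love", "new", "shoes"],
        ["i", "love", "new", "york"],
        ["we", "love", "pizza"],
    ]
    unigrams, contexts = build_model(corpus)
    print(predict(["i", "love", "new"], unigrams, contexts))
    print(predict(["they", "love"], unigrams, contexts))
```

In the demo, the second call has no match for the two-word context, so it backs off to the single word "love" and ranks the words seen after it.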

Shiny App Plan

The final app will:

Accept user text input in real time
Predict the 3 most likely next words instantly
Run efficiently within Shiny’s memory limits by keeping only n-grams that appear 2 or more times in the corpus
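The pruning rule in the last point can be illustrated with a short Python sketch. The function name `prune` and the use of a `Counter` keyed by n-gram tuples are assumptions for this example; only the threshold of 2 comes from the plan above.

```python
from collections import Counter


def prune(counts, min_count=2):
    """Keep only n-grams seen at least `min_count` times."""
    return Counter({ngram: c for ngram, c in counts.items() if c >= min_count})


if __name__ == "__main__":
    counts = Counter({("the", "cat"): 5, ("cat", "sat"): 2, ("sat", "idly"): 1})
    kept = prune(counts)
    print(len(counts), "->", len(kept))
```

N-grams that occur only once usually make up a large share of the distinct entries in a text corpus, so dropping them can shrink the lookup tables considerably while removing the least reliable predictions.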