Synopsis

This report summarises the exploratory analysis of a large English text corpus provided by SwiftKey and Coursera. The data comes from three sources — blogs, news articles, and Twitter — and will be used to build a predictive text model that suggests the next word as a user types. Below we present corpus-level statistics, word and n-gram frequency distributions, and our plan for the prediction algorithm.

Loading the Data

The corpus consists of three plain-text files in the en_US locale. We read all lines and sample 50,000 lines per file for a representative yet fast analysis.
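
As a sketch of the loading step (the file names follow the standard en_US layout of the SwiftKey download; the data/ path and the seed are illustrative):

    set.seed(1234)  # illustrative seed so the 50,000-line sample is reproducible

    sample_lines <- function(path, n = 50000) {
      # skipNul guards against embedded null characters in the raw files
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, min(n, length(lines)))
    }

    blogs   <- sample_lines("data/en_US/en_US.blogs.txt")
    news    <- sample_lines("data/en_US/en_US.news.txt")
    twitter <- sample_lines("data/en_US/en_US.twitter.txt")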

Corpus Summary

The table below shows the size of each raw file, the total number of lines, and basic token statistics computed from our 50,000-line sample.

Corpus summary statistics (en_US)
Source    File Size (MB)    Total Lines    Lines Sampled    Word Count    Unique Words    Avg Words/Line
Blogs     200.4             899,288        50,000           2,045,541     76,854          41.0
News      196.3             1,010,242      50,000           1,662,781     71,445          33.3
Twitter   159.4             2,360,148      50,000           615,522       38,663          12.3
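
The token statistics above can be reproduced roughly as follows; the whitespace tokeniser here is a simplifying assumption and may differ slightly from the one behind the exact figures in the table.

    summarise_sample <- function(lines) {
      # Deliberately simple tokenisation: lowercase, split on whitespace
      tokens <- unlist(strsplit(tolower(lines), "\\s+"))
      tokens <- tokens[tokens != ""]
      c(word_count     = length(tokens),
        unique_words   = length(unique(tokens)),
        avg_words_line = round(length(tokens) / length(lines), 1))
    }

    summarise_sample(blogs)    # word count, unique words, average words per line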

Key take-away: Blogs tend to have longer entries, news articles cluster around 30-40 words, and tweets are short by design. Twitter contributes the most lines but the fewest words per line.

Exploratory Analysis

Distribution of Line Lengths

The histograms below show how many words appear per line in each source. The shapes are strikingly different, reflecting the nature of each medium.
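
A base-graphics sketch of how these histograms can be drawn from the sampled lines (the report's actual figures may have been produced with a different plotting package):

    # Words per line for each source, as side-by-side histograms
    words_per_line <- function(lines) lengths(strsplit(lines, "\\s+"))

    par(mfrow = c(1, 3))
    hist(words_per_line(blogs),   breaks = 50, main = "Blogs",   xlab = "Words per line")
    hist(words_per_line(news),    breaks = 50, main = "News",    xlab = "Words per line")
    hist(words_per_line(twitter), breaks = 50, main = "Twitter", xlab = "Words per line")
    par(mfrow = c(1, 1))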

Most Frequent Words

The top 20 words across all three sources are dominated by common English function words (the, to, and, a, of). This is consistent with Zipf’s Law, which states that a word’s frequency is roughly inversely proportional to its rank, so a small number of very common words account for most of the text.
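
A minimal way to build the unigram frequency table behind this ranking; the letters-and-apostrophes tokeniser is an assumption:

    # Unigram frequencies across the combined 150,000-line sample
    all_lines <- c(blogs, news, twitter)
    tokens    <- unlist(strsplit(tolower(all_lines), "[^a-z']+"))
    tokens    <- tokens[tokens != ""]
    word_freq <- sort(table(tokens), decreasing = TRUE)
    head(word_freq, 20)    # dominated by the, to, and, a, of, ...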

Most Frequent Word Pairs (Bigrams) and Trigrams

Bigrams and trigrams reveal common phrases in the data. The top bigrams are largely function-word pairs (of the, in the, to the), while the top trigrams start to show more meaningful expressions that differ by source.

Notice how Twitter trigrams (thanks for the, looking forward to, thank you for) are conversational, while news trigrams (according to the, the united states) are more formal.
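
The same idea extends to bigrams and trigrams; a short base-R sketch, slow but adequate for a 50,000-line sample, using the same tokenisation assumption as the unigram sketch above:

    # Count n-grams in a character vector of lines
    count_ngrams <- function(lines, n = 2) {
      grams <- unlist(lapply(strsplit(tolower(lines), "[^a-z']+"), function(w) {
        w <- w[w != ""]
        if (length(w) < n) return(character(0))
        vapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "),
               character(1))
      }))
      sort(table(grams), decreasing = TRUE)
    }

    head(count_ngrams(twitter, n = 3), 10)    # e.g. thanks for the, looking forward to, ...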

Word Frequency Distribution (Zipf’s Law)

Plotting word rank against frequency on a log-log scale produces the characteristic near-linear curve predicted by Zipf’s Law. The most frequent word (the) appears hundreds of thousands of times, while the vast majority of words appear only once or twice.
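
Given the word_freq table from the unigram sketch above, the rank-frequency plot is a few lines of base graphics:

    # Rank vs frequency on log-log axes
    freqs <- as.numeric(word_freq)
    plot(seq_along(freqs), freqs, log = "xy", type = "l",
         xlab = "Word rank (log scale)", ylab = "Frequency (log scale)",
         main = "Zipf's Law in the combined sample")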

Vocabulary Coverage

A practical question for any prediction model is: how many unique words do we need to cover most of the text?

Only 139 words are needed to cover 50% of all word occurrences, and 7,993 words cover 90%. The full vocabulary contains 126,375 unique words, so the top 6.3% of the dictionary handles 90% of usage. This has a direct implication for model size: we can aggressively prune rare words with minimal impact on prediction quality.
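
The coverage numbers come from a cumulative sum over the ranked frequencies; a sketch, again reusing word_freq from the unigram sketch above:

    # How many top-ranked words cover a given share of all word occurrences?
    coverage  <- cumsum(as.numeric(word_freq)) / sum(word_freq)
    words_for <- function(p) which(coverage >= p)[1]

    words_for(0.5)    # words needed for 50% coverage
    words_for(0.9)    # words needed for 90% coverage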

Key Findings

  1. The three sources are very different. Blogs are long-form and varied, news is structured and formal, and Twitter is short and conversational. A good model should learn from all three.

  2. Word frequencies follow Zipf’s Law. A tiny fraction of the vocabulary accounts for most of the text, which means frequency-based pruning is highly effective.

  3. N-gram patterns differ by source. Twitter favours social phrases (thanks for the, looking forward to), while news uses institutional language (the united states, according to the). The prediction model should capture these context-dependent patterns.

  4. Coverage is efficient. Roughly 7,993 unique words cover 90% of all word instances, suggesting a compact model is feasible for mobile deployment.

Plans for the Prediction Algorithm

We plan to build a Stupid Backoff n-gram model with the following design:

Component     Detail
Model type    N-gram language model (up to 4-grams)
Smoothing     Stupid Backoff (penalty factor 0.4 per backoff level)
Storage       Keyed data.table objects with prefix + predicted-word columns
Pruning       Drop n-grams seen only once; keep top 5 predictions per context
Deployment    Interactive Shiny app on shinyapps.io

How it works: When the user types a phrase, the app extracts the last 1-3 words and looks up matching contexts in the 4-gram table first. If no match is found, it “backs off” to the 3-gram table (with a score penalty), then the 2-gram table, and finally falls back to the most common words. This approach is fast (pure table lookups), memory-efficient, and well-suited for a Shiny app running on limited server resources.
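
A schematic of the backoff lookup, assuming keyed data.table objects with prefix, word, and score columns as described above; the toy tables, column names, and scores below are illustrative placeholders, not the production tables.

    library(data.table)

    # Toy n-gram tables: prefix = preceding context, word = prediction, score = relative frequency
    ng4  <- data.table(prefix = "the end of", word = "the", score = 0.31, key = "prefix")
    ng3  <- data.table(prefix = "end of",     word = "the", score = 0.28, key = "prefix")
    ng2  <- data.table(prefix = "of",         word = "the", score = 0.12, key = "prefix")
    top1 <- data.table(word = c("the", "to", "and"), score = c(0.06, 0.03, 0.03))

    predict_next <- function(phrase, lambda = 0.4) {
      words  <- tolower(strsplit(trimws(phrase), "\\s+")[[1]])
      tables <- list(ng4, ng3, ng2)       # longest context first
      orders <- 3:1                       # context lengths to try
      for (i in seq_along(orders)) {
        n <- orders[i]
        if (length(words) < n) next
        ctx <- paste(tail(words, n), collapse = " ")
        hit <- tables[[i]][.(ctx), nomatch = 0L]                # keyed lookup on prefix
        if (nrow(hit) > 0) {
          hit <- hit[, .(word, score = score * lambda^(i - 1))] # Stupid Backoff penalty per level
          return(head(hit[order(-score)], 5))
        }
      }
      head(top1, 5)                       # no context matched: most common words
    }

    predict_next("We reached the end of")    # looks up "the end of" in the 4-gram table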

Evaluation plan: We will hold out 20% of the corpus as a test set and measure top-1 and top-3 prediction accuracy — i.e., how often the correct next word appears among the model’s top suggestions.
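
Top-1 and top-3 accuracy can then be computed from (context, next word) pairs extracted from held-out lines; a sketch using the predict_next() stub above, with placeholder test cases:

    # Share of test cases where the true next word is among the top k suggestions
    top_k_accuracy <- function(cases, k = 3) {
      hits <- vapply(cases, function(case) {
        preds <- predict_next(case$context)$word
        case$actual %in% head(preds, k)
      }, logical(1))
      mean(hits)
    }

    cases <- list(
      list(context = "we reached the end", actual = "of"),
      list(context = "thanks for",         actual = "the")
    )
    top_k_accuracy(cases, k = 1)    # top-1 accuracy
    top_k_accuracy(cases, k = 3)    # top-3 accuracy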