Overview

This report summarises the exploratory analysis of the HC Corpora English dataset for the Johns Hopkins Data Science Capstone. The goal is to build a next-word prediction app powered by an N-gram language model. Three source files were analysed — blogs, news, and Twitter — all downloaded from the Coursera Capstone page.


1. The Data

The English corpus consists of three plain-text files. The table below shows the key statistics computed from each full file.

Table 1: en_US Corpus File Statistics
File               Size (MB)  Lines      Words       Avg chars / line  Longest line (chars)
en_US.blogs.txt    201        899,288    37,272,578  233.9             40,836
en_US.news.txt     197        1,010,242  34,309,642  202.3             11,385
en_US.twitter.txt  160        2,360,148  30,341,028  70.7              214

Three quick takeaways:

  • Twitter has the most lines (2.36 million) but the shortest entries, averaging 70.7 characters. Tweets were nominally capped at 140 characters, yet the longest line in the file is 214 characters, so the cap does not hold strictly in this data.
  • Blogs have the fewest entries (899,288) but the longest average length (234 chars/line) and the longest single entry, at 40,836 characters.
  • News sits between the two on average length (202 chars/line): formal prose in moderately long entries.
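
The figures in Table 1 can be reproduced with a single pass over each file. The sketch below is an illustration in Python, not the code used for this report: the function name file_stats, the whitespace word-splitting, and the UTF-8 decoding are all assumptions, so its counts may differ slightly from the table.

```python
from pathlib import Path


def file_stats(path, encoding="utf-8"):
    """Return size, line count, word count, mean chars per line, and longest line."""
    lines = words = chars = longest = 0
    # errors="replace" keeps the pass going if a file contains invalid bytes.
    with open(path, encoding=encoding, errors="replace") as fh:
        for raw in fh:
            line = raw.rstrip("\r\n")      # do not count the line terminator
            lines += 1
            words += len(line.split())     # assumption: words are whitespace-separated
            chars += len(line)
            longest = max(longest, len(line))
    return {
        "size_mb": Path(path).stat().st_size / 2**20,
        "lines": lines,
        "words": words,
        "avg_chars": chars / lines if lines else 0.0,
        "longest": longest,
    }
```

Reading line by line keeps memory use flat, which matters for files of roughly 200 MB each.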

2. Word & Line-Length Distributions

A 5% random sample (~213,000 lines) was used for the plots below.
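
The report does not say how the sample was drawn. One simple option, shown here as an assumption, is Bernoulli sampling: keep each line independently with probability 0.05, using a fixed seed so the sample is reproducible. The name sample_lines and the seed value are illustrative.

```python
import random


def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the global seed is untouched
    return [line for line in lines if rng.random() < fraction]
```

Because each line is kept independently, the sample size is only approximately 5% of the input, which fits the "~213,000 lines" figure quoted above.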

2a. Line Length Distribution (histogram)

Twitter’s histogram is concentrated below the 140-character limit, with only a thin tail beyond it (the longest line is 214 characters); blog and news entries follow a right-skewed, log-normal shape typical of natural writing.

2b. Word-Frequency Distribution (Zipf’s Law)


3. Most Frequent Words

Stop words (“the”, “and”, “to”) dominate in every source. “I” ranks 4th in blogs but 2nd on Twitter, reflecting Twitter’s first-person conversational style.


4. Vocabulary Coverage

How many unique words are needed to account for most of the text?

Table 2: Words Needed to Cover X% of All Tokens
Coverage  Unique words needed  Notes
50%       127                  Core function words only
90%       6,694                Good practical vocabulary
95%       15,387               Covers almost all everyday text
99%       ~78,000              Includes rare/specialised terms

This is Zipf’s Law at work: just 127 words cover half of all tokens. In practice, a vocabulary of roughly 10,000 words falls between the 90% threshold (6,694 words) and the 95% threshold (15,387 words), so it should handle a little over 90% of everyday text; the remaining words can be treated as unknown.
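
The coverage figures in Table 2 come from a cumulative sum over words sorted by frequency. A minimal sketch of that calculation follows; the function name words_needed is hypothetical, and the input is assumed to be a Counter of word frequencies.

```python
from collections import Counter


def words_needed(counts, coverage):
    """Smallest number of top-ranked words whose counts reach `coverage` of all tokens."""
    total = sum(counts.values())
    running = 0
    # most_common() yields (word, count) pairs from most to least frequent.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= coverage * total:
            return rank
    return len(counts)
```

For example, in the nine-token text "the the the the cat sat on the mat", the single word "the" already covers more than half of the tokens, while 90% coverage needs all five distinct words.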


5. N-gram Snapshots

N-grams are sequences of consecutive words and are the building blocks of the prediction model. The table below shows the most common 2- and 3-word phrases.

Table 3: Most Frequent Bigrams and Trigrams (5% sample, blogs)
Rank  Top Bigrams  Top Trigrams
1     of the       one of the
2     in the       a lot of
3     to the       be able to
4     on the       i want to
5     to be        as well as
6     i have       the end of
7     it was       a couple of
8     a lot        going to be

These recurring phrase patterns suggest that an N-gram model will find strong, reliable signals in this corpus.
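
Counting N-grams amounts to sliding a window of n tokens over each line. The sketch below shows one way to do it; the tokenizer (lower-casing and keeping only letters and apostrophes) is an assumption, as are the names tokenize and ngram_counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(lines, n):
    """Count n-grams within each line; n-grams never span two lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        # zip over n shifted copies of the token list yields each window of n tokens.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts
```

Calling ngram_counts(lines, 2).most_common(8) and ngram_counts(lines, 3).most_common(8) on the blog sample would give rankings in the style of Table 3.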


6. Algorithm & App Plan

Prediction algorithm: A Stupid Back-off N-gram model (Brants et al., 2007) trained on a 30% sample of the corpus:

  1. Take the last 3 words the user typed
  2. Look up matching 4-word phrases (quadgrams) in a frequency table
  3. If no match, back off to 3-word phrases, then 2-word, then single words, multiplying the score by 0.4 at each back-off step
  4. Return the top 5 candidates ranked by score
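
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the planned implementation: the names build_tables and predict are hypothetical, the tables are plain in-memory Counters, and the loop keeps backing off until k candidates have been collected instead of stopping at the first order with any match. The score at each order is the relative frequency count(context + word) / count(context), multiplied by 0.4 once per back-off step.

```python
from collections import Counter

ALPHA = 0.4  # back-off penalty from the plan above


def build_tables(tokenized_lines, max_n=4):
    """Build one frequency table per order, from unigrams up to max_n-grams."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in tokenized_lines:
        for n in range(1, max_n + 1):
            tables[n].update(zip(*(tokens[i:] for i in range(n))))
    return tables


def predict(tables, typed, k=5, max_n=4):
    """Return up to k (word, score) pairs for the word following `typed`."""
    context = tuple(typed[-(max_n - 1):])  # last 3 words when max_n is 4
    scores = {}
    penalty = 1.0
    while True:
        n = len(context) + 1
        # Denominator: count of the context, or the total token count for unigrams.
        denom = tables[n - 1][context] if context else sum(tables[1].values())
        if denom:
            # Linear scan for clarity; a real model would index n-grams by context.
            for gram, count in tables[n].items():
                if gram[:-1] == context and gram[-1] not in scores:
                    scores[gram[-1]] = penalty * count / denom
        if not context or len(scores) >= k:
            break
        context = context[1:]   # drop the oldest word
        penalty *= ALPHA        # apply the penalty once per back-off step
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

A word found at a higher order keeps that score and is not overwritten by a lower-order match, which is what makes longer matches win. The scores are not probabilities and do not sum to one; Stupid Back-off trades that property for speed.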

Why this approach? It is fast (< 5 ms per prediction), memory-efficient (the model fits in < 300 MB RAM), and straightforward to deploy.

Shiny App features:

  • Text input box — predictions update as you type
  • Top-5 suggested words shown as clickable buttons
  • Clicking a word appends it and re-predicts
  • Word frequency explorer tab

Summary

Item            Finding
Corpus size     558 MB, ~102 million words across 3 files
Largest source  Twitter by line count (2.36M lines); Blogs by size and word count (201 MB, 37.3M words)
Key insight     127 words = 50% coverage; 6,694 = 90% coverage
Profanity       Present in the data; will be filtered before training
Algorithm       Stupid Back-off over unigram–quadgram frequency tables

Data source: HC Corpora (en_US), downloaded from the Coursera Capstone page.