This report summarises the exploratory analysis of the HC Corpora English dataset for the Johns Hopkins Data Science Capstone. The goal is to build a next-word prediction app powered by an N-gram language model. Three source files were analysed (blogs, news, and Twitter), all downloaded from the Coursera Capstone page.
The English corpus consists of three plain-text files. The table below shows the key statistics computed from each full file.
| File | Size (MB) | Lines | Words | Avg chars / line | Longest line |
|---|---|---|---|---|---|
| en_US.blogs.txt | 201 | 899,288 | 37,272,578 | 233.9 | 40,836 |
| en_US.news.txt | 197 | 1,010,242 | 34,309,642 | 202.3 | 11,385 |
| en_US.twitter.txt | 160 | 2,360,148 | 30,341,028 | 70.7 | 214 |
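The statistics in the table can be reproduced along the following lines. This is a minimal sketch, not the code used for the report: the helper names `text_stats` and `file_stats` are hypothetical, words are assumed to be whitespace-delimited tokens, and sizes are assumed to be in units of 2^20 bytes, so the counts it produces may differ slightly from the table.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileStats:
    size_mb: float
    lines: int
    words: int
    avg_chars_per_line: float
    longest_line: int


def text_stats(lines):
    """Return (lines, words, avg chars per line, longest line) for an iterable of lines."""
    n_lines = n_words = n_chars = longest = 0
    for line in lines:
        line = line.rstrip("\n")
        n_lines += 1
        n_words += len(line.split())  # assumption: words are whitespace-delimited
        n_chars += len(line)
        longest = max(longest, len(line))
    avg = n_chars / n_lines if n_lines else 0.0
    return n_lines, n_words, avg, longest


def file_stats(path):
    """Stream one corpus file from disk and summarise it."""
    path = Path(path)
    with path.open(encoding="utf-8", errors="replace") as fh:
        n_lines, n_words, avg, longest = text_stats(fh)
    size_mb = path.stat().st_size / 2**20  # assumption: MB means 2**20 bytes
    return FileStats(size_mb, n_lines, n_words, avg, longest)
```

Calling `file_stats("en_US.twitter.txt")` would stream the file line by line, so the full corpus never has to be held in memory at once.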
Three quick takeaways:
A 5% random sample (~213,000 lines) was used for the plots below.
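One simple way to draw such a sample is to keep each line independently with probability 0.05. The sketch below assumes this approach; the function name `sample_lines` and the seed value are illustrative, and the report does not state how its sample was actually drawn.

```python
import random


def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line independently with probability `fraction`.

    The sample size is therefore only approximately fraction * len(lines).
    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]
```

Because every line is decided independently, this works on a stream and does not need the line count in advance.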
Twitter’s histogram is tightly bounded by the 140-character limit; blog and news entries follow a right-skewed, log-normal shape typical of natural writing.
Stop words (“the”, “and”, “to”) dominate in every source. “I” ranks 4th in blogs but 2nd on Twitter, reflecting Twitter’s first-person conversational style.
How many unique words are needed to account for most of the text?
| Coverage | Unique words needed | Notes |
|---|---|---|
| 50% | 127 | Core function words only |
| 90% | 6,694 | Good practical vocabulary |
| 95% | 15,387 | Covers almost all everyday text |
| 99% | ~78,000 | Includes rare/specialised terms |
This is Zipf’s Law at work: just 127 words account for half of all word occurrences. In practice, a vocabulary of roughly 10,000 words covers a little over 90% of everyday text (6,694 words reach 90%; 15,387 reach 95%), and the remaining words can be treated as unknown.
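The coverage figures come from sorting words by frequency and accumulating counts until the target share of all word occurrences is reached. A minimal sketch of that calculation, assuming word frequencies are held in a `collections.Counter` (the function name `words_needed` is hypothetical):

```python
from collections import Counter


def words_needed(counts, coverage):
    """Smallest number of most-frequent words whose counts reach `coverage` of all tokens."""
    total = sum(counts.values())
    if total == 0:
        return 0
    target = coverage * total
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= target:
            return rank
    return len(counts)


# Toy example: 10 tokens in total, "the" alone accounts for half of them.
toy = Counter({"the": 5, "a": 3, "cat": 1, "dog": 1})
print(words_needed(toy, 0.5))
```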
N-grams are sequences of consecutive words and are the building blocks of the prediction model. The table below shows the most common 2- and 3-word phrases.
| Rank | Top Bigrams | Top Trigrams |
|---|---|---|
| 1 | of the | one of the |
| 2 | in the | a lot of |
| 3 | to the | be able to |
| 4 | on the | i want to |
| 5 | to be | as well as |
| 6 | i have | the end of |
| 7 | it was | a couple of |
| 8 | a lot | going to be |
These recurring phrase patterns confirm that an N-gram model will find strong, reliable signals in this corpus.
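Frequency tables like the one above can be built by sliding a window of n words over each line. The following sketch assumes a simple lowercase, letters-and-apostrophes tokeniser, which is an illustrative choice and not necessarily the cleaning used for this report; `tokenize` and `ngram_counts` are hypothetical names.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(lines, n):
    """Count n-grams (as tuples of words) without crossing line boundaries."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

Calling `ngram_counts(lines, 2).most_common(8)` on the sampled lines would give a ranking comparable to the bigram column, subject to the tokenisation differences noted above.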
Prediction algorithm: A Stupid Back-off N-gram model (Brants et al., 2007) trained on a 30% sample of the corpus:
Why this approach? It is fast (< 5 ms per prediction), memory-efficient (the model fits in < 300 MB RAM), and straightforward to deploy.
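The scoring rule can be sketched as follows. If the full n-gram was seen in training, its score is its count divided by the count of its context; otherwise the context is shortened by one word and the score is multiplied by a fixed penalty. This is a toy illustration under stated assumptions: the class name `StupidBackoff` and its methods are hypothetical, the penalty of 0.4 is the value suggested by Brants et al. (2007) and is not confirmed by this report, and `predict` scores every vocabulary word, which is only practical for small corpora.

```python
from collections import Counter

ALPHA = 0.4  # assumption: back-off penalty suggested by Brants et al. (2007)


class StupidBackoff:
    def __init__(self, sentences, max_order=4):
        """`sentences` is an iterable of token lists; counts are kept for orders 1..max_order."""
        self.max_order = max_order
        self.counts = Counter()  # n-gram tuple -> frequency, all orders together
        self.vocab = set()
        self.total = 0           # total number of tokens seen
        for tokens in sentences:
            self.total += len(tokens)
            self.vocab.update(tokens)
            for n in range(1, max_order + 1):
                for i in range(len(tokens) - n + 1):
                    self.counts[tuple(tokens[i:i + n])] += 1

    def _trim(self, context):
        """Keep at most max_order - 1 trailing words of the context."""
        keep = self.max_order - 1
        return tuple(context)[-keep:] if keep else ()

    def score(self, word, context):
        """Relative frequency at the longest matching order, times ALPHA per back-off step."""
        context = self._trim(context)
        penalty = 1.0
        while context:
            gram = context + (word,)
            if self.counts[gram] > 0:
                return penalty * self.counts[gram] / self.counts[context]
            context = context[1:]  # drop the oldest word and back off
            penalty *= ALPHA
        if self.total == 0:
            return 0.0
        return penalty * self.counts[(word,)] / self.total

    def predict(self, context, k=3):
        """Return the k highest-scoring next words (ties broken alphabetically)."""
        scores = {w: self.score(w, context) for w in self.vocab}
        return sorted(scores, key=lambda w: (-scores[w], w))[:k]


model = StupidBackoff([
    ["i", "want", "to", "go"],
    ["i", "want", "to", "eat"],
    ["i", "want", "to", "go"],
    ["you", "want", "a", "go"],
])
print(model.predict(["i", "want", "to"], k=2))
```

The scores are not normalised probabilities, which is what makes the method cheap: no discounting or smoothing mass has to be computed. A production model would typically restrict candidates to words observed after the matched context and prune rare n-grams to stay within a memory budget.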
Shiny App features:
| Item | Finding |
|---|---|
| Corpus size | 558 MB, ~102 million words across 3 files |
| Largest source | Twitter by line count (2.36M lines); blogs by size and word count (201 MB, 37.3M words), with the longest lines on average |
| Key insight | 127 words = 50% coverage; 6,694 = 90% coverage |
| Profanity | Present in the data; will be filtered before training |
| Algorithm | Stupid Back-off over unigram–quadgram frequency tables |
Data source: HC Corpora (en_US), downloaded from the Coursera Capstone page.