Submitted by: bkamra56
Date: June 2026
Dataset: SwiftKey / HC Corpora — English (en_US)
This report presents an exploratory data analysis (EDA) of the English-language text data provided for the Johns Hopkins / Coursera Data Science Capstone. The dataset consists of three large plain-text corpora sourced from blogs, news articles, and Twitter. The ultimate goal of the capstone is to build a predictive text model — similar to the keyboard word-suggestion feature on a smartphone — that predicts the next word a user is likely to type based on the preceding word(s).
This milestone covers: data ingestion, summary statistics, tokenisation, n-gram frequency analysis, vocabulary coverage analysis, and a plan for building the predictive model.
The corpus files were provided by HC Corpora and filtered to
English (en_US). Each file contains one document (blog post, news
article, or tweet) per line, separated by \r\n.
| Source | Lines (rows) | File Size | Avg Words/Line | Avg Chars/Line |
|---|---|---|---|---|
| Blogs | 899,288 | ~200 MB | 42.3 | 228.7 |
| News | 1,010,242 | ~196 MB | 35.5 | 202.3 |
| Twitter | 2,360,148 | ~159 MB | 13.1 | 68.5 |
| Total | 4,269,678 | ~555 MB | — | — |
| Source | Estimated Words |
|---|---|
| Blogs | ~37,300,000 |
| News | ~34,121,000 |
| Twitter | ~29,544,000 |
| Total | ~101,000,000 |
Note: Counts are extrapolated from 100,000-line samples via proportional scaling and validated against file size.
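One plausible reading of "proportional scaling" is to take the mean words per line in the sample and multiply it by the file's known line count. The sketch below illustrates that reading only; the function name and the toy lines are assumptions, not the code or data behind the table.

```python
def estimate_total_words(sample_lines, total_line_count):
    """Scale the sample's mean words-per-line up to the whole file."""
    if not sample_lines:
        return 0
    sample_words = sum(len(line.split()) for line in sample_lines)
    mean_words_per_line = sample_words / len(sample_lines)
    return round(mean_words_per_line * total_line_count)


# Toy illustration only: these lines are made up, not corpus data.
toy_sample = ["the quick brown fox", "hello world", "a b c"]
estimate = estimate_total_words(toy_sample, total_line_count=300)
```

Whitespace splitting is the simplest possible word definition; a stricter tokeniser would shift the estimate somewhat.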
Given the large file sizes (555 MB total), loading all three files entirely into memory at once would be impractical during model development. A stratified random sampling approach is recommended:
For this EDA, a representative sample of 100,000 lines per source was used (≈4–11% of each file).
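Drawing the same number of lines from each source separately is what stratifies the sample by source. A minimal sketch of one way to draw such a sample without loading a whole file is reservoir sampling; the report does not say which sampling method was actually used, and the function name, seed, and file name in the comment are assumptions.

```python
import random


def sample_lines(lines, k, seed=42):
    """Reservoir sampling (Algorithm R): uniform sample of k items in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = line     # replace with probability k / (i + 1)
    return reservoir


# Hypothetical usage (the file name is an assumption):
# with open("en_US.twitter.txt", encoding="utf-8", errors="ignore") as f:
#     twitter_sample = sample_lines(f, 100_000)
```

Because the file is consumed as a stream, memory use is bounded by the sample size rather than the file size.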
| Source | Unique Words (100k-line sample) |
|---|---|
| Blogs | 90,870 |
| News | 86,311 |
| Twitter | 53,820 |
Twitter has the smallest vocabulary of the three equal-sized samples. Part of the gap is simply volume, since a tweet averages 13.1 words against 35–42 for the other sources, and part is consistent with the character-limited format promoting abbreviated, repetitive language. Blogs have the richest vocabulary, likely because authors write longer, more varied prose.
Blogs:
| Rank | Word | Count |
|---|---|---|
| 1 | the | 205,842 |
| 2 | and | 120,772 |
| 3 | to | 118,300 |
| 4 | a | 100,161 |
| 5 | of | 96,755 |
| 6 | i | 92,911 |
| 7 | in | 66,200 |
| 8 | that | 52,145 |
| 9 | it | 49,223 |
| 10 | is | 48,165 |
News:
| Rank | Word | Count |
|---|---|---|
| 1 | the | 195,700 |
| 2 | to | 89,963 |
| 3 | and | 88,584 |
| 4 | a | 88,556 |
| 5 | of | 76,488 |
| 6 | in | 67,057 |
| 7 | for | 35,032 |
| 8 | that | 34,470 |
| 9 | is | 28,317 |
| 10 | said | 24,777 |
Twitter:
| Rank | Word | Count |
|---|---|---|
| 1 | the | 39,735 |
| 2 | to | 33,110 |
| 3 | i | 30,750 |
| 4 | a | 26,006 |
| 5 | you | 23,270 |
| 6 | and | 18,542 |
| 7 | for | 16,502 |
| 8 | in | 16,181 |
| 9 | is | 15,461 |
| 10 | of | 15,160 |
Key observation: Twitter places “I” and “you” much higher — reflecting its conversational, first-person nature. News uniquely features “said” in the top 10, reflecting attribution-heavy journalism.
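Counts like those in the tables above can be produced with a frequency counter over lowercased tokens. This is a sketch under an assumed tokenisation rule (lowercase, runs of letters and apostrophes); the report does not state the exact rule it used, so real counts may differ.

```python
import re
from collections import Counter

# Assumed token definition: runs of ASCII letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(line):
    """Lowercase a line and split it into word tokens."""
    return TOKEN_RE.findall(line.lower())


def word_counts(lines):
    """Count every token across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


# Toy lines, not corpus data.
toy = ["I love the park", "The park is big", "i said the park"]
counts = word_counts(toy)
unique_words = len(counts)          # vocabulary size
top3 = counts.most_common(3)        # highest-frequency words first
```

The same `Counter` gives both the unique-word totals and the ranked top-10 lists.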
The corpus exhibits a classic Zipfian distribution — a small set of words accounts for the vast majority of tokens:
| Source | Words needed for 50% coverage | Words needed for 90% coverage |
|---|---|---|
| Blogs | 104 | 6,065 |
| News | 191 | 7,604 |
| Twitter | 123 | 4,923 |
This means that the top 5,000–8,000 words, depending on the source, are sufficient to cover 90% of all tokens in the samples. This is critical for model design: a vocabulary capped at ~10,000–15,000 words stays above 90% coverage while keeping the model memory-efficient.
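The coverage figures can be read as "walk down the frequency ranking and stop once the running total reaches the target share of tokens". A minimal sketch of that calculation, with a hypothetical function name and made-up counts:

```python
from collections import Counter


def words_for_coverage(counts, target):
    """Smallest number of top-ranked words whose counts reach `target` share of tokens."""
    total = sum(counts.values())
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= target * total:
            return rank
    return len(counts)


# Toy counts, not corpus data.
toy_counts = Counter({"the": 50, "to": 20, "a": 15, "of": 10, "zebra": 5})
n50 = words_for_coverage(toy_counts, 0.5)
n90 = words_for_coverage(toy_counts, 0.9)
```

In this toy case one word already covers half the tokens, while four are needed for 90%, which mirrors the long-tail shape described above.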
| Words per line | Blogs | News | Twitter |
|---|---|---|---|
| 1–5 | 14.7% | 5.8% | 16.8% |
| 6–15 | 21.6% | 13.0% | 46.4% |
| 16–30 | 15.0% | 27.4% | 36.5% |
| 31–60 | 23.3% | 42.0% | 0.3% |
| 60+ | 25.3% | 11.7% | 0.0% |
Twitter is heavily concentrated in the 6–30 word range (Twitter’s character limit enforces this). News articles tend toward 31–60 word lines (well-formed sentences). Blogs are the most spread, with 25% of lines exceeding 60 words (multi-sentence paragraphs stored as single lines).
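A distribution like this can be computed by binning each line's word count. The sketch below assumes whitespace-delimited words and treats the last bin as "more than 60 words"; the table's "31–60" and "60+" labels leave the boundary at exactly 60 ambiguous, so that choice is an assumption.

```python
from collections import Counter

# Closed bins taken from the table; anything above 60 falls into "60+".
BINS = [(1, 5), (6, 15), (16, 30), (31, 60)]


def length_bin(n_words):
    """Return the label of the bin that a word count falls into."""
    for lo, hi in BINS:
        if lo <= n_words <= hi:
            return f"{lo}-{hi}"
    return "60+" if n_words > 60 else "0"


def length_distribution(lines):
    """Percentage of lines per bin."""
    tally = Counter(length_bin(len(line.split())) for line in lines)
    total = sum(tally.values())
    if total == 0:
        return {}
    return {label: 100 * n / total for label, n in tally.items()}


# Toy lines, not corpus data.
toy_lines = ["a b c", "a b c d e f", " ".join(["w"] * 61), "x"]
dist = length_distribution(toy_lines)
```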
Blogs (most common two-word sequences):
| Bigram | Count |
|---|---|
| of the | 10,385 |
| in the | 8,556 |
| to the | 4,893 |
| on the | 4,261 |
| to be | 3,764 |
Twitter (most common two-word sequences):
| Bigram | Count |
|---|---|
| in the | 1,728 |
| for the | 1,537 |
| of the | 1,223 |
| thanks for | 898 |
| i love | 753 |
“Thanks for” and “i love” appearing in Twitter’s top bigrams but not blogs highlights the platform’s social interaction patterns.
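Bigram counts of this kind come from pairing each token with its successor within a line. A minimal sketch, assuming lowercase whitespace tokenisation (the report's exact tokeniser is not stated):

```python
from collections import Counter


def bigram_counts(lines):
    """Count adjacent word pairs; pairs never span two lines."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Toy lines, not corpus data.
toy_tweets = ["thanks for the follow", "Thanks for the rt", "in the end"]
bigrams = bigram_counts(toy_tweets)
top2 = bigrams.most_common(2)
```

Keeping pairs inside a single line avoids inventing bigrams that join the end of one document to the start of the next.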
From the Twitter corpus (100,000-line sample):
- Hashtags: 11,134 occurrences (~0.11 per tweet)
- Mentions (@user): Stripped by the data provider in this release
- URLs: Stripped by the data provider in this release
Hashtags are a meaningful linguistic signal and should be retained
(possibly as a special token <HASHTAG>) in the
model.
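Both the hashtag count and the special-token option can be sketched with one regular expression. The pattern below (a `#` followed by word characters) is an assumption about what counts as a hashtag, and the function names are hypothetical.

```python
import re

# Assumed hashtag shape: '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r"#\w+")


def count_hashtags(lines):
    """Total number of hashtag occurrences across all lines."""
    return sum(len(HASHTAG_RE.findall(line)) for line in lines)


def replace_hashtags(line, token="<HASHTAG>"):
    """Swap every hashtag for a single placeholder token."""
    return HASHTAG_RE.sub(token, line)


# Toy tweets, not corpus data.
toy = ["love this #nlp #data", "no tags here"]
n_tags = count_hashtags(toy)
masked = replace_hashtags("so true #monday")
```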
Before building the predictive model, the following cleaning steps are necessary:
- Mark sentence boundaries with <EOS> markers for proper n-gram context.
- Replace numbers with <NUM>.
- Hashtags: strip the # symbol (treat the word normally) or replace with <HASHTAG>.
- Replace rare words with <UNK> (unknown) to control vocabulary size and avoid overfitting sparse counts.

The core of the predictive keyboard is an n-gram model:
Higher-order n-grams are more accurate but suffer from the data sparsity problem — many long sequences never appear in training data.
To handle unseen n-grams, Katz Back-off or Stupid Backoff will be applied:
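As a rough sketch of the second option, Stupid Backoff scores a candidate word by its relative frequency after the longest context that has been seen, multiplying by a fixed factor each time the context is shortened. The factor 0.4 is the value suggested in the original Stupid Backoff paper (Brants et al., 2007); the function names and toy sentences are assumptions, not the report's implementation.

```python
from collections import Counter

ALPHA = 0.4  # back-off factor suggested by Brants et al. (2007)


def build_ngram_counts(sentences, max_n=3):
    """Count all n-grams of order 1..max_n; keys are tuples of tokens."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total_tokens, alpha=ALPHA):
    """Score (not a normalised probability) of `word` following `context`."""
    context = tuple(context)
    penalty = 1.0
    while context:
        full = counts.get(context + (word,), 0)
        if full > 0:
            return penalty * full / counts[context]
        context = context[1:]   # drop the oldest word and back off
        penalty *= alpha
    return penalty * counts.get((word,), 0) / total_tokens


# Toy training data, not corpus data.
toy_sentences = [["i", "love", "you"], ["i", "love", "it"], ["we", "love", "you"]]
counts = build_ngram_counts(toy_sentences)
total = sum(len(s) for s in toy_sentences)

seen = stupid_backoff("you", ["i", "love"], counts, total)        # trigram seen
backed = stupid_backoff("you", ["they", "love"], counts, total)   # falls to bigram
unigram = stupid_backoff("it", ["so", "much"], counts, total)     # falls to unigram
```

Katz back-off would additionally discount the seen counts and redistribute the held-back mass so that scores sum to one; Stupid Backoff skips that step, which makes it cheaper but leaves the scores unnormalised.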
Based on the coverage analysis:
- Keep the top ~15,000 words (covers >90% of tokens in all three sources).
- Replace everything else with <UNK>.
- This keeps the n-gram tables small enough for a Shiny app.
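The pruning step can be sketched as two small helpers: one picks the most frequent words, the other maps everything else to the unknown token. The function names are hypothetical and the counts are made up for illustration.

```python
from collections import Counter


def build_vocab(counts, max_size):
    """Keep the `max_size` most frequent words."""
    return {word for word, _ in counts.most_common(max_size)}


def apply_vocab(tokens, vocab, unk="<UNK>"):
    """Replace out-of-vocabulary tokens with the unknown marker."""
    return [t if t in vocab else unk for t in tokens]


# Toy counts, not corpus data; the report proposes max_size of about 15,000.
toy_counts = Counter({"the": 5, "cat": 3, "sat": 2, "xylophone": 1})
vocab = build_vocab(toy_counts, max_size=3)
pruned = apply_vocab(["the", "xylophone", "sat"], vocab)
```

Applying the mapping before counting n-grams is what shrinks the tables, since every rare word collapses into the same token.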
The model will be evaluated using perplexity on a held-out 10% test set:
\[PP(W) = P(w_1, w_2, \ldots, w_N)^{-1/N}\]
Lower perplexity = better prediction.
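In practice the joint probability is a product of per-word conditional probabilities, so perplexity is usually computed in log space to avoid underflow. A minimal sketch, assuming the model supplies a non-zero probability for every test word (Stupid Backoff scores are not normalised, so values computed from them are only a relative measure):

```python
import math


def perplexity(word_probs):
    """Perplexity from the conditional probability assigned to each test word."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Toy probabilities, not model output.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
mixed = perplexity([0.5, 0.125])
```

A model that always assigns 1/4 to the correct word has perplexity 4, which matches the intuition of choosing among four equally likely options.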
The final model will be deployed as a Shiny web application that:
1. Accepts a partial sentence from the user.
2. Tokenises the input.
3. Looks up the last 1–3 words in the n-gram table.
4. Returns the top 3 predicted next words with their probabilities.
| Question | Answer |
|---|---|
| How many lines in the en_US Twitter file? | 2,360,148 |
| How many lines in the en_US Blogs file? | 899,288 |
| How many lines in the en_US News file? | 1,010,242 |
| What is the longest line in any of the files? | News file has a line with 1,522 words; Blogs has one with 1,202 words |
| In the en_US Twitter data, if you divide the number of lines where the word “love” (all lowercase) appears by the total number of lines: | See analysis below* |
| The number of words that appear in at least 10% of all lines (in any file) | Only stop words like “the”, “to”, “and”, “a” approach this threshold |
| How many unique words are needed to cover 50% of all word instances? | ~104–191 (source-dependent; see Section 4.3) |
| How many unique words cover 90% of all instances? | ~4,923–7,604 (source-dependent; see Section 4.3) |
*See Section 8 below.
All numbers below are computed from full file scans (not samples).
| Quiz Question | Answer |
|---|---|
| Number of lines in en_US Twitter file | 2,360,148 |
| Number of lines in en_US Blogs file | 899,288 |
| Number of lines in en_US News file | 1,010,242 |
| Longest line in Blogs (words) | 6,630 words |
| Longest line in News (words) | 1,792 words |
| Longest line in Twitter (words) | 47 words (Twitter’s character limit) |
| Lines in Twitter containing the word “love” | 100,477 |
| Total Twitter lines | 2,360,148 |
| Ratio of “love” lines to total Twitter lines | ≈ 0.0426 (4.26%) |
| Lines in Twitter containing the word “hate” | 17,703 |
| Ratio of love-lines to hate-lines | ≈ 5.68× (love is ~5.7× more common) |
| Lines in Twitter with “biostats” | 1 |
| Lines in Twitter with exact sentence: “A computer once beat me at chess, but it was no match for me at kickboxing” | 3 |
| Occurrences of “the” in Blogs | 1,860,614 |
| Occurrences of “the” in News | 1,974,500 |
| Occurrences of “the” in Twitter | 937,810 |
| Unique words needed to cover 50% of Blogs tokens | ~104 |
| Unique words needed to cover 90% of Blogs tokens | ~6,065 |
| Unique words needed to cover 50% of Twitter tokens | ~123 |
| Unique words needed to cover 90% of Twitter tokens | ~4,923 |
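Several of the quiz figures can be gathered in a single pass over a file. The sketch below assumes case-sensitive substring matching for "love" and "hate" and whitespace-delimited words; the report does not state its matching rule, so these are assumptions. The two ratios at the end simply redo the arithmetic on the counts reported in the table.

```python
def line_stats(lines):
    """One-pass tally of line count, longest line (in words), and keyword lines."""
    total = love = hate = longest = 0
    for line in lines:
        total += 1
        love += "love" in line          # lowercase substring match
        hate += "hate" in line
        longest = max(longest, len(line.split()))
    return {"total": total, "love": love, "hate": hate, "longest": longest}


# Toy lines, not corpus data.
toy = ["i love this", "i hate mondays", "love love love", "Lovely day"]
stats = line_stats(toy)

# Arithmetic check on the counts reported in the table.
love_share = round(100_477 / 2_360_148, 4)
love_to_hate = round(100_477 / 17_703, 2)
```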
The three en_US corpora provide a rich, diverse dataset for building a predictive text model. Key takeaways:
Next Steps:
1. Preprocess full corpus (clean, tokenise, sentence-split).
2. Build unigram, bigram, trigram, and quadgram frequency tables.
3. Apply Stupid Backoff smoothing.
4. Evaluate perplexity on 10% held-out test set.
5. Build and deploy Shiny prediction app.
Report prepared using R-compatible Python analysis of the HC Corpora en_US dataset. All counts are derived from full corpus scans unless otherwise noted.