This report presents an exploratory analysis of the HC Corpora English dataset, which forms the foundation of the Johns Hopkins / Coursera Data Science Capstone predictive-text project.
The corpus contains three files of raw, naturally-occurring English text:
| File | Domain | Style |
|---|---|---|
| en_US.blogs.txt | Personal blogs | Long-form, informal prose |
| en_US.news.txt | News articles | Formal, edited prose |
| en_US.twitter.txt | Tweets | Short, noisy, informal |
The goals of this report are to:
| Source | Lines | Total Words | Avg Words/Line | Max Words/Line | Unique Tokens |
|---|---|---|---|---|---|
| Blogs | 899,288 | 37,334,690 | 41.5 | 6,726 | 321,142 |
| News | 1,010,242 | 34,372,720 | 34.1 | 11,878 | 287,616 |
| Twitter | 2,360,148 | 30,218,180 | 12.8 | 140 | 441,956 |
Key observation: Twitter has the most lines (about 2.4 million) but the fewest words per line (12.8 on average), reflecting the 140-character limit. Blogs have the longest lines on average (41.5 words per line), although the single longest line is in the news file.
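The summary statistics above could be computed along the following lines. This is a minimal base-R sketch, not the report's actual code: the helper name `summarise_source()` is hypothetical, and tokenisation is assumed to be a plain whitespace split, so real counts would differ slightly from the table depending on the tokeniser used.

```r
# Summarise a character vector of lines: one row of counts per source.
# Assumption: words are separated by whitespace; no other cleaning is applied.
summarise_source <- function(lines, source_name) {
  tokens_per_line <- strsplit(trimws(lines), "\\s+")
  words_per_line <- lengths(tokens_per_line)
  all_tokens <- tolower(unlist(tokens_per_line))
  data.frame(
    source        = source_name,
    lines         = length(lines),
    total_words   = sum(words_per_line),
    avg_words     = round(mean(words_per_line), 1),
    max_words     = max(words_per_line),
    unique_tokens = length(unique(all_tokens)),
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for readLines("en_US.twitter.txt", encoding = "UTF-8")
toy <- c("Cant wait to see you", "happy new year", "see you")
toy_summary <- summarise_source(toy, "Toy")
print(toy_summary)
```

Binding one such row per file with `rbind()` would give a table of the same shape as the one above.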
How many unique word types are needed to cover 50%, 90%, and 99% of all tokens?
| Coverage | Unique word types needed | Total unique types |
|---|---|---|
| 50 % | 64 | 80,000 |
| 90 % | 14,067 | 80,000 |
| 99 % | 66,576 | 80,000 |
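Implication for modelling: About 14,000 word types cover 90% of running text, and about 66,600 are needed to reach 99%. Capping the vocabulary at roughly the top 50,000 types therefore keeps coverage well above 90% while dropping 30,000 of the 80,000 types, which should shrink the model considerably at little cost in coverage.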
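The coverage figures come from a cumulative sum over word frequencies sorted from most to least common. A minimal sketch of that calculation, assuming the corpus is already available as a character vector of tokens (the function name `types_for_coverage()` is illustrative only):

```r
# Number of most-frequent word types needed to reach a target share of tokens.
types_for_coverage <- function(tokens, target) {
  freq <- sort(table(tokens), decreasing = TRUE)
  cum_share <- cumsum(as.numeric(freq)) / length(tokens)
  # First rank at which the cumulative share meets the target
  which(cum_share >= target)[1]
}

# Toy corpus: 10 tokens, 4 types
toy_tokens <- c(rep("the", 5), rep("and", 3), "zipf", "tail")
coverage <- sapply(c(0.5, 0.75, 1.0),
                   function(p) types_for_coverage(toy_tokens, p))
print(coverage)
```

On the toy corpus, one type reaches 50%, two types reach 75%, and all four are needed for 100%, which mirrors on a small scale how the last few percent of coverage need most of the vocabulary.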
Figure 1 - Word-length distribution by source
Finding: All three sources peak at 3-character words (the, and, for). Twitter shows more very short tokens (1-2 chars) due to slang and abbreviations.
Figure 2 - Top 15 content words per source
Figure 3 - Words per line by source
Figure 4 - Cumulative vocabulary coverage
Word frequency in natural language follows Zipf’s Law: the \(n\)-th most common word appears roughly \(1/n\) times as often as the most common word.
Figure 5 - Zipf’s Law log-log frequency vs rank
The near-linear trend on the log-log plot is consistent with a power-law relationship. As the coverage table shows, just 64 word types account for 50% of all tokens, while the long tail of rare words inflates vocabulary size without adding much coverage.
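One way to quantify how closely a frequency table follows Zipf's Law is to regress log frequency on log rank and inspect the slope, which should be close to -1. The sketch below uses synthetic counts rather than the corpus itself, and `zipf_slope()` is a hypothetical helper name.

```r
# Estimate the Zipf exponent as the slope of log(frequency) on log(rank).
zipf_slope <- function(freq) {
  freq <- sort(as.numeric(freq), decreasing = TRUE)
  rank <- seq_along(freq)
  fit <- lm(log(freq) ~ log(rank))
  unname(coef(fit)[2])
}

# Synthetic counts that follow f(n) = C / n exactly, so the slope should be -1.
ideal_counts <- 1e6 / (1:1000)
slope <- zipf_slope(ideal_counts)
print(slope)
```

On real text the fit is usually looser at the very top and bottom ranks, so the estimated slope is best read as a rough diagnostic.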
| Feature | Blogs | News | Twitter |
|---|---|---|---|
| Avg line length | High (41 wpl) | Med (34 wpl) | Low (13 wpl) |
| Formal register | Moderate | High | Low |
| Named entities | Low | High | Medium |
| Slang / abbrev. | Low | Rare | High |
| URLs present | Rare | Rare | Common |
| Emoticons / emoji | Rare | Never | Common |
| Source | Top bigrams | Top trigrams |
|---|---|---|
| Blogs | happy new · last year · every day · first time · right now | new year eve · last couple days · first time ever |
| News | new york · last year · percent said · white house · prime minister | new york city · president barack obama · prime minister said |
| Twitter | right now · last night · so much · cant wait · happy birthday | cant wait see · happy new year · love you so |
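A frequency table of bigrams or trigrams like the one above can be produced with a few lines of base R. This is a sketch under simple assumptions (lower-casing, stripping everything except letters and spaces, no stop-word removal); `count_ngrams()` is a hypothetical helper, and the report's own pipeline may differ.

```r
# Count n-grams in a character vector of lines (base R only).
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    # Lower-case, drop everything except letters and spaces, then split
    clean <- gsub("[^a-z ]", "", tolower(line))
    words <- strsplit(trimws(clean), " +")[[1]]
    if (length(words) < n) return(character(0))
    # Build every run of n consecutive words
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

toy_tweets <- c("Can't wait to see you", "I can't wait", "cant wait!")
bigrams <- count_ngrams(toy_tweets, 2)
print(head(bigrams, 3))
```

Because apostrophes are stripped before splitting, "can't" becomes "cant", which matches the form of the Twitter entries in the table.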
| Noise type | Prevalence | Handling strategy |
|---|---|---|
| Profanity | Common on Twitter | Filter using a profanity lexicon |
| URLs & @mentions | Very common on Twitter | Remove with regex before tokenisation |
| Foreign-language fragments | Scattered | Language-ID filter (cld3) |
| Punctuation inside words | Blogs/news | tokens(remove_punct=TRUE) |
| Numeric strings | All sources | Remove or map to <NUM> token |
| Emoji & special chars | Twitter | Strip with iconv(x, "UTF-8", "ASCII", sub = "") |
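Several of the handling strategies in the table can be combined into one cleaning function. The sketch below is illustrative only: `clean_line()` and the one-word lexicon are hypothetical, the regular expressions are deliberately simple, and language-ID filtering is left out. Note that `iconv()` with `sub = ""` drops non-ASCII bytes, whereas `sub = "byte"` would replace them with hex codes such as `<f0>`.

```r
# Clean a single line of raw text before tokenisation.
# `lexicon` is a placeholder for a real profanity word list.
clean_line <- function(x, lexicon = character(0)) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")       # drop emoji / non-ASCII
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)    # URLs
  x <- gsub("@\\w+", " ", x, perl = TRUE)                     # @mentions
  x <- gsub("\\b\\d+([.,]\\d+)*\\b", "<NUM>", x, perl = TRUE) # numbers -> <NUM>
  words <- strsplit(trimws(x), "\\s+", perl = TRUE)[[1]]
  paste(words[!words %in% lexicon], collapse = " ")           # lexicon filter
}

raw <- "Check https://t.co/abc @bob I won 3 games \U0001F600 darn"
cleaned <- clean_line(raw, lexicon = "darn")
print(cleaned)
```

Lower-casing happens before the number substitution so that the `<NUM>` placeholder keeps its upper-case form.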
The predictive-text system will use a Stupid Backoff n-gram language model (Brants et al., 2007). It is fast and memory-efficient, and it suits large-vocabulary tasks because it ranks candidates by relative-frequency scores rather than normalised probabilities.
Step 1 - Build n-gram frequency tables. Tokenise the corpus (lower-case, remove punctuation) and count every sequence of 2, 3, 4, and 5 consecutive words. Store as compressed lookup tables.
Step 2 - Score candidates with back-off. Given the most recent typed words (up to four), look up matching 5-grams first. If none are found, back off to 4-grams, then 3-grams, then 2-grams. Each back-off step multiplies the score by \(\lambda = 0.4\):
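\[ S(w_i \mid w_{i-k+1}^{i-1}) = \begin{cases} \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[6pt] \lambda \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise} \end{cases} \]
Here \(f(\cdot)\) is an n-gram count in the training corpus and \(\lambda = 0.4\). The recursion ends at the unigram score \(S(w_i) = f(w_i)/N\), where \(N\) is the total number of tokens.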
Step 3 - Return top-3 predictions. Sort candidates by score descending and return the top 3 as button suggestions.
Step 4 - Handle unknown words. Map unseen words to <UNK>. If the context is completely unseen, fall back to the 50 most frequent unigrams as safe defaults.
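Steps 1 to 3 can be sketched in base R as follows. This is a toy illustration rather than the planned implementation: the function names (`tokenise()`, `build_counts()`, `predict_next()`) are hypothetical, counts are stored as plain named vectors instead of compressed tables, the maximum order is 3 instead of 5 to keep the example small, and explicit `<UNK>` mapping is omitted (an unseen context simply backs off to unigrams).

```r
# Lower-case, keep letters and spaces only, split on spaces.
tokenise <- function(line) {
  words <- strsplit(trimws(gsub("[^a-z ]", "", tolower(line))), " +")[[1]]
  words[nzchar(words)]
}

# Step 1: counts[[n]] is a named integer vector, n-gram string -> frequency.
build_counts <- function(lines, max_n = 3) {
  toks <- lapply(lines, tokenise)
  lapply(seq_len(max_n), function(n) {
    grams <- unlist(lapply(toks, function(w) {
      if (length(w) < n) return(character(0))
      vapply(seq_len(length(w) - n + 1),
             function(i) paste(w[i:(i + n - 1)], collapse = " "),
             character(1))
    }))
    tab <- table(grams)
    setNames(as.integer(tab), names(tab))
  })
}

# Steps 2 and 3: score candidates with back-off and return the top k.
predict_next <- function(counts, context, k = 3, lambda = 0.4) {
  ctx <- tail(tokenise(context), length(counts) - 1)
  scores <- numeric(0)
  penalty <- 1
  repeat {
    if (length(ctx) == 0) {
      # Unigram base case: relative frequency of each word
      cand <- penalty * counts[[1]] / sum(counts[[1]])
    } else {
      n <- length(ctx) + 1
      prefix <- paste(ctx, collapse = " ")
      grams <- counts[[n]]
      hits <- grams[startsWith(names(grams), paste0(prefix, " "))]
      # Score = f(context + word) / f(context), times accumulated lambda
      cand <- penalty * hits / counts[[n - 1]][prefix]
      names(cand) <- sub(".* ", "", names(hits))
    }
    # Keep only the highest-order score for each candidate word
    scores <- c(scores, cand[!names(cand) %in% names(scores)])
    if (length(ctx) == 0) break
    ctx <- ctx[-1]
    penalty <- penalty * lambda
  }
  head(sort(scores, decreasing = TRUE), k)
}

toy_lines <- c("i cant wait to see you", "i cant wait for friday",
               "cant wait to go", "happy new year")
counts <- build_counts(toy_lines, max_n = 3)
top3 <- predict_next(counts, "I cant wait")
print(top3)
```

On the toy corpus, "to" scores 2/3 and "for" scores 1/3 from the trigram table, and the third suggestion comes from the unigram table with a penalty of 0.4 squared.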
Figure 6 - Shiny app conceptual wireframe
| Panel | Purpose | Key UI elements |
|---|---|---|
| 1. Input | Capture typing | textInput(), character counter, clear button |
| 2. Predictions | Show suggestions | Three actionButton() word chips |
| 3. Settings | Tune the model | sliderInput() for n-gram order and vocab size |
| 4. Statistics | Debug info | Latency, model version, token count |
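The four panels could be wired together roughly as shown below. This is a minimal sketch that assumes the shiny package is installed; `predict_stub()` is a hypothetical placeholder for the real model, the input IDs are illustrative, and the statistics panel shows only a token count rather than latency or model version.

```r
library(shiny)

# Hypothetical stand-in for the real model: always returns three words.
predict_stub <- function(text, n_order, vocab_size) c("the", "and", "for")

ui <- fluidPage(
  titlePanel("Next-word prediction (sketch)"),
  sidebarLayout(
    sidebarPanel(
      # Panel 3: settings
      sliderInput("n_order", "Max n-gram order", min = 2, max = 5,
                  value = 5, step = 1),
      sliderInput("vocab", "Vocabulary size", min = 10000, max = 80000,
                  value = 50000, step = 10000),
      # Panel 4: statistics
      verbatimTextOutput("stats")
    ),
    mainPanel(
      # Panel 1: input
      textInput("text", "Type a phrase", value = ""),
      textOutput("n_chars"),
      actionButton("clear", "Clear"),
      # Panel 2: predictions
      actionButton("chip1", "..."),
      actionButton("chip2", "..."),
      actionButton("chip3", "...")
    )
  )
)

server <- function(input, output, session) {
  preds <- reactive(predict_stub(input$text, input$n_order, input$vocab))

  output$n_chars <- renderText(paste(nchar(input$text), "characters"))
  output$stats <- renderText(
    paste("Tokens typed:", lengths(strsplit(trimws(input$text), "\\s+")))
  )
  observeEvent(input$clear, updateTextInput(session, "text", value = ""))

  # Relabel the three chips whenever the predictions change
  observe({
    p <- preds()
    for (i in 1:3) updateActionButton(session, paste0("chip", i), label = p[i])
  })

  # Clicking a chip appends its word to the input box
  lapply(1:3, function(i) {
    observeEvent(input[[paste0("chip", i)]], {
      updateTextInput(session, "text",
                      value = trimws(paste(input$text, preds()[i])))
    })
  })
}

if (interactive()) shinyApp(ui, server)
```

Using three fixed buttons and relabelling them keeps the observers simple; swapping `predict_stub()` for a real scoring function would not change the UI code.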
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 scales_1.4.0 tibble_3.3.1 tidyr_1.3.2
## [5] stringr_1.6.0 ggplot2_4.0.2 dplyr_1.2.1
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-4 gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.3
## [5] tidyselect_1.2.1 xml2_1.5.2 jquerylib_0.1.4 splines_4.5.3
## [9] textshaping_1.0.5 systemfonts_1.3.2 yaml_2.3.12 fastmap_1.2.0
## [13] lattice_0.22-9 R6_2.6.1 labeling_0.4.3 generics_0.1.4
## [17] knitr_1.51 svglite_2.2.2 bslib_0.10.0 pillar_1.11.1
## [21] RColorBrewer_1.1-3 rlang_1.1.7 cachem_1.1.0 stringi_1.8.7
## [25] xfun_0.57 sass_0.4.10 S7_0.2.1 viridisLite_0.4.3
## [29] cli_3.6.5 mgcv_1.9-4 withr_3.0.2 magrittr_2.0.4
## [33] digest_0.6.39 grid_4.5.3 rstudioapi_0.18.0 nlme_3.1-168
## [37] lifecycle_1.0.5 vctrs_0.7.2 evaluate_1.0.5 glue_1.8.0
## [41] farver_2.1.2 rmarkdown_2.31 purrr_1.2.1 tools_4.5.3
## [45] pkgconfig_2.0.3 htmltools_0.5.9
Report generated with R 4.5.3 on 2026-04-04