1 Introduction

This report presents an exploratory analysis of the HC Corpora English dataset, which forms the foundation of the Johns Hopkins / Coursera Data Science Capstone predictive-text project.

The corpus contains three files of raw, naturally occurring English text:

| File | Domain | Style |
|---|---|---|
| en_US.blogs.txt | Personal blogs | Long-form, informal prose |
| en_US.news.txt | News articles | Formal, edited prose |
| en_US.twitter.txt | Tweets | Short, noisy, informal |

The goals of this report are to:

  1. Summarise corpus size (lines, words, vocabulary).
  2. Visualise word-length distributions, top unigrams, line-length patterns, and vocabulary coverage.
  3. Highlight interesting linguistic findings.
  4. Outline the algorithm and Shiny app to be built in later milestones.

2 Data Summaries

2.1 Line & word counts

Table 1 - Corpus size by source

| Source | Lines | Total Words | Avg Words/Line | Max Words/Line | Unique Tokens |
|---|---|---|---|---|---|
| Blogs | 899,288 | 37,334,690 | 41.5 | 6,726 | 321,142 |
| News | 1,010,242 | 34,372,720 | 34.1 | 11,878 | 287,616 |
| Twitter | 2,360,148 | 30,218,180 | 12.8 | 140 | 441,956 |

Key observation: Twitter has the most lines (approx. 2.4 M) but the fewest words per line (avg approx. 12.8), consistent with the 140-character limit. Blog lines are the longest on average (approx. 41.5 words/line), although the single longest line (11,878 words) is in the news file.
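The statistics in Table 1 can be computed along the following lines. This is a minimal base-R sketch on toy data, not the report's actual pipeline: the `tokenise()` and `summarise_corpus()` helpers and the letters-and-apostrophes tokenisation rule are assumptions, and a different tokeniser would give somewhat different counts.

```r
# Toy stand-in for one corpus file. For the real data one would read e.g.
# readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE).
toy_lines <- c(
  "The quick brown fox jumps over the lazy dog",
  "the dog barks",
  "A fox"
)

# Lower-case each line and split it on runs of characters that are not
# letters or apostrophes (an assumed, deliberately simple rule).
tokenise <- function(lines) {
  tokens <- strsplit(tolower(lines), "[^a-z']+")
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

# One row of summary statistics in the shape of Table 1.
summarise_corpus <- function(lines, source) {
  tokens <- tokenise(lines)
  words_per_line <- lengths(tokens)
  data.frame(
    source             = source,
    lines              = length(lines),
    total_words        = sum(words_per_line),
    avg_words_per_line = round(mean(words_per_line), 1),
    max_words_per_line = max(words_per_line),
    unique_tokens      = length(unique(unlist(tokens)))
  )
}

toy_summary <- summarise_corpus(toy_lines, "toy")
print(toy_summary)
```

Calling `summarise_corpus()` once per file and combining the rows with `rbind()` would give a table with the same columns as Table 1.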

2.2 Vocabulary coverage

How many unique word types are needed to cover 50%, 90%, and 99% of all tokens?

Table 2 - Vocabulary coverage thresholds

| Coverage | Unique word types needed | Total unique types |
|---|---|---|
| 50 % | 64 | 80,000 |
| 90 % | 14,067 | 80,000 |
| 99 % | 66,576 | 80,000 |

Implication for modelling: By Table 2, about 14,000 word types already cover 90% of running text and about 66,600 cover 99%, so a dictionary capped at the top 50,000 types covers well over 90%. Capping the vocabulary there substantially reduces model size at a small cost in coverage.
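Thresholds like those in Table 2 come from sorting word types by frequency and accumulating their share of all tokens. A minimal sketch on a toy token vector follows; the `types_for_coverage()` helper is hypothetical and the numbers are illustrative, not the corpus figures.

```r
# Number of most-frequent word types needed to reach a target share of tokens.
types_for_coverage <- function(tokens, target) {
  freq <- sort(table(tokens), decreasing = TRUE)
  cum_share <- cumsum(as.numeric(freq)) / length(tokens)
  which(cum_share >= target)[1]
}

# Toy data: 10 tokens over 4 types with a skewed distribution.
toy_tokens <- c(rep("the", 5), rep("of", 3), "fox", "dog")

targets <- c(0.5, 0.9, 0.99)
coverage_table <- data.frame(
  coverage     = targets,
  types_needed = vapply(targets, types_for_coverage, integer(1),
                        tokens = toy_tokens)
)
print(coverage_table)
```

On the toy data a single type ("the") already covers half of the tokens, which mirrors the steep head of the real distribution.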


3 Plots & Tables

3.1 Word-length distribution

Figure 1 - Word-length distribution by source

Finding: All three sources peak at 3-character words (the, and, for). Twitter shows more very short tokens (1-2 chars) due to slang and abbreviations.

3.2 Top unigrams (stop-words removed)

Figure 2 - Top 15 content words per source

3.3 Line-length distribution

Figure 3 - Words per line by source

3.4 Vocabulary coverage curve

Figure 4 - Cumulative vocabulary coverage


4 Interesting Findings

4.1 Zipf’s Law

Word frequency in natural language follows Zipf’s Law: the \(n\)-th most common word appears roughly \(1/n\) times as often as the most common word.

Figure 5 - Zipf's Law log-log frequency vs rank

The near-straight line on the log-log plot is consistent with a power-law relationship. By Table 2, the 64 most frequent word types account for 50% of all tokens, while the long tail of rare words inflates vocabulary size without contributing much coverage.
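One way to quantify the straight-line impression is to regress log frequency on log rank; under Zipf's Law the slope should be close to -1. The sketch below uses synthetic tokens built to follow a 1/rank pattern, so it only illustrates the method. The `zipf_slope()` helper is an assumption and says nothing about the slope actually observed in the corpus.

```r
# Slope of log10(frequency) against log10(rank) for a token vector.
zipf_slope <- function(tokens) {
  freq <- sort(as.numeric(table(tokens)), decreasing = TRUE)
  rnk  <- seq_along(freq)
  fit  <- lm(log10(freq) ~ log10(rnk))
  unname(coef(fit)[2])
}

# Synthetic corpus: word w1 appears 600 times, w2 300 times, w3 200 times, ...
toy_tokens <- rep(paste0("w", 1:50), times = round(600 / (1:50)))

slope <- zipf_slope(toy_tokens)
print(slope)  # close to -1 by construction
```

On real text the fit is usually poorer at the very top and in the tail of the ranking, so restricting the regression to a middle band of ranks is a common refinement.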

4.2 Source-style divergence

Table 3 - Style characteristics by source (wpl = words per line)

| Feature | Blogs | News | Twitter |
|---|---|---|---|
| Avg line length | High (41 wpl) | Med (34 wpl) | Low (13 wpl) |
| Formal register | Moderate | High | Low |
| Named entities | Low | High | Medium |
| Slang / abbrev. | Low | Rare | High |
| URLs present | Rare | Rare | Common |
| Emoticons / emoji | Rare | Never | Common |

4.3 Bigram & trigram examples

Table 4 - Top bigrams and trigrams per source (stop-words removed)

| Source | Top bigrams | Top trigrams |
|---|---|---|
| Blogs | happy new · last year · every day · first time · right now | new year eve · last couple days · first time ever |
| News | new york · last year · percent said · white house · prime minister | new york city · president barack obama · prime minister said |
| Twitter | right now · last night · so much · cant wait · happy birthday | cant wait see · happy new year · love you so |
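A table like this can be produced by dropping stop-words from each line and then counting adjacent token sequences. The following base-R sketch shows the idea; the `content_ngrams()` helper, the tiny stop-word list, and the toy lines are all assumptions for illustration.

```r
# Tiny illustrative stop-word list (an assumption, not the list used here).
stop_words <- c("the", "a", "of", "in", "to", "and", "is")

# n-grams of one line after lower-casing and removing stop-words.
content_ngrams <- function(line, n, stop_words) {
  tok <- strsplit(tolower(line), "[^a-z']+")[[1]]
  tok <- tok[nzchar(tok) & !(tok %in% stop_words)]
  if (length(tok) < n) return(character(0))
  starts <- seq_len(length(tok) - n + 1)
  vapply(starts,
         function(i) paste(tok[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_lines <- c("Happy new year to the city of New York",
               "A happy new year in New York")

bigrams <- unlist(lapply(toy_lines, content_ngrams,
                         n = 2, stop_words = stop_words))
top_bigrams <- sort(table(bigrams), decreasing = TRUE)
print(top_bigrams)
```

Because stop-words are removed before the n-grams are formed, words that were not adjacent in the original text can end up joined ("year city" above). That may be how entries such as "cant wait see" arise in Table 4.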

4.4 Data quality notes

| Noise type | Prevalence | Handling strategy |
|---|---|---|
| Profanity | Common on Twitter | Filter using a profanity lexicon |
| URLs & @mentions | Very common on Twitter | Remove with regex before tokenisation |
| Foreign-language fragments | Scattered | Language-ID filter (cld3) |
| Punctuation inside words | Blogs/news | tokens(remove_punct=TRUE) |
| Numeric strings | All sources | Remove or map to <NUM> token |
| Emoji & special chars | Twitter | Strip with iconv(to="ASCII", sub="") |
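Several of these strategies can be combined into one cleaning function. The sketch below covers only URLs, @mentions, non-ASCII characters, and numbers, using base R; the `clean_line()` helper and its regular expressions are assumptions, and profanity and language-ID filtering are left out. Note that `iconv()` with `sub = ""` drops characters that cannot be converted, whereas `sub = "byte"` would replace each byte with a `<xx>` hex code instead.

```r
# Clean one or more raw lines before tokenisation (illustrative rules only).
clean_line <- function(x) {
  x <- tolower(x)
  x <- gsub("https?://\\S+|www\\.\\S+", " ", x, perl = TRUE)    # URLs
  x <- gsub("@\\w+", " ", x, perl = TRUE)                       # @mentions
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")         # emoji etc.
  x <- gsub("\\b\\d+([.,]\\d+)*\\b", " <NUM> ", x, perl = TRUE) # numbers
  x <- gsub("\\s+", " ", x, perl = TRUE)                        # squeeze spaces
  trimws(x)
}

raw <- "Check https://t.co/abc @user I \u2665 2 pizzas"
cleaned <- clean_line(raw)
print(cleaned)
```

Lower-casing happens first so that the `<NUM>` placeholder keeps its upper-case form.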

5 Algorithm & App Plan

5.1 Predictive model - Stupid Back-off

The predictive text system will use a Stupid Back-off n-gram language model (Brants et al., 2007). It is fast, memory-efficient, and well-suited to large vocabulary tasks without requiring normalised probability estimates.

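Step 1 - Build n-gram frequency tables. Tokenise the corpus (lower-case, remove punctuation) and count every sequence of 2, 3, 4, and 5 consecutive words, along with single-word (unigram) counts, which are needed as bigram denominators and for the fallback in Step 4. Store the counts as compressed lookup tables.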

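Step 2 - Score candidates with back-off. Given the last four typed words as context, look up matching 5-grams first. If none is found, back off to 4-grams, then 3-grams, then 2-grams, and finally to unigram frequencies. Each back-off step multiplies the score by \(\lambda = 0.4\). For a \(k\)-gram ending in the candidate word \(w_i\):

\[ S(w_i \mid w_{i-k+1}^{i-1}) = \begin{cases} \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[6pt] \lambda \cdot S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise} \end{cases} \]

Here \(f(\cdot)\) is the corpus count of an n-gram. The recursion ends at the unigram level with \(S(w_i) = f(w_i)/N\), where \(N\) is the total number of tokens.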

Step 3 - Return top-3 predictions. Sort candidates by score descending and return the top 3 as button suggestions.

Step 4 - Handle unknown words. Map unseen words to <UNK>. If context is completely unseen, fall back to the 50 most frequent unigrams as safe defaults.
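Steps 1 to 4 can be sketched end to end in base R as follows. This is a toy illustration under stated assumptions, not the planned implementation: the helper names (`tokenise`, `ngrams_of`, `build_model`, `count_of`, `sbo_score`, `predict_next`) are hypothetical, counts live in a plain named vector rather than compressed tables, and every vocabulary word is scored, which would be far too slow at corpus scale.

```r
LAMBDA <- 0.4  # back-off discount from Step 2

tokenise <- function(text) {
  tok <- strsplit(tolower(text), "[^a-z']+")[[1]]
  tok[nzchar(tok)]
}

# All n-grams of order n in one token vector, joined by single spaces.
ngrams_of <- function(n, tok) {
  if (length(tok) < n) return(character(0))
  vapply(seq_len(length(tok) - n + 1),
         function(i) paste(tok[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Step 1: count every n-gram of order 1..max_n.
build_model <- function(lines, max_n = 5) {
  grams <- unlist(lapply(lines, function(line) {
    tok <- tokenise(line)
    unlist(lapply(seq_len(max_n), ngrams_of, tok = tok))
  }))
  tab    <- table(grams)
  counts <- setNames(as.integer(tab), names(tab))
  vocab  <- names(counts)[!grepl(" ", names(counts), fixed = TRUE)]
  list(counts = counts, vocab = vocab,
       n_tokens = sum(counts[vocab]), max_n = max_n)
}

# Count of an n-gram, or 0 if it was never seen.
count_of <- function(counts, gram) {
  n <- counts[gram]
  if (is.na(n)) 0L else unname(n)
}

# Step 2: Stupid Back-off score of `word` after the words in `context`.
sbo_score <- function(word, context, model) {
  if (length(context) == 0) {                 # unigram base case: f(w) / N
    return(count_of(model$counts, word) / model$n_tokens)
  }
  full <- paste(c(context, word), collapse = " ")
  seen <- count_of(model$counts, full)
  if (seen > 0) {
    return(seen / count_of(model$counts, paste(context, collapse = " ")))
  }
  LAMBDA * sbo_score(word, context[-1], model)  # drop the oldest context word
}

# Steps 3 and 4: map unknown context words to <UNK>, return the top-k words.
predict_next <- function(text, model, k = 3) {
  context <- tail(tokenise(text), model$max_n - 1)
  context[!(context %in% model$vocab)] <- "<UNK>"
  scores <- vapply(model$vocab, sbo_score, numeric(1),
                   context = context, model = model)
  names(head(sort(scores, decreasing = TRUE), k))
}

toy_lines <- c("i love new york", "i love new york city",
               "i love you", "happy new year")
model <- build_model(toy_lines)
print(predict_next("I love", model))
```

Because `<UNK>` never occurs in the count table, a completely unseen context backs off all the way to unigram frequencies, which is the fallback behaviour described in Step 4. A production version would look up only the n-grams that match the context rather than scoring the whole vocabulary.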

5.2 Shiny app design - 4-panel layout

Figure 6 - Shiny app conceptual wireframe

| Panel | Purpose | Key UI elements |
|---|---|---|
| 1. Input | Capture typing | textInput(), character counter, clear button |
| 2. Predictions | Show suggestions | Three actionButton() word chips |
| 3. Settings | Tune the model | sliderInput() for n-gram order and vocab size |
| 4. Statistics | Debug info | Latency, model version, token count |
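The four panels could be wired together roughly as below. This is a skeleton under assumptions, not the final app: all input and output IDs, the slider ranges, and the `predict_stub()` placeholder are hypothetical, and latency and model-version reporting are omitted from the statistics panel. Shiny functions are called with the `shiny::` prefix so that the definitions can be sourced without attaching the package.

```r
# Placeholder for the real model; always returns the same three words.
predict_stub <- function(text, order, vocab_size) {
  c("the", "and", "to")
}

build_ui <- function() {
  shiny::fluidPage(
    shiny::titlePanel("Next-word predictor (sketch)"),
    shiny::fluidRow(
      shiny::column(6, shiny::h4("1. Input"),
        shiny::textInput("text", "Type here:"),
        shiny::textOutput("n_chars"),
        shiny::actionButton("clear", "Clear")),
      shiny::column(6, shiny::h4("2. Predictions"),
        shiny::actionButton("pred1", "..."),
        shiny::actionButton("pred2", "..."),
        shiny::actionButton("pred3", "..."))
    ),
    shiny::fluidRow(
      shiny::column(6, shiny::h4("3. Settings"),
        shiny::sliderInput("order", "N-gram order",
                           min = 2, max = 5, value = 5, step = 1),
        shiny::sliderInput("vocab", "Vocabulary size",
                           min = 10000, max = 80000, value = 50000,
                           step = 10000)),
      shiny::column(6, shiny::h4("4. Statistics"),
        shiny::textOutput("stats"))
    )
  )
}

server <- function(input, output, session) {
  suggestions <- shiny::reactive(
    predict_stub(input$text, input$order, input$vocab)
  )
  output$n_chars <- shiny::renderText(paste(nchar(input$text), "characters"))
  output$stats <- shiny::renderText({
    n_tok <- length(strsplit(trimws(input$text), "\\s+")[[1]])
    paste("Tokens typed:", n_tok)
  })
  shiny::observeEvent(input$clear,
    shiny::updateTextInput(session, "text", value = ""))
  # Relabel the three chips whenever the suggestions change.
  shiny::observe({
    words <- suggestions()
    for (i in 1:3) {
      shiny::updateActionButton(session, paste0("pred", i), label = words[i])
    }
  })
  # Clicking a chip appends its word to the input text.
  lapply(1:3, function(i) {
    shiny::observeEvent(input[[paste0("pred", i)]], {
      shiny::updateTextInput(session, "text",
        value = paste(trimws(input$text), suggestions()[i]))
    })
  })
}

# Launch only in an interactive session with shiny installed.
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  shiny::runApp(shiny::shinyApp(build_ui(), server))
}
```

Swapping `predict_stub()` for the real scoring function is the only change needed to connect this layout to the model.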

6 Summary & Next Steps

  • Data: 4.27 M lines, ~102 M tokens across blogs, news, and Twitter.
  • Vocabulary: ~660 K unique types; by Table 2, the top 50 K cover more than 90% of tokens.
  • Key finding: Zipf power-law means a compact model achieves high coverage by focusing on frequent n-grams.
  • Noise: Twitter requires the most preprocessing (URLs, mentions, slang).
  • Model: Stupid Back-off over 2-5-gram tables; fast and memory-efficient.
  • App: 4-panel Shiny interface - input, predictions, settings, statistics.
  • Next: Build and evaluate the back-off model; profile memory and latency; deploy to shinyapps.io.

7 Reproducibility

## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kableExtra_1.4.0 scales_1.4.0     tibble_3.3.1     tidyr_1.3.2     
## [5] stringr_1.6.0    ggplot2_4.0.2    dplyr_1.2.1     
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-4       gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.3    
##  [5] tidyselect_1.2.1   xml2_1.5.2         jquerylib_0.1.4    splines_4.5.3     
##  [9] textshaping_1.0.5  systemfonts_1.3.2  yaml_2.3.12        fastmap_1.2.0     
## [13] lattice_0.22-9     R6_2.6.1           labeling_0.4.3     generics_0.1.4    
## [17] knitr_1.51         svglite_2.2.2      bslib_0.10.0       pillar_1.11.1     
## [21] RColorBrewer_1.1-3 rlang_1.1.7        cachem_1.1.0       stringi_1.8.7     
## [25] xfun_0.57          sass_0.4.10        S7_0.2.1           viridisLite_0.4.3 
## [29] cli_3.6.5          mgcv_1.9-4         withr_3.0.2        magrittr_2.0.4    
## [33] digest_0.6.39      grid_4.5.3         rstudioapi_0.18.0  nlme_3.1-168      
## [37] lifecycle_1.0.5    vctrs_0.7.2        evaluate_1.0.5     glue_1.8.0        
## [41] farver_2.1.2       rmarkdown_2.31     purrr_1.2.1        tools_4.5.3       
## [45] pkgconfig_2.0.3    htmltools_0.5.9

Report generated with R 4.5.3 on 2026-04-04