1 Introduction

This report presents an exploratory analysis of the HC Corpora English dataset, which forms the foundation of the Johns Hopkins / Coursera Data Science Capstone predictive-text project.

The corpus contains three files of raw, naturally occurring English text:

File	Domain	Style
`en_US.blogs.txt`	Personal blogs	Long-form, informal prose
`en_US.news.txt`	News articles	Formal, edited prose
`en_US.twitter.txt`	Tweets	Short, noisy, informal

The goals of this report are to:

Summarise corpus size (lines, words, vocabulary).
Visualise word-length distributions, top unigrams, line-length patterns, and vocabulary coverage.
Highlight interesting linguistic findings.
Outline the algorithm and Shiny app to be built in later milestones.

2 Data Summaries

2.1 Line & word counts

Table 1 - Corpus size by source
Source	Lines	Total Words	Avg Words/Line	Max Words/Line	Unique Tokens
Blogs	899,288	37,334,690	41.5	6,726	321,142
News	1,010,242	34,372,720	34.1	11,878	287,616
Twitter	2,360,148	30,218,180	12.8	140	441,956

Key observation: Twitter has the most lines (approx. 2.4 M) but the fewest words per line (12.8 on average), reflecting the 140-character limit. Blogs have the highest average line length (41.5 words/line), although the single longest line appears in News (11,878 words).
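
The columns of Table 1 can be reproduced with a few lines of base R. The sketch below is illustrative only: `summarise_lines` is a hypothetical helper name, tokens are split on whitespace (an assumption; the report's own tokeniser may differ), and a three-line toy vector stands in for the real files.

```r
# Summarise a character vector of lines: line count, word counts, vocabulary size.
# `summarise_lines` is an illustrative helper, not the report's original code.
summarise_lines <- function(lines) {
  # Lower-case, split on runs of whitespace, and drop empty strings.
  tokens_per_line <- lapply(
    strsplit(tolower(lines), "\\s+"),
    function(x) x[nzchar(x)]
  )
  words_per_line <- lengths(tokens_per_line)
  data.frame(
    lines              = length(lines),
    total_words        = sum(words_per_line),
    avg_words_per_line = round(mean(words_per_line), 1),
    max_words_per_line = max(words_per_line),
    unique_tokens      = length(unique(unlist(tokens_per_line)))
  )
}

# Toy input. For the real corpus, `lines` could come from something like
# readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE).
toy <- c("the cat sat on the mat", "the dog barked", "hello")
toy_summary <- summarise_lines(toy)
print(toy_summary)
```

Splitting on whitespace counts punctuation-attached forms such as `mat.` and `mat` as different tokens, so unique-token counts from this sketch would run higher than counts taken after cleaning.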

2.2 Vocabulary coverage

How many unique word types are needed to cover 50%, 90%, and 99% of all tokens?

Table 2 - Vocabulary coverage thresholds
Coverage	Unique word types needed	Total unique types
50 %	64	80,000
90 %	14,067	80,000
99 %	66,576	80,000

Implication for modelling: According to Table 2, about 14,000 word types already cover 90% of running text, and about 66,600 are needed to reach 99%. Capping the dictionary at roughly the top 50,000 types therefore keeps coverage well above 90% while dramatically reducing model size.
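
The thresholds in Table 2 come from a cumulative sum over the frequency-sorted vocabulary. A minimal base-R sketch of that calculation follows; `types_needed` is a hypothetical helper name and the ten-token vector is toy data, not the corpus.

```r
# Number of most-frequent word types needed to reach each coverage target.
# `types_needed` is an illustrative helper name.
types_needed <- function(tokens, targets = c(0.50, 0.90, 0.99)) {
  freq <- sort(table(tokens), decreasing = TRUE)
  cum_cov <- cumsum(as.numeric(freq)) / length(tokens)
  # First rank at which cumulative coverage meets each target
  # (small tolerance guards against floating-point rounding).
  vapply(targets, function(p) which(cum_cov >= p - 1e-12)[1], integer(1))
}

# Toy data: "the" x5, "of" x3, "cat" x1, "dog" x1 (10 tokens, 4 types).
toy_tokens <- c(rep("the", 5), rep("of", 3), "cat", "dog")
needed <- types_needed(toy_tokens)
print(needed)
```

On the toy data the cumulative coverage is 0.5, 0.8, 0.9, 1.0, so one type covers 50%, three cover 90%, and all four are needed for 99%.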

3 Plots & Tables

3.1 Word-length distribution

Figure 1 - Word-length distribution by source

Finding: All three sources peak at 3-character words (the, and, for). Twitter shows more very short tokens (1-2 chars) due to slang and abbreviations.

3.2 Top unigrams (stop-words removed)

Figure 2 - Top 15 content words per source

3.3 Line-length distribution

Figure 3 - Words per line by source

3.4 Vocabulary coverage curve

Figure 4 - Cumulative vocabulary coverage

4 Interesting Findings

4.1 Zipf’s Law

Word frequency in natural language follows Zipf’s Law: the \(n\)-th most common word appears roughly \(1/n\) times as often as the most common word.

Figure 5 - Zipf’s Law log-log frequency vs rank

The nearly straight line on the log-log plot is consistent with a power-law relationship. A very small set of high-frequency words (64 types in Table 2) accounts for 50% of all tokens, while the long tail of rare words inflates vocabulary size without contributing much coverage.
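
One way to quantify how closely a frequency list follows Zipf's Law is to regress log frequency on log rank; a slope near -1 corresponds to the \(1/n\) relationship. The sketch below assumes a plain least-squares fit with `lm()`, uses a hypothetical helper name `zipf_slope`, and runs on synthetic frequencies rather than the corpus counts.

```r
# Estimate the Zipf exponent as the slope of log(frequency) on log(rank).
# `zipf_slope` is an illustrative helper name.
zipf_slope <- function(freq) {
  freq <- sort(as.numeric(freq), decreasing = TRUE)
  rank <- seq_along(freq)
  fit <- lm(log(freq) ~ log(rank))
  unname(coef(fit)[2])
}

# Synthetic frequencies that follow f(n) = 1000 / n exactly.
ideal <- 1000 / (1:200)
slope <- zipf_slope(ideal)
print(slope)
```

For the synthetic input the slope is -1 up to numerical precision. Real corpora typically bend away from the line at the very top and in the rare-word tail, so a fit restricted to the middle ranks may be more informative.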

4.2 Source-style divergence

Table 3 - Style characteristics by source
Feature	Blogs	News	Twitter
Avg line length	High (41 words/line)	Medium (34 words/line)	Low (13 words/line)
Formal register	Moderate	High	Low
Named entities	Low	High	Medium
Slang / abbrev.	Low	Rare	High
URLs present	Rare	Rare	Common
Emoticons / emoji	Rare	Never	Common

4.3 Bigram & trigram examples

Table 4 - Top bigrams and trigrams per source (stop-words removed)
Source	Top bigrams	Top trigrams
Blogs	happy new · last year · every day · first time · right now	new year eve · last couple days · first time ever
News	new york · last year · percent said · white house · prime minister	new york city · president barack obama · prime minister said
Twitter	right now · last night · so much · cant wait · happy birthday	cant wait see · happy new year · love you so

4.4 Data quality notes

Noise type	Prevalence	Handling strategy
Profanity	Common on Twitter	Filter using a profanity lexicon
URLs & @mentions	Very common on Twitter	Remove with regex before tokenisation
Foreign-language fragments	Scattered	Language-ID filter (`cld3`)
Punctuation inside words	Blogs/news	`tokens(remove_punct=TRUE)`
Numeric strings	All sources	Remove or map to `<NUM>` token
Emoji & special chars	Twitter	Strip with `iconv(x, "UTF-8", "ASCII", sub = "")`
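
Several of the handling strategies above can be chained into a single cleaning pass. The following base-R sketch is an assumption about how that might look, not the report's pipeline: `clean_text` is a hypothetical name, the regular expressions are deliberately simple, and profanity and language-ID filtering are left out.

```r
# Illustrative cleaning pass covering URLs, @mentions, emoji, numbers, punctuation.
# `clean_text` is a hypothetical helper; the patterns are simplified assumptions.
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")        # drop emoji / non-ASCII
  x <- tolower(x)
  x <- gsub("https?://\\S+|www\\.\\S+", " ", x, perl = TRUE)   # URLs
  x <- gsub("@\\w+", " ", x, perl = TRUE)                      # @mentions
  x <- gsub("[<>]", " ", x, perl = TRUE)                       # reserve < > for <NUM>
  x <- gsub("[0-9]+([.,][0-9]+)*", " <NUM> ", x, perl = TRUE)  # numeric strings
  x <- gsub("[^a-zA-Z<>' ]", " ", x, perl = TRUE)              # remaining punctuation
  x <- gsub("\\s+", " ", x, perl = TRUE)                       # squeeze whitespace
  trimws(x)
}

cleaned <- clean_text(c(
  "Check https://t.co/abc @user I paid 3.50 today!!",
  "love it \U0001F600"
))
print(cleaned)
```

The order matters: non-ASCII bytes are dropped first so later patterns see plain ASCII, and numbers are mapped to `<NUM>` before the final punctuation pass so that a value such as 3.50 becomes one placeholder rather than two.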

5 Algorithm & App Plan

5.1 Predictive model - Stupid Back-off

The predictive text system will use a Stupid Back-off n-gram language model (Brants et al., 2007). It is fast and memory-efficient, and because it ranks candidates by unnormalised scores rather than true probabilities, it scales well to large vocabularies.

Step 1 - Build n-gram frequency tables. Tokenise the corpus (lower-case, remove punctuation) and count every sequence of 2, 3, 4, and 5 consecutive words. Store as compressed lookup tables.

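Step 2 - Score candidates with back-off. Given the most recent typed words (up to four, since 5-grams are the highest order), look up matching 5-grams first. If none is found, back off to 4-grams, then 3-grams, then 2-grams. Each back-off step multiplies the score by \(\lambda = 0.4\). Here \(f(\cdot)\) is the observed count of an n-gram and \(w_{i-k+1}^{i-1}\) denotes the \(k-1\) words preceding \(w_i\):

\[ S(w_i \mid w_{i-k+1}^{i-1}) = \begin{cases} \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[6pt] \lambda \cdot S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise} \end{cases} \]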

Step 3 - Return top-3 predictions. Sort candidates by score descending and return the top 3 as button suggestions.

Step 4 - Handle unknown words. Map unseen words to <UNK>. If context is completely unseen, fall back to the 50 most frequent unigrams as safe defaults.
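
The four steps can be sketched end to end in base R. This is a toy illustration under stated assumptions, not the planned implementation: the function names (`count_ngrams`, `sbo_score`, `predict_next`) are hypothetical, counts are kept in a plain named vector rather than compressed tables, unigram counts are included so the recursion has a base case, every known word is scored as a candidate, and `<UNK>` mapping is omitted.

```r
# Step 1 (sketch): count all n-grams of order 1..max_n.
# `sentences` is a list of character vectors, one per tokenised line.
count_ngrams <- function(sentences, max_n = 5) {
  grams <- unlist(lapply(sentences, function(tok) {
    unlist(lapply(seq_len(max_n), function(n) {
      if (length(tok) < n) return(character(0))
      starts <- seq_len(length(tok) - n + 1)
      vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
             character(1))
    }))
  }))
  counts <- table(grams)
  setNames(as.numeric(counts), names(counts))
}

# Step 2 (sketch): Stupid Back-off score of `word` given a context vector.
sbo_score <- function(word, context, counts, lambda = 0.4) {
  if (length(context) == 0) {
    # Base case: unigram relative frequency (0 for unseen words).
    unigrams <- counts[!grepl(" ", names(counts), fixed = TRUE)]
    f <- counts[word]
    return(if (is.na(f)) 0 else unname(f) / sum(unigrams))
  }
  ctx <- paste(context, collapse = " ")
  full <- paste(ctx, word)
  if (!is.na(counts[full])) {
    return(unname(counts[full] / counts[ctx]))
  }
  # Unseen at this order: drop the oldest context word and discount.
  lambda * sbo_score(word, context[-1], counts, lambda)
}

# Step 3 (sketch): score every known word and return the best `top`.
predict_next <- function(text, counts, max_n = 5, top = 3) {
  tok <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  context <- tail(tok, max_n - 1)
  vocab <- names(counts)[!grepl(" ", names(counts), fixed = TRUE)]
  scores <- vapply(vocab, sbo_score, numeric(1),
                   context = context, counts = counts)
  names(sort(scores, decreasing = TRUE))[seq_len(min(top, length(scores)))]
}

toy_sentences <- list(
  c("i", "love", "new", "york"),
  c("i", "love", "new", "music"),
  c("i", "love", "new", "york"),
  c("we", "love", "pizza")
)
counts <- count_ngrams(toy_sentences)
suggestions <- predict_next("I love new", counts)
print(suggestions)
```

In the toy data "i love new" occurs three times and is followed by "york" twice, so "york" scores 2/3. A word never seen after that context, such as "pizza", backs off three times and scores \(0.4^3\) times its unigram frequency. Scoring the whole vocabulary is only practical at toy scale; a real model would restrict candidates to words observed after the matched context.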

5.2 Shiny app design - 4-panel layout

Figure 6 - Shiny app conceptual wireframe

Panel	Purpose	Key UI elements
1. Input	Capture typing	`textInput()`, character counter, clear button
2. Predictions	Show suggestions	Three `actionButton()` word chips
3. Settings	Tune the model	`sliderInput()` for n-gram order and vocab size
4. Statistics	Debug info	Latency, model version, token count
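
A skeleton of the four panels might look like the sketch below. It assumes the `shiny` package is installed and uses a hypothetical stub, `predict_words`, in place of the real model; the input IDs, slider ranges, and version string are illustrative choices, not part of the plan above.

```r
library(shiny)

# Hypothetical stub: a real app would call the back-off model here.
predict_words <- function(text, order = 5, vocab_size = 50000) {
  c("the", "to", "and")
}

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  sidebarLayout(
    sidebarPanel(
      # Panel 3: settings
      sliderInput("order", "n-gram order", min = 2, max = 5, value = 5, step = 1),
      sliderInput("vocab", "Vocabulary size", min = 10000, max = 80000,
                  value = 50000, step = 10000)
    ),
    mainPanel(
      # Panel 1: input
      textInput("text", "Type here", value = ""),
      textOutput("n_chars"),
      actionButton("clear", "Clear"),
      # Panel 2: predictions
      uiOutput("chips"),
      # Panel 4: statistics
      verbatimTextOutput("stats")
    )
  )
)

server <- function(input, output, session) {
  preds <- reactive(predict_words(input$text, input$order, input$vocab))

  output$n_chars <- renderText(paste(nchar(input$text), "characters"))

  # Render one button per suggestion.
  output$chips <- renderUI(tagList(lapply(seq_along(preds()), function(i) {
    actionButton(paste0("pred", i), preds()[i])
  })))

  # Clicking a chip appends that word to the input box.
  lapply(1:3, function(i) {
    observeEvent(input[[paste0("pred", i)]], {
      updateTextInput(session, "text",
                      value = paste(trimws(input$text), preds()[i]))
    })
  })

  observeEvent(input$clear, updateTextInput(session, "text", value = ""))

  output$stats <- renderText({
    words <- strsplit(trimws(input$text), "\\s+")[[1]]
    n_tokens <- sum(nzchar(words))
    paste0("Tokens typed: ", n_tokens, "\nModel version: placeholder")
  })
}

# Launch only in an interactive session.
if (interactive()) shinyApp(ui, server)
```

The settings sliders are wired into the reactive so that changing them re-runs the predictor, which is where latency measurement would be added once a real model replaces the stub.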

6 Summary & Next Steps

Data: 4.27 M lines, ~102 M tokens across blogs, news, and Twitter.
Vocabulary: ~660 K unique types; about 14 K types cover 90% of tokens (Table 2), so a 50 K cap stays well above that threshold.
Key finding: Zipf power-law means a compact model achieves high coverage by focusing on frequent n-grams.
Noise: Twitter requires the most preprocessing (URLs, mentions, slang).
Model: Stupid Back-off over 2-5-gram tables; fast and memory-efficient.
App: 4-panel Shiny interface - input, predictions, settings, statistics.
Next: Build and evaluate the back-off model; profile memory and latency; deploy to shinyapps.io.

7 Reproducibility

## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kableExtra_1.4.0 scales_1.4.0     tibble_3.3.1     tidyr_1.3.2     
## [5] stringr_1.6.0    ggplot2_4.0.2    dplyr_1.2.1     
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-4       gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.3    
##  [5] tidyselect_1.2.1   xml2_1.5.2         jquerylib_0.1.4    splines_4.5.3     
##  [9] textshaping_1.0.5  systemfonts_1.3.2  yaml_2.3.12        fastmap_1.2.0     
## [13] lattice_0.22-9     R6_2.6.1           labeling_0.4.3     generics_0.1.4    
## [17] knitr_1.51         svglite_2.2.2      bslib_0.10.0       pillar_1.11.1     
## [21] RColorBrewer_1.1-3 rlang_1.1.7        cachem_1.1.0       stringi_1.8.7     
## [25] xfun_0.57          sass_0.4.10        S7_0.2.1           viridisLite_0.4.3 
## [29] cli_3.6.5          mgcv_1.9-4         withr_3.0.2        magrittr_2.0.4    
## [33] digest_0.6.39      grid_4.5.3         rstudioapi_0.18.0  nlme_3.1-168      
## [37] lifecycle_1.0.5    vctrs_0.7.2        evaluate_1.0.5     glue_1.8.0        
## [41] farver_2.1.2       rmarkdown_2.31     purrr_1.2.1        tools_4.5.3       
## [45] pkgconfig_2.0.3    htmltools_0.5.9

Report generated with R 4.5.3 on 2026-04-04

NLP Corpus Exploratory Analysis

Data Science Capstone · Milestone Report

Pran Krishna Kar

April 04, 2026