This report summarises the exploratory analysis carried out on the Coursera / SwiftKey HC Corpora dataset as part of the Data Science Capstone project. The ultimate goal is to build a next-word prediction algorithm and deploy it as a Shiny web application.
The corpus consists of three English-language text files sampled from blogs, news articles and Twitter. The key findings of the exploration are set out in the sections that follow.
The data are provided by HC Corpora via the Coursera capstone page. After downloading and unzipping, the English-language corpus lives in `final/en_US/` and contains three plain-text files:
| File | Description |
|---|---|
| `en_US.blogs.txt` | Posts scraped from English-language blogs |
| `en_US.news.txt` | Articles scraped from English-language news sites |
| `en_US.twitter.txt` | Tweets scraped from Twitter |
```r
# Adjust this path to wherever you unzipped the Coursera dataset
data_path <- "D:/final/en_US/"

blogs_raw <- readLines(paste0(data_path, "en_US.blogs.txt"),
                       encoding = "UTF-8", skipNul = TRUE)
news_raw <- readLines(paste0(data_path, "en_US.news.txt"),
                      encoding = "UTF-8", skipNul = TRUE)
twitter_raw <- readLines(paste0(data_path, "en_US.twitter.txt"),
                         encoding = "UTF-8", skipNul = TRUE)
```

Note: Files are large (~200 MB each). `skipNul = TRUE` avoids embedded-null errors common in this dataset.
| Source | File Size (MB) | Line Count | Word Count | Char Count |
|---|---|---|---|---|
| Blogs | 210.2 | 899,288 | 37,334,131 | 206,824,505 |
| News | 205.8 | 1,010,206 | 34,371,031 | 203,214,543 |
| Twitter | 167.1 | 2,360,148 | 30,373,583 | 162,096,241 |
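Summary figures of this kind can be computed with a few lines of base R. The sketch below is one possible approach, not necessarily the code behind the table: `corpus_stats()` is a hypothetical helper, and words are approximated by splitting on whitespace, so the counts may differ slightly from those of other tokenisers.

```r
# Hypothetical helper: line, word and character counts for a character vector.
# Words are approximated as whitespace-separated chunks (an assumption).
corpus_stats <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  data.frame(
    lines = length(lines),
    words = sum(lengths(tokens)),   # empty lines contribute zero words
    chars = sum(nchar(lines))
  )
}

# Toy stand-in for blogs_raw, news_raw or twitter_raw
toy <- c("the quick brown fox", "jumps over", "")
stats <- corpus_stats(toy)
print(stats)

# On disk, file.size(path) / 1024^2 gives the file size in MB.
```

Applied to the real vectors, the same helper would be called once per source and the three rows bound together.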
Key takeaway: The corpus is very large — nearly 4.3 million lines and over 100 million words. Working with the full dataset is computationally expensive, so the analysis below uses a random 1 % sample from each source, which is large enough to reveal robust patterns.
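A reproducible 1 % sample can be drawn as sketched below. The helper name `sample_lines()` and the seed value are assumptions made for illustration; the report does not state which seed or sampling function was used.

```r
# Hypothetical helper: draw a reproducible random sample of lines.
sample_lines <- function(lines, frac = 0.01, seed = 2024) {
  set.seed(seed)                                  # fixed seed (assumed value)
  n <- max(1, round(length(lines) * frac))        # keep at least one line
  lines[sort(sample.int(length(lines), n))]       # preserve original order
}

# Toy stand-in for one of the raw character vectors
toy_lines  <- sprintf("line %d", 1:1000)
toy_sample <- sample_lines(toy_lines)
length(toy_sample)
```

Sampling each source separately, rather than pooling the three files first, keeps every source represented in proportion to its own size.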
Cleaning steps applied:
Observation: Twitter lines are tightly concentrated below 30 words (enforced by character limits). Blog posts have the widest spread — some entries exceed 150 words per line.
| Source | Word | Count |
|---|---|---|
| Blogs | time | 859 |
| Blogs | people | 594 |
| Blogs | day | 517 |
| Blogs | love | 434 |
| Blogs | life | 407 |
| Blogs | world | 320 |
| Blogs | home | 305 |
| Blogs | don | 288 |
| Blogs | book | 284 |
| Blogs | feel | 276 |
| News | time | 547 |
| News | people | 439 |
| News | school | 347 |
| News | city | 345 |
| News | percent | 345 |
| News | day | 332 |
| News | game | 331 |
| News | million | 322 |
| News | county | 317 |
| News | season | 307 |
| Twitter | love | 1,044 |
| Twitter | day | 964 |
| Twitter | rt | 910 |
| Twitter | time | 788 |
| Twitter | lol | 694 |
| Twitter | people | 527 |
| Twitter | tonight | 497 |
| Twitter | follow | 492 |
| Twitter | happy | 478 |
| Twitter | night | 449 |
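A top-words table of this shape could be produced along the following lines. This is a base R sketch under stated assumptions: `top_words()` and the tiny stop-word list are hypothetical, and since the session information lists tidytext among the attached packages, the report itself likely used that package instead.

```r
# Hypothetical helper: most frequent non-stop-words in a character vector.
top_words <- function(lines, stop_words, k = 10) {
  # Lower-case, then split on anything that is not a letter or apostrophe
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
  counts <- sort(table(tokens), decreasing = TRUE)
  head(counts, k)
}

toy_lines <- c("The time of the day", "Time and love", "Love the time")
toy_stop  <- c("the", "of", "and")   # illustrative list only
top <- top_words(toy_lines, toy_stop, k = 2)
print(top)
```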
A key question for building an efficient model is: how many unique words cover 50 % and 90 % of all word instances?
| Coverage Target | Unique Words Needed |
|---|---|
| 50 % | 142 |
| 90 % | 6,898 |
Insight: Just ~142 unique words cover half of all text — confirming that a relatively small vocabulary can power a useful predictor.
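The coverage figures follow from the cumulative share of the sorted word frequencies. A minimal sketch, assuming a plain character vector of tokens and a hypothetical helper `words_for_coverage()`:

```r
# Hypothetical helper: number of distinct words needed to cover a given
# share of all word instances.
words_for_coverage <- function(tokens, target) {
  freq <- sort(table(tokens), decreasing = TRUE)
  cum_share <- cumsum(as.numeric(freq)) / length(tokens)
  which(cum_share >= target)[1]    # first rank at which the target is reached
}

# Toy token vector: 11 instances of 4 distinct words
toy_tokens <- c(rep("the", 6), rep("time", 3), "day", "love")
n50 <- words_for_coverage(toy_tokens, 0.50)
n90 <- words_for_coverage(toy_tokens, 0.90)
c(n50, n90)
```

In the toy data the single most frequent word already covers more than half of the instances, which mirrors the long-tailed shape seen in the real corpus.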
N-grams (sequences of n consecutive words) are the foundation of the prediction model.
| N-gram Type | Unique N-grams in Sample |
|---|---|
| Unigrams (1-word) | 51,030 |
| Bigrams (2-word) | 440,077 |
| Trigrams (3-word) | 775,067 |
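To make the definition concrete, the sketch below builds n-grams from a token vector in base R. `make_ngrams()` is a hypothetical helper written for illustration; with tidytext, `unnest_tokens()` with `token = "ngrams"` would be a more likely choice for the real sample.

```r
# Hypothetical helper: all n-grams (as space-joined strings) from a token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens   <- c("i", "love", "new", "york", "i", "love", "new", "jersey")
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

length(unique(bigrams))    # distinct bigrams in the toy sentence
length(unique(trigrams))   # distinct trigrams in the toy sentence
```

In practice the helper would be applied line by line, so that no n-gram spans the boundary between two unrelated lines.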
The prediction algorithm will be based on N-gram language modelling with Stupid Backoff smoothing:
| Step | Description |
|---|---|
| 1. Build N-gram tables | Compute frequency tables for unigrams, bigrams, trigrams and quadrigrams from the full corpus |
| 2. Store efficiently | Save as compressed data.table objects; keep only N-grams appearing ≥ 2 times |
| 3. Predict | Given the last 1–3 words typed, look up the matching (n−1)-gram and return the top-k most likely next words |
| 4. Backoff | If a trigram prefix isn’t found, fall back to bigram; if that fails, fall back to unigram frequencies |
| 5. Profanity filter | Strip any profane words from suggestions using a blocklist |
This approach is fast, interpretable and works well even on modest hardware — important for a responsive Shiny app.
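The predict and backoff steps can be sketched as follows. This is a simplified illustration rather than the planned implementation: the count tables are toy named vectors instead of data.table objects, only orders up to trigrams are included, `predict_next()` is a hypothetical name, and the backoff factor of 0.4 is the value commonly quoted for Stupid Backoff. Unlike the full scheme, the sketch returns candidates only from the highest order that has a match.

```r
# Toy n-gram counts (hypothetical; real tables would come from the corpus)
unigrams <- c(the = 5, cat = 3, sat = 2, dog = 1)
bigrams  <- c("the cat" = 3, "the dog" = 1, "cat sat" = 2)
trigrams <- c("the cat sat" = 2)

predict_next <- function(context, k = 3, alpha = 0.4) {
  tables  <- list(unigrams, bigrams, trigrams)
  context <- tail(context, length(tables) - 1)   # keep the last two words
  penalty <- 1
  while (length(context) > 0) {
    prefix <- paste(context, collapse = " ")
    higher <- tables[[length(context) + 1]]
    hits   <- higher[startsWith(names(higher), paste0(prefix, " "))]
    if (length(hits) > 0) {
      # Score = penalty * count(prefix + word) / count(prefix)
      lower_count <- tables[[length(context)]][[prefix]]
      scores <- penalty * hits / lower_count
      names(scores) <- sub(".* ", "", names(hits))   # keep only the last word
      return(head(sort(scores, decreasing = TRUE), k))
    }
    context <- context[-1]        # drop the oldest word and back off
    penalty <- penalty * alpha
  }
  # No context matched: fall back to relative unigram frequencies
  head(sort(penalty * unigrams / sum(unigrams), decreasing = TRUE), k)
}

predict_next(c("the", "cat"))     # trigram match
predict_next(c("purple", "cat"))  # backs off to the bigram table
predict_next("zebra")             # backs off to unigrams
```

A profanity filter would then be a final step that drops any suggestion found in the blocklist before the top-k words are returned.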
The app will provide a real-time, as-you-type next-word prediction interface.
The UI will be kept minimal so that the prediction latency stays well under one second.
All code used in this report is available in the associated GitHub repository. The analysis was performed with:
```
## R version 4.6.0 (2026-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=Russian_Kazakhstan.utf8 LC_CTYPE=Russian_Kazakhstan.utf8
## [3] LC_MONETARY=Russian_Kazakhstan.utf8 LC_NUMERIC=C
## [5] LC_TIME=Russian_Kazakhstan.utf8
##
## time zone: Asia/Qyzylorda
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 wordcloud_2.6 RColorBrewer_1.1-3 scales_1.4.0
## [5] tidytext_0.4.3 lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0
## [9] dplyr_1.2.1 purrr_1.2.2 readr_2.2.0 tidyr_1.3.2
## [13] tibble_3.3.1 ggplot2_4.0.3 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] janeaustenr_1.0.0 sass_0.4.10 generics_0.1.4 xml2_1.5.2
## [5] stringi_1.8.7 lattice_0.22-9 hms_1.1.4 digest_0.6.39
## [9] magrittr_2.0.5 evaluate_1.0.5 grid_4.6.0 timechange_0.4.0
## [13] fastmap_1.2.0 jsonlite_2.0.0 Matrix_1.7-5 viridisLite_0.4.3
## [17] textshaping_1.0.5 jquerylib_0.1.4 cli_3.6.6 rlang_1.2.0
## [21] tokenizers_0.3.0 withr_3.0.2 cachem_1.1.0 yaml_2.3.12
## [25] tools_4.6.0 tzdb_0.5.0 vctrs_0.7.3 R6_2.6.1
## [29] lifecycle_1.0.5 pkgconfig_2.0.3 pillar_1.11.1 bslib_0.10.0
## [33] gtable_0.3.6 glue_1.8.1 Rcpp_1.1.1-1.1 systemfonts_1.3.2
## [37] xfun_0.57 tidyselect_1.2.1 rstudioapi_0.18.0 knitr_1.51
## [41] farver_2.1.2 htmltools_0.5.9 SnowballC_0.7.1 labeling_0.4.3
## [45] rmarkdown_2.31 svglite_2.2.2 compiler_4.6.0 S7_0.2.2
```