1 Executive Summary

This report summarises the exploratory analysis carried out on the Coursera / SwiftKey HC Corpora dataset as part of the Data Science Capstone project. The ultimate goal is to build a next-word prediction algorithm and deploy it as a Shiny web application.

The corpus consists of three English-language text files sampled from blogs, news articles and Twitter. Key findings from this exploration are:

  • The three files together contain over 4 million lines and roughly 100 million words.
  • A relatively small vocabulary carries most of the text: roughly 140 of the most frequent words cover half of all word instances, and about 6,900 cover 90 %.
  • Twitter text is shorter, more informal and uses more punctuation and abbreviations than the other sources.
  • N-gram analysis (unigrams, bigrams, trigrams) shows clear patterns that can be exploited for prediction.

2 The Data

2.1 Source & Download

The data are provided by HC Corpora via the Coursera capstone page. After downloading and unzipping, the English-language corpus lives in final/en_US/ and contains three plain-text files:

File                Description
en_US.blogs.txt     Posts scraped from English-language blogs
en_US.news.txt      Articles scraped from English-language news sites
en_US.twitter.txt   Tweets scraped from Twitter

2.2 Loading the Data

# Adjust this path to wherever you unzipped the Coursera dataset
data_path <- "D:/final/en_US/"

blogs_raw   <- readLines(paste0(data_path, "en_US.blogs.txt"),
                         encoding = "UTF-8", skipNul = TRUE)
news_raw    <- readLines(paste0(data_path, "en_US.news.txt"),
                         encoding = "UTF-8", skipNul = TRUE)
twitter_raw <- readLines(paste0(data_path, "en_US.twitter.txt"),
                         encoding = "UTF-8", skipNul = TRUE)

Note: The files are large (167–210 MB each; see Table 1). skipNul = TRUE avoids the embedded-null read errors that are common in this dataset.


3 Basic Summary Statistics

Table 1 – File-level summary statistics
Source    File Size (MB)   Line Count   Word Count   Char Count
Blogs              210.2      899,288   37,334,131   206,824,505
News               205.8    1,010,206   34,371,031   203,214,543
Twitter            167.1    2,360,148   30,373,583   162,096,241

Key takeaway: The corpus is very large — nearly 4.3 million lines and over 100 million words. Working with the full dataset is computationally expensive, so the analysis below uses a random 1 % sample from each source, which is large enough to reveal robust patterns.
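
Table 1 can be reproduced from the raw vectors loaded in Section 2.2. A minimal sketch (nchar counts exclude line breaks, so character totals may differ slightly from the table):

library(stringr)

file_stats <- function(x, path) {
  c(size_mb = round(file.size(path) / 1024^2, 1),
    lines   = length(x),
    words   = sum(str_count(x, "\\S+")),   # words = runs of non-whitespace
    chars   = sum(nchar(x)))
}

file_stats(blogs_raw, paste0(data_path, "en_US.blogs.txt"))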


4 Sampling & Cleaning

Cleaning steps applied (a code sketch follows the list):

  • Convert to lower case
  • Remove numbers, punctuation and special characters (keeping apostrophes for contractions)
  • Collapse extra whitespace
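
A minimal sketch of the sampling and cleaning, assuming the *_raw vectors from Section 2.2; the seed and the regular expressions are illustrative choices, not necessarily the exact ones behind the tables below:

set.seed(42)                        # fix the random sample for reproducibility

sample_lines <- function(x, frac = 0.01) {
  x[rbinom(length(x), 1, frac) == 1]
}

clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)     # keep letters and apostrophes only
  x <- gsub("\\s+", " ", x)         # collapse runs of whitespace
  trimws(x)
}

blogs_sample   <- clean_text(sample_lines(blogs_raw))
news_sample    <- clean_text(sample_lines(news_raw))
twitter_sample <- clean_text(sample_lines(twitter_raw))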

5 Distribution of Line Lengths

Observation: Twitter lines are tightly concentrated below 30 words (enforced by character limits). Blog posts have the widest spread — some entries exceed 150 words per line.
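
The histograms behind this observation can be sketched with ggplot2, assuming the cleaned *_sample vectors from Section 4:

library(tidyverse)

line_lengths <- bind_rows(
  tibble(source = "Blogs",   words = str_count(blogs_sample,   "\\S+")),
  tibble(source = "News",    words = str_count(news_sample,    "\\S+")),
  tibble(source = "Twitter", words = str_count(twitter_sample, "\\S+"))
)

ggplot(line_lengths, aes(words)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ source, scales = "free") +
  labs(x = "Words per line", y = "Lines")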


6 Word Frequency Analysis (Unigrams)

Table 2 – Top 10 Content Words per Source
Source    Word      Count
Blogs     time        859
Blogs     people      594
Blogs     day         517
Blogs     love        434
Blogs     life        407
Blogs     world       320
Blogs     home        305
Blogs     don         288
Blogs     book        284
Blogs     feel        276
News      time        547
News      people      439
News      school      347
News      city        345
News      percent     345
News      day         332
News      game        331
News      million     322
News      county      317
News      season      307
Twitter   love      1,044
Twitter   day         964
Twitter   rt          910
Twitter   time        788
Twitter   lol         694
Twitter   people      527
Twitter   tonight     497
Twitter   follow      492
Twitter   happy       478
Twitter   night       449

(The token don is almost certainly the stub left when don't is split at the apostrophe during tokenisation.)
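
A table like Table 2 can be built with tidytext and its built-in stop_words list. A sketch, reusing the cleaned samples from Section 4 (exact counts depend on the sample seed):

library(tidyverse)
library(tidytext)

samples <- bind_rows(
  tibble(source = "Blogs",   text = blogs_sample),
  tibble(source = "News",    text = news_sample),
  tibble(source = "Twitter", text = twitter_sample)
)

top_words <- samples %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%   # drop common function words
  count(source, word, sort = TRUE) %>%
  group_by(source) %>%
  slice_head(n = 10)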

7 Word Cloud
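
A word cloud of the most frequent words in the combined sample appears here in the rendered report. A sketch of how it can be generated with the wordcloud package, reusing the samples table from Section 6:

library(wordcloud)
library(RColorBrewer)
library(tidytext)
library(tidyverse)

word_freq <- samples %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

wordcloud(word_freq$word, word_freq$n,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))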


8 Coverage: How Many Words Do We Need?

A key question for building an efficient model is: how many unique words cover 50 % and 90 % of all word instances?

Table 3 – Vocabulary size required for coverage targets
Coverage Target   Unique Words Needed
50 %                            142
90 %                          6,898

Insight: Just ~142 unique words cover half of all text — confirming that a relatively small vocabulary can power a useful predictor.
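
A sketch of the coverage calculation, reusing the samples table from Section 6. Stop words are kept here, since they dominate the head of the distribution:

library(tidytext)
library(tidyverse)

all_words <- samples %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  mutate(cum_share = cumsum(n) / sum(n))   # running share of all tokens

min(which(all_words$cum_share >= 0.5))     # ~142 words for 50 % coverage
min(which(all_words$cum_share >= 0.9))     # ~6,898 words for 90 % coverage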


9 N-gram Analysis

N-grams (sequences of n consecutive words) are the foundation of the prediction model.

9.1 Bigrams (2-word sequences)
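
A sketch of the bigram extraction with tidytext, reusing the samples table from Section 6; lines with fewer than two words tokenise to NA and are dropped:

library(tidytext)
library(tidyverse)

bigrams <- samples %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)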

9.2 Trigrams (3-word sequences)
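
Trigrams come from the same tokeniser, with n = 3:

trigrams <- samples %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE)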

Table 4 – Unique N-gram counts in the 1 % sample
N-gram Type         Unique N-grams in Sample
Unigrams (1-word)                     51,030
Bigrams (2-word)                     440,077
Trigrams (3-word)                    775,067

10 Key Findings

  1. Scale: The corpus is massive (~100 M words). Sampling 1 % still yields a rich, representative dataset for modelling.
  2. Vocabulary efficiency: About 142 words cover 50 % of the text; the long tail of rare words can be pruned aggressively without hurting user experience.
  3. Source differences: Twitter data is shorter, more informal and noisier; blogs and news are closer to formal written English. The model will need to handle both registers.
  4. N-gram patterns: Common bigrams and trigrams (e.g., “of the”, “in the”, “a lot of”) are highly predictable — a strong signal for the prediction algorithm.

11 Plan for the Prediction Algorithm & Shiny App

11.1 Algorithm: Stupid Backoff N-gram Model

The prediction algorithm will be based on N-gram language modelling with Stupid Backoff smoothing:

  1. Build N-gram tables – compute frequency tables for unigrams, bigrams, trigrams and quadrigrams from the full corpus.
  2. Store efficiently – save them as compressed data.table objects, keeping only N-grams that appear ≥ 2 times.
  3. Predict – given the last 1–3 words typed, look that prefix up in the corresponding N-gram table and return the top-k most likely next words.
  4. Backoff – if no trigram prefix matches, fall back to bigrams; if that also fails, fall back to unigram frequencies.
  5. Profanity filter – strip profane words from the suggestions using a blocklist.

This approach is fast, interpretable and works well even on modest hardware — important for a responsive Shiny app.
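
The lookup itself can be sketched as follows. The ngram_tables list, the unigrams table and the column names are hypothetical; score is the raw ratio count(ngram) / count(prefix), and 0.4 is the backoff penalty proposed for Stupid Backoff by Brants et al. (2007):

library(data.table)

# ngram_tables: list of data.tables, one per prefix length
# (1 = bigram table, 2 = trigram table, 3 = quadrigram table),
# each with columns prefix, word, score
predict_next <- function(typed, ngram_tables, unigrams, lambda = 0.4, k = 3) {
  words <- strsplit(trimws(tolower(typed)), "\\s+")[[1]]
  n_max <- min(length(words), length(ngram_tables))
  if (n_max > 0) {
    for (n in seq(n_max, 1L)) {                  # longest prefix first
      prefix <- paste(tail(words, n), collapse = " ")
      hits   <- ngram_tables[[n]][prefix, on = "prefix", nomatch = NULL]
      if (nrow(hits) > 0) {
        penalty <- lambda ^ (n_max - n)          # discount each backoff step
        return(head(hits[order(-score)][, .(word, score = penalty * score)], k))
      }
    }
  }
  head(unigrams[order(-score), .(word, score)], k)  # last resort: top unigrams
}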

11.2 Shiny App Design

The app will provide a real-time, as-you-type next-word prediction interface:

  • Input box – user types a sentence fragment
  • Suggestion bar – top 3 predicted next words appear as clickable buttons
  • Word accepted – clicking a word appends it to the input and re-predicts
  • Source toggle (stretch goal) – let the user choose a “formal” (blogs/news) or “casual” (Twitter) language model

The UI will be kept minimal so that the prediction latency stays well under one second.
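
A skeletal version of that interface, assuming the predict_next() sketch from Section 11.1 and pre-loaded ngram_tables and unigrams objects; the click-to-append wiring is omitted for brevity:

library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Predictor"),
  textInput("phrase", "Type a sentence fragment:", width = "100%"),
  uiOutput("suggestions")            # holds the suggestion buttons
)

server <- function(input, output, session) {
  output$suggestions <- renderUI({
    req(nzchar(trimws(input$phrase)))
    preds <- predict_next(input$phrase, ngram_tables, unigrams)
    lapply(preds$word, function(w) actionButton(paste0("pick_", w), w))
  })
}

shinyApp(ui, server)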


12 Next Steps

  • Rebuild the N-gram tables from a larger sample (or the full corpus) and prune N-grams seen fewer than 2 times
  • Implement and tune the Stupid Backoff prediction function, including the profanity filter
  • Benchmark prediction accuracy and latency on a held-out sample
  • Build, test and deploy the Shiny app described in Section 11.2

13 Reproducibility

All code used in this report is available in the associated GitHub repository. The analysis was performed with:

## R version 4.6.0 (2026-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=Russian_Kazakhstan.utf8  LC_CTYPE=Russian_Kazakhstan.utf8   
## [3] LC_MONETARY=Russian_Kazakhstan.utf8 LC_NUMERIC=C                       
## [5] LC_TIME=Russian_Kazakhstan.utf8    
## 
## time zone: Asia/Qyzylorda
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0   wordcloud_2.6      RColorBrewer_1.1-3 scales_1.4.0      
##  [5] tidytext_0.4.3     lubridate_1.9.5    forcats_1.0.1      stringr_1.6.0     
##  [9] dplyr_1.2.1        purrr_1.2.2        readr_2.2.0        tidyr_1.3.2       
## [13] tibble_3.3.1       ggplot2_4.0.3      tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] janeaustenr_1.0.0 sass_0.4.10       generics_0.1.4    xml2_1.5.2       
##  [5] stringi_1.8.7     lattice_0.22-9    hms_1.1.4         digest_0.6.39    
##  [9] magrittr_2.0.5    evaluate_1.0.5    grid_4.6.0        timechange_0.4.0 
## [13] fastmap_1.2.0     jsonlite_2.0.0    Matrix_1.7-5      viridisLite_0.4.3
## [17] textshaping_1.0.5 jquerylib_0.1.4   cli_3.6.6         rlang_1.2.0      
## [21] tokenizers_0.3.0  withr_3.0.2       cachem_1.1.0      yaml_2.3.12      
## [25] tools_4.6.0       tzdb_0.5.0        vctrs_0.7.3       R6_2.6.1         
## [29] lifecycle_1.0.5   pkgconfig_2.0.3   pillar_1.11.1     bslib_0.10.0     
## [33] gtable_0.3.6      glue_1.8.1        Rcpp_1.1.1-1.1    systemfonts_1.3.2
## [37] xfun_0.57         tidyselect_1.2.1  rstudioapi_0.18.0 knitr_1.51       
## [41] farver_2.1.2      htmltools_0.5.9   SnowballC_0.7.1   labeling_0.4.3   
## [45] rmarkdown_2.31    svglite_2.2.2     compiler_4.6.0    S7_0.2.2