1 Executive Summary

This report summarises the exploratory analysis carried out on the Coursera / SwiftKey HC Corpora dataset as part of the Data Science Capstone project. The ultimate goal is to build a next-word prediction algorithm and deploy it as a Shiny web application.

The corpus consists of three English-language text files sampled from blogs, news articles and Twitter. Key findings from this exploration are:

  • The three files together contain over 4 million lines and roughly 100 million words.
  • A relatively small vocabulary carries most of the text: roughly 140 of the most frequent words cover half of all word instances, and about 6,900 cover 90 %.
  • Twitter text is shorter, more informal and uses more punctuation and abbreviations than the other sources.
  • N-gram analysis (unigrams, bigrams, trigrams) shows clear patterns that can be exploited for prediction.

2 The Data

2.1 Source & Download

The data are provided by HC Corpora via the Coursera capstone page. After downloading and unzipping, the English-language corpus lives in final/en_US/ and contains three plain-text files:

File                Description
en_US.blogs.txt     Posts scraped from English-language blogs
en_US.news.txt      Articles scraped from English-language news sites
en_US.twitter.txt   Tweets scraped from Twitter

2.2 Loading the Data

# Adjust this path to wherever you unzipped the Coursera dataset
data_path <- "D:/final/en_US/"

blogs_raw   <- readLines(paste0(data_path, "en_US.blogs.txt"),
                         encoding = "UTF-8", skipNul = TRUE)
news_raw    <- readLines(paste0(data_path, "en_US.news.txt"),
                         encoding = "UTF-8", skipNul = TRUE)
twitter_raw <- readLines(paste0(data_path, "en_US.twitter.txt"),
                         encoding = "UTF-8", skipNul = TRUE)

Note: The files are large (167–210 MB each; see Table 1). skipNul = TRUE avoids the embedded-null read errors that are common in this dataset.


3 Basic Summary Statistics

Table 1 – File-level summary statistics
Source    File Size (MB)   Line Count   Word Count   Char Count
Blogs              210.2      899,288   37,334,131   206,824,505
News               205.8    1,010,206   34,371,031   203,214,543
Twitter            167.1    2,360,148   30,373,583   162,096,241

Key takeaway: The corpus is very large — nearly 4.3 million lines and over 100 million words. Working with the full dataset is computationally expensive, so the analysis below uses a random 1 % sample from each source, which is large enough to reveal robust patterns.
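
Table 1 can be reproduced from the raw vectors loaded in Section 2.2. A minimal sketch (nchar counts exclude line breaks, so character totals may differ slightly from the table):

library(stringr)

file_stats <- function(x, path) {
  c(size_mb = round(file.size(path) / 1024^2, 1),
    lines   = length(x),
    words   = sum(str_count(x, "\\S+")),   # words = runs of non-whitespace
    chars   = sum(nchar(x)))
}

file_stats(blogs_raw, paste0(data_path, "en_US.blogs.txt"))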


4 Sampling & Cleaning

Cleaning steps applied (a code sketch follows the list):

  • Convert to lower case
  • Remove numbers, punctuation and special characters (keeping apostrophes for contractions)
  • Collapse extra whitespace
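
A minimal sketch of the sampling and cleaning, assuming the *_raw vectors from Section 2.2; the seed and the regular expressions are illustrative choices, not necessarily the exact ones behind the tables below:

set.seed(42)                        # fix the random sample for reproducibility

sample_lines <- function(x, frac = 0.01) {
  x[rbinom(length(x), 1, frac) == 1]
}

clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)     # keep letters and apostrophes only
  x <- gsub("\\s+", " ", x)         # collapse runs of whitespace
  trimws(x)
}

blogs_sample   <- clean_text(sample_lines(blogs_raw))
news_sample    <- clean_text(sample_lines(news_raw))
twitter_sample <- clean_text(sample_lines(twitter_raw))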

5 Distribution of Line Lengths

Observation: Twitter lines are tightly concentrated below 30 words (enforced by character limits). Blog posts have the widest spread — some entries exceed 150 words per line.
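
The histograms behind this observation can be sketched with ggplot2, assuming the cleaned *_sample vectors from Section 4:

library(tidyverse)

line_lengths <- bind_rows(
  tibble(source = "Blogs",   words = str_count(blogs_sample,   "\\S+")),
  tibble(source = "News",    words = str_count(news_sample,    "\\S+")),
  tibble(source = "Twitter", words = str_count(twitter_sample, "\\S+"))
)

ggplot(line_lengths, aes(words)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ source, scales = "free") +
  labs(x = "Words per line", y = "Lines")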


6 Word Frequency Analysis (Unigrams)

Table 2 – Top 10 Content Words per Source
Source    Word      Count
Blogs     time        859
Blogs     people      594
Blogs     day         517
Blogs     love        434
Blogs     life        407
Blogs     world       320
Blogs     home        305
Blogs     don         288
Blogs     book        284
Blogs     feel        276
News      time        547
News      people      439
News      school      347
News      city        345
News      percent     345
News      day         332
News      game        331
News      million     322
News      county      317
News      season      307
Twitter   love      1,044
Twitter   day         964
Twitter   rt          910
Twitter   time        788
Twitter   lol         694
Twitter   people      527
Twitter   tonight     497
Twitter   follow      492
Twitter   happy       478
Twitter   night       449

(The token don is almost certainly the stub left when don't is split at the apostrophe during tokenisation.)
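
A table like Table 2 can be built with tidytext and its built-in stop_words list. A sketch, reusing the cleaned samples from Section 4 (exact counts depend on the sample seed):

library(tidyverse)
library(tidytext)

samples <- bind_rows(
  tibble(source = "Blogs",   text = blogs_sample),
  tibble(source = "News",    text = news_sample),
  tibble(source = "Twitter", text = twitter_sample)
)

top_words <- samples %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%   # drop common function words
  count(source, word, sort = TRUE) %>%
  group_by(source) %>%
  slice_head(n = 10)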

7 Word Cloud
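
A word cloud of the most frequent words in the combined sample appears here in the rendered report. A sketch of how it can be generated with the wordcloud package, reusing the samples table from Section 6:

library(wordcloud)
library(RColorBrewer)
library(tidytext)
library(tidyverse)

word_freq <- samples %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

wordcloud(word_freq$word, word_freq$n,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))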


8 Coverage: How Many Words Do We Need?

A key question for building an efficient model is: how many unique words cover 50 % and 90 % of all word instances?

Table 3 – Vocabulary size required for coverage targets
Coverage Target   Unique Words Needed
50 %                            142
90 %                          6,898

Insight: Just ~142 unique words cover half of all text — confirming that a relatively small vocabulary can power a useful predictor.
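
A sketch of the coverage calculation, reusing the samples table from Section 6. Stop words are kept here, since they dominate the head of the distribution:

library(tidytext)
library(tidyverse)

all_words <- samples %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  mutate(cum_share = cumsum(n) / sum(n))   # running share of all tokens

min(which(all_words$cum_share >= 0.5))     # ~142 words for 50 % coverage
min(which(all_words$cum_share >= 0.9))     # ~6,898 words for 90 % coverage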


9 N-gram Analysis

N-grams (sequences of n consecutive words) are the foundation of the prediction model.

9.1 Bigrams (2-word sequences)
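
A sketch of the bigram extraction with tidytext, reusing the samples table from Section 6; lines with fewer than two words tokenise to NA and are dropped:

library(tidytext)
library(tidyverse)

bigrams <- samples %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)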

9.2 Trigrams (3-word sequences)
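
Trigrams come from the same tokeniser, with n = 3:

trigrams <- samples %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE)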

Table 4 – Unique N-gram counts in the 1 % sample
N-gram Type         Unique N-grams in Sample
Unigrams (1-word)                     51,030
Bigrams (2-word)                     440,077
Trigrams (3-word)                    775,067

10 Key Findings

  1. Scale: The corpus is massive (~100 M words). Sampling 1 % still yields a rich, representative dataset for modelling.
  2. Vocabulary efficiency: About 142 words cover 50 % of the text; the long tail of rare words can be pruned aggressively without hurting user experience.
  3. Source differences: Twitter data is shorter, more informal and noisier; blogs and news are closer to formal written English. The model will need to handle both registers.
  4. N-gram patterns: Common bigrams and trigrams (e.g., “of the”, “in the”, “a lot of”) are highly predictable — a strong signal for the prediction algorithm.

11 Plan for the Prediction Algorithm & Shiny App

11.1 Algorithm: Stupid Backoff N-gram Model

The prediction algorithm will be based on N-gram language modelling with Stupid Backoff smoothing:

  1. Build N-gram tables – compute frequency tables for unigrams, bigrams, trigrams and quadrigrams from the full corpus.
  2. Store efficiently – save them as compressed data.table objects, keeping only N-grams that appear ≥ 2 times.
  3. Predict – given the last 1–3 words typed, look that prefix up in the corresponding N-gram table and return the top-k most likely next words.
  4. Backoff – if no trigram prefix matches, fall back to bigrams; if that also fails, fall back to unigram frequencies.
  5. Profanity filter – strip profane words from the suggestions using a blocklist.

This approach is fast, interpretable and works well even on modest hardware — important for a responsive Shiny app.
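
The lookup itself can be sketched as follows. The ngram_tables list, the unigrams table and the column names are hypothetical; score is the raw ratio count(ngram) / count(prefix), and 0.4 is the backoff penalty proposed for Stupid Backoff by Brants et al. (2007):

library(data.table)

# ngram_tables: list of data.tables, one per prefix length
# (1 = bigram table, 2 = trigram table, 3 = quadrigram table),
# each with columns prefix, word, score
predict_next <- function(typed, ngram_tables, unigrams, lambda = 0.4, k = 3) {
  words <- strsplit(trimws(tolower(typed)), "\\s+")[[1]]
  n_max <- min(length(words), length(ngram_tables))
  if (n_max > 0) {
    for (n in seq(n_max, 1L)) {                  # longest prefix first
      prefix <- paste(tail(words, n), collapse = " ")
      hits   <- ngram_tables[[n]][prefix, on = "prefix", nomatch = NULL]
      if (nrow(hits) > 0) {
        penalty <- lambda ^ (n_max - n)          # discount each backoff step
        return(head(hits[order(-score)][, .(word, score = penalty * score)], k))
      }
    }
  }
  head(unigrams[order(-score), .(word, score)], k)  # last resort: top unigrams
}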

11.2 Shiny App Design

The app will provide a real-time, as-you-type next-word prediction interface:

  • Input box – user types a sentence fragment
  • Suggestion bar – top 3 predicted next words appear as clickable buttons
  • Word accepted – clicking a word appends it to the input and re-predicts
  • Source toggle (stretch goal) – let the user choose a “formal” (blogs/news) or “casual” (Twitter) language model

The UI will be kept minimal so that the prediction latency stays well under one second.
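
A skeletal version of that interface, assuming the predict_next() sketch from Section 11.1 and pre-loaded ngram_tables and unigrams objects; the click-to-append wiring is omitted for brevity:

library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Predictor"),
  textInput("phrase", "Type a sentence fragment:", width = "100%"),
  uiOutput("suggestions")            # holds the suggestion buttons
)

server <- function(input, output, session) {
  output$suggestions <- renderUI({
    req(nzchar(trimws(input$phrase)))
    preds <- predict_next(input$phrase, ngram_tables, unigrams)
    lapply(preds$word, function(w) actionButton(paste0("pick_", w), w))
  })
}

shinyApp(ui, server)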


12 Next Steps

  • Rebuild the N-gram tables from a larger sample (or the full corpus) and prune N-grams seen fewer than 2 times
  • Implement and tune the Stupid Backoff prediction function, including the profanity filter
  • Benchmark prediction accuracy and latency on a held-out sample
  • Build, test and deploy the Shiny app described in Section 11.2

13 Reproducibility

All code used in this report is available in the associated GitHub repository. The analysis was performed with:

## R version 4.6.0 (2026-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=Russian_Kazakhstan.utf8  LC_CTYPE=Russian_Kazakhstan.utf8   
## [3] LC_MONETARY=Russian_Kazakhstan.utf8 LC_NUMERIC=C                       
## [5] LC_TIME=Russian_Kazakhstan.utf8    
## 
## time zone: Asia/Qyzylorda
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0   wordcloud_2.6      RColorBrewer_1.1-3 scales_1.4.0      
##  [5] tidytext_0.4.3     lubridate_1.9.5    forcats_1.0.1      stringr_1.6.0     
##  [9] dplyr_1.2.1        purrr_1.2.2        readr_2.2.0        tidyr_1.3.2       
## [13] tibble_3.3.1       ggplot2_4.0.3      tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] janeaustenr_1.0.0 sass_0.4.10       generics_0.1.4    xml2_1.5.2       
##  [5] stringi_1.8.7     lattice_0.22-9    hms_1.1.4         digest_0.6.39    
##  [9] magrittr_2.0.5    evaluate_1.0.5    grid_4.6.0        timechange_0.4.0 
## [13] fastmap_1.2.0     jsonlite_2.0.0    Matrix_1.7-5      viridisLite_0.4.3
## [17] textshaping_1.0.5 jquerylib_0.1.4   cli_3.6.6         rlang_1.2.0      
## [21] tokenizers_0.3.0  withr_3.0.2       cachem_1.1.0      yaml_2.3.12      
## [25] tools_4.6.0       tzdb_0.5.0        vctrs_0.7.3       R6_2.6.1         
## [29] lifecycle_1.0.5   pkgconfig_2.0.3   pillar_1.11.1     bslib_0.10.0     
## [33] gtable_0.3.6      glue_1.8.1        Rcpp_1.1.1-1.1    systemfonts_1.3.2
## [37] xfun_0.57         tidyselect_1.2.1  rstudioapi_0.18.0 knitr_1.51       
## [41] farver_2.1.2      htmltools_0.5.9   SnowballC_0.7.1   labeling_0.4.3   
## [45] rmarkdown_2.31    svglite_2.2.2     compiler_4.6.0    S7_0.2.2