The capstone uses the HC Corpora English datasets for blogs, news, and Twitter. The goal is to build an efficient next-word prediction model similar to those used in mobile smart keyboards. The final product is a Shiny application that returns the top predicted next word(s) for an input phrase.
A common baseline approach is an n-gram language model (unigrams, bigrams, trigrams, and 4-grams) combined with a backoff strategy for unseen word sequences. Text processing typically includes sampling, cleaning, tokenization, and frequency analysis.
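To make the cleaning and tokenization step concrete, here is a minimal sketch in base R. The function name and the exact regular expressions are illustrative, not necessarily the ones used to produce the output below; they follow the conservative cleaning described later in this report (lowercasing, URL removal, removing non-letter characters, whitespace normalization).

```r
# Illustrative cleaning helper (assumed, not the report's exact code).
clean_lines <- function(x) {
  x <- tolower(x)                               # lowercase everything
  x <- gsub("(https?://|www\\.)\\S+", " ", x)   # drop URLs
  x <- gsub("[^a-z ]", " ", x)                  # keep letters and spaces only
  x <- gsub("\\s+", " ", x)                     # collapse repeated whitespace
  trimws(x)
}

clean_lines("Check THIS out: http://example.com!!  So cool :)")
# [1] "check this out so cool"
```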
## Using base_dir: data_raw/final/en_US
## en_US.blogs.txt size: 200.42 MB
## $blogs
## $blogs$lines
## [1] 899288
##
## $blogs$max_chars
## [1] 40833
##
## $blogs$max_words
## [1] 6630
##
##
## $news
## $news$lines
## [1] 1010242
##
## $news$max_chars
## [1] 11384
##
## $news$max_words
## [1] 1792
##
##
## $twitter
## $twitter$lines
## [1] 2360148
##
## $twitter$max_chars
## [1] 140
##
## $twitter$max_words
## [1] 47
##
## Longest line (any of 3): 40833 characters
## Max words in a line (any of 3): 6630 words
## $love_lines
## [1] 77639
##
## $hate_lines
## [1] 15561
##
## $ratio
## [1] 4.989332
##
## Love/Hate ratio (twitter): 4.99 (about 5)
## Metric Value
## 1: Blogs file size (MB) 200.42
## 2: Blog lines 899288.00
## 3: News lines 1010242.00
## 4: Twitter lines 2360148.00
## 5: Longest line (characters, any file) 40833.00
## 6: Max words in a line (any file) 6630.00
## 7: Twitter love lines 77639.00
## 8: Twitter hate lines 15561.00
## 9: Love/Hate ratio (twitter) 4.99
The three sources differ markedly: Twitter lines are short by design, while blogs can contain extremely long lines. This motivates representative sampling, consistent text cleaning, and efficient storage/lookup for n-grams to support a responsive Shiny app.
To keep the analysis fast on a laptop, we use a random sample of lines from each dataset and apply conservative cleaning (lowercasing, URL removal, removal of non-letter characters, and whitespace normalization). The sampled line counts used for this exploratory analysis are shown below; sample sizes can be increased for the final model.
## source sampled_lines
## 1: blogs 9042
## 2: news 9925
## 3: twitter 23366
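The unigram and bigram frequency tables shown next were built from the cleaned samples. As a minimal sketch of that counting step (assuming data.table; the helper name is illustrative, and a `source` column is added per corpus afterwards):

```r
library(data.table)

# Count n-grams of a given order from a character vector of cleaned lines.
count_ngrams <- function(lines, n = 2L) {
  tokens <- strsplit(lines, "\\s+")             # one token vector per line
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    # slide a window of width n over this line's tokens
    sapply(seq_len(length(w) - n + 1L),
           function(i) paste(w[i:(i + n - 1L)], collapse = " "))
  }))
  dt <- data.table(ngram = grams)
  dt[, .N, by = ngram][order(-N)]               # frequency table, most common first
}

# Example: bigram counts for a toy sample
count_ngrams(c("thanks for the follow", "for the win"), n = 2L)
```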
## ngram N source
## 1: the 18527 blogs
## 2: and 10854 blogs
## 3: to 10746 blogs
## 4: a 9054 blogs
## 5: of 8623 blogs
## 6: i 8456 blogs
## 7: in 5992 blogs
## 8: that 4781 blogs
## 9: is 4401 blogs
## 10: it 4376 blogs
## 11: for 3804 blogs
## 12: you 3322 blogs
## 13: on 2847 blogs
## 14: with 2829 blogs
## 15: my 2704 blogs
## 16: was 2691 blogs
## 17: this 2540 blogs
## 18: have 2268 blogs
## 19: as 2267 blogs
## 20: be 2183 blogs
## 21: the 19274 news
## 22: a 8764 news
## 23: to 8754 news
## 24: and 8691 news
## 25: of 7605 news
## 26: in 6665 news
## 27: for 3500 news
## 28: that 3379 news
## 29: is 2783 news
## 30: on 2550 news
## 31: with 2430 news
## 32: said 2406 news
## 33: it 2291 news
## 34: was 2207 news
## 35: he 2120 news
## 36: at 2092 news
## 37: as 1854 news
## 38: i 1707 news
## 39: but 1503 news
## 40: his 1459 news
## 41: the 9203 twitter
## 42: to 7582 twitter
## 43: i 6909 twitter
## 44: a 6035 twitter
## 45: you 5500 twitter
## 46: and 4408 twitter
## 47: for 3811 twitter
## 48: in 3793 twitter
## 49: is 3581 twitter
## 50: of 3575 twitter
## 51: it 2996 twitter
## 52: my 2820 twitter
## 53: on 2691 twitter
## 54: that 2360 twitter
## 55: me 2014 twitter
## 56: at 1797 twitter
## 57: be 1785 twitter
## 58: with 1756 twitter
## 59: your 1739 twitter
## 60: this 1649 twitter
## ngram N source
## 1: of the 1841 blogs
## 2: in the 1536 blogs
## 3: to the 857 blogs
## 4: on the 731 blogs
## 5: to be 685 blogs
## 6: for the 628 blogs
## 7: and the 584 blogs
## 8: and i 555 blogs
## 9: at the 528 blogs
## 10: i have 521 blogs
## 11: it is 476 blogs
## 12: i was 465 blogs
## 13: is a 440 blogs
## 14: in a 440 blogs
## 15: with the 437 blogs
## 16: it was 436 blogs
## 17: i am 424 blogs
## 18: that i 395 blogs
## 19: it s 381 blogs
## 20: from the 375 blogs
## 21: of the 1933 news
## 22: in the 1779 news
## 23: to the 791 news
## 24: on the 715 news
## 25: for the 693 news
## 26: at the 607 news
## 27: in a 542 news
## 28: and the 499 news
## 29: to be 437 news
## 30: with the 422 news
## 31: from the 360 news
## 32: of a 337 news
## 33: he said 322 news
## 34: as a 318 news
## 35: with a 291 news
## 36: is a 290 news
## 37: by the 282 news
## 38: for a 281 news
## 39: one of 271 news
## 40: it was 263 news
## 41: in the 818 twitter
## 42: for the 730 twitter
## 43: of the 561 twitter
## 44: on the 448 twitter
## 45: to the 447 twitter
## 46: to be 438 twitter
## 47: thanks for 427 twitter
## 48: at the 373 twitter
## 49: thank you 364 twitter
## 50: going to 341 twitter
## 51: i love 338 twitter
## 52: if you 335 twitter
## 53: for a 327 twitter
## 54: have a 313 twitter
## 55: i have 283 twitter
## 56: i am 273 twitter
## 57: to see 266 twitter
## 58: will be 252 twitter
## 59: i think 244 twitter
## 60: want to 243 twitter
## total_tokens unique_words words_for_50pct words_for_90pct
## 1: 1001404 51915 142 7004
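The coverage figures above (`words_for_50pct`, `words_for_90pct`) can be reproduced from a sorted unigram frequency table with a cumulative sum. A minimal sketch, assuming a data.table `uni` with columns `ngram` and `N` as in the tables above (the function name is illustrative):

```r
library(data.table)

# How many distinct words are needed to cover proportion p of all tokens?
words_for_coverage <- function(uni, p) {
  setorder(uni, -N)                      # most frequent words first
  cum <- cumsum(uni$N) / sum(uni$N)      # cumulative share of all tokens
  which(cum >= p)[1]                     # first index reaching coverage p
}

# e.g. words_for_coverage(uni, 0.5) and words_for_coverage(uni, 0.9)
# correspond to the words_for_50pct and words_for_90pct columns above.
```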
# Plan for Tasks 3–7 (brief)

Prediction model approach - The production model will use n-grams (2-grams, 3-grams, and 4-grams) with a backoff strategy (a minimal sketch of the lookup follows the list below):

- Use the last 3 words of the input phrase to query the 4-gram table.
- Back off to the last 2 words (3-grams), then to the last word (bigrams).
- Fall back to the most common unigrams when no match exists.
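A minimal sketch of this backoff lookup, assuming pre-built data.table frequency tables `ng2`, `ng3`, `ng4` with columns `prefix`, `next_word`, and `N`, each keyed on `prefix` (all names here are illustrative, not the final implementation):

```r
library(data.table)

# ng2, ng3, ng4: tables whose prefix column holds 1, 2, or 3 preceding words;
# keyed via setkey(ng2, prefix), etc., for fast lookup.
# top_unigrams: character vector of the most frequent words, used as a last resort.
predict_next <- function(phrase, ng2, ng3, ng4, top_unigrams, k = 3L) {
  # The real app would apply the same cleaning as the training text;
  # lowercasing and splitting on whitespace is enough for this sketch.
  words  <- strsplit(tolower(phrase), "\\s+")[[1]]
  tables <- list(ng2, ng3, ng4)                 # indexed by prefix length
  for (n in 3:1) {                              # longest prefix first, then back off
    if (length(words) < n) next
    pfx  <- paste(tail(words, n), collapse = " ")
    hits <- tables[[n]][.(pfx), nomatch = 0L]   # keyed join on the prefix
    if (nrow(hits) > 0L) return(head(hits[order(-N), next_word], k))
  }
  head(top_unigrams, k)                         # no prefix matched anywhere
}
```

For example, `predict_next("thanks for the", ng2, ng3, ng4, top_unigrams)` would first look up the 3-word prefix "thanks for the" in the 4-gram table and only back off to shorter prefixes if no match is found.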
Efficiency considerations - To support deployment on shinyapps.io / Posit Connect, we will:

- sample data for training (or prune rare n-grams),
- store compact lookup tables, and
- ensure prediction runs quickly (low latency).
Profanity filtering (optional enhancement) - A profanity list can be used to remove offensive tokens from candidate predictions without deleting them from the training text entirely.
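A minimal sketch of filtering at prediction time; `bad_words` is an assumed character vector whose source is left to the implementer, and the function name is illustrative:

```r
# Drop any candidate prediction that appears in the profanity list.
filter_predictions <- function(candidates, bad_words) {
  candidates[!tolower(candidates) %in% tolower(bad_words)]
}

# e.g. filter_predictions(c("love", "damn", "time"), bad_words = c("damn"))
# [1] "love" "time"
```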
sessionInfo()