The capstone uses the HC Corpora English datasets for blogs, news, and Twitter. The goal is to build an efficient next-word prediction model similar to those used in mobile smart keyboards. The final product is a Shiny application that returns the top predicted next word(s) for an input phrase.
A common baseline approach is an n-gram language model (unigrams, bigrams, trigrams, and 4-grams) combined with a backoff strategy for unseen word sequences. Text processing typically includes sampling, cleaning, tokenization, and frequency analysis.
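To make the cleaning and tokenization step concrete, here is a minimal sketch in base R. The function name and the exact regular expressions are illustrative, not necessarily the ones used to produce the output below; they follow the conservative cleaning described later in this report (lowercasing, URL removal, removing non-letter characters, whitespace normalization).

```r
# Illustrative cleaning helper (assumed, not the report's exact code).
clean_lines <- function(x) {
  x <- tolower(x)                               # lowercase everything
  x <- gsub("(https?://|www\\.)\\S+", " ", x)   # drop URLs
  x <- gsub("[^a-z ]", " ", x)                  # keep letters and spaces only
  x <- gsub("\\s+", " ", x)                     # collapse repeated whitespace
  trimws(x)
}

clean_lines("Check THIS out: http://example.com!!  So cool :)")
# [1] "check this out so cool"
```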
## Using base_dir: data_raw/final/en_US
## en_US.blogs.txt size: 200.42 MB
## $blogs
## $blogs$lines
## [1] 899288
##
## $blogs$max_chars
## [1] 40833
##
## $blogs$max_words
## [1] 6630
##
##
## $news
## $news$lines
## [1] 1010242
##
## $news$max_chars
## [1] 11384
##
## $news$max_words
## [1] 1792
##
##
## $twitter
## $twitter$lines
## [1] 2360148
##
## $twitter$max_chars
## [1] 140
##
## $twitter$max_words
## [1] 47
##
## Longest line (any of 3): 40833 characters
## Max words in a line (any of 3): 6630 words
## $love_lines
## [1] 77639
##
## $hate_lines
## [1] 15561
##
## $ratio
## [1] 4.989332
##
## Love/Hate ratio (twitter): 4.99 (about 5)
## Metric Value
## 1: Blogs file size (MB) 200.42
## 2: Blog lines 899288.00
## 3: News lines 1010242.00
## 4: Twitter lines 2360148.00
## 5: Longest line (characters, any file) 40833.00
## 6: Max words in a line (any file) 6630.00
## 7: Twitter love lines 77639.00
## 8: Twitter hate lines 15561.00
## 9: Love/Hate ratio (twitter) 4.99
The three sources differ markedly: Twitter lines are short by design, while blogs can contain extremely long lines. This motivates representative sampling, consistent text cleaning, and efficient storage/lookup for n-grams to support a responsive Shiny app.
To keep the analysis fast on a laptop, we use a random sample of lines from each dataset and apply conservative cleaning (lowercasing, URL removal, removal of non-letter characters, and whitespace normalization). The sampled line counts used for this exploratory analysis are shown below; sample sizes can be increased for the final model.
## source sampled_lines
## 1: blogs 9042
## 2: news 9925
## 3: twitter 23366
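The unigram and bigram frequency tables shown next were built from the cleaned samples. As a minimal sketch of that counting step (assuming data.table; the helper name is illustrative, and a `source` column is added per corpus afterwards):

```r
library(data.table)

# Count n-grams of a given order from a character vector of cleaned lines.
count_ngrams <- function(lines, n = 2L) {
  tokens <- strsplit(lines, "\\s+")             # one token vector per line
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    # slide a window of width n over this line's tokens
    sapply(seq_len(length(w) - n + 1L),
           function(i) paste(w[i:(i + n - 1L)], collapse = " "))
  }))
  dt <- data.table(ngram = grams)
  dt[, .N, by = ngram][order(-N)]               # frequency table, most common first
}

# Example: bigram counts for a toy sample
count_ngrams(c("thanks for the follow", "for the win"), n = 2L)
```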
## ngram N source
## 1: the 18527 blogs
## 2: and 10854 blogs
## 3: to 10746 blogs
## 4: a 9054 blogs
## 5: of 8623 blogs
## 6: i 8456 blogs
## 7: in 5992 blogs
## 8: that 4781 blogs
## 9: is 4401 blogs
## 10: it 4376 blogs
## 11: for 3804 blogs
## 12: you 3322 blogs
## 13: on 2847 blogs
## 14: with 2829 blogs
## 15: my 2704 blogs
## 16: was 2691 blogs
## 17: this 2540 blogs
## 18: have 2268 blogs
## 19: as 2267 blogs
## 20: be 2183 blogs
## 21: the 19274 news
## 22: a 8764 news
## 23: to 8754 news
## 24: and 8691 news
## 25: of 7605 news
## 26: in 6665 news
## 27: for 3500 news
## 28: that 3379 news
## 29: is 2783 news
## 30: on 2550 news
## 31: with 2430 news
## 32: said 2406 news
## 33: it 2291 news
## 34: was 2207 news
## 35: he 2120 news
## 36: at 2092 news
## 37: as 1854 news
## 38: i 1707 news
## 39: but 1503 news
## 40: his 1459 news
## 41: the 9203 twitter
## 42: to 7582 twitter
## 43: i 6909 twitter
## 44: a 6035 twitter
## 45: you 5500 twitter
## 46: and 4408 twitter
## 47: for 3811 twitter
## 48: in 3793 twitter
## 49: is 3581 twitter
## 50: of 3575 twitter
## 51: it 2996 twitter
## 52: my 2820 twitter
## 53: on 2691 twitter
## 54: that 2360 twitter
## 55: me 2014 twitter
## 56: at 1797 twitter
## 57: be 1785 twitter
## 58: with 1756 twitter
## 59: your 1739 twitter
## 60: this 1649 twitter
## ngram N source
## 1: of the 1841 blogs
## 2: in the 1536 blogs
## 3: to the 857 blogs
## 4: on the 731 blogs
## 5: to be 685 blogs
## 6: for the 628 blogs
## 7: and the 584 blogs
## 8: and i 555 blogs
## 9: at the 528 blogs
## 10: i have 521 blogs
## 11: it is 476 blogs
## 12: i was 465 blogs
## 13: is a 440 blogs
## 14: in a 440 blogs
## 15: with the 437 blogs
## 16: it was 436 blogs
## 17: i am 424 blogs
## 18: that i 395 blogs
## 19: it s 381 blogs
## 20: from the 375 blogs
## 21: of the 1933 news
## 22: in the 1779 news
## 23: to the 791 news
## 24: on the 715 news
## 25: for the 693 news
## 26: at the 607 news
## 27: in a 542 news
## 28: and the 499 news
## 29: to be 437 news
## 30: with the 422 news
## 31: from the 360 news
## 32: of a 337 news
## 33: he said 322 news
## 34: as a 318 news
## 35: with a 291 news
## 36: is a 290 news
## 37: by the 282 news
## 38: for a 281 news
## 39: one of 271 news
## 40: it was 263 news
## 41: in the 818 twitter
## 42: for the 730 twitter
## 43: of the 561 twitter
## 44: on the 448 twitter
## 45: to the 447 twitter
## 46: to be 438 twitter
## 47: thanks for 427 twitter
## 48: at the 373 twitter
## 49: thank you 364 twitter
## 50: going to 341 twitter
## 51: i love 338 twitter
## 52: if you 335 twitter
## 53: for a 327 twitter
## 54: have a 313 twitter
## 55: i have 283 twitter
## 56: i am 273 twitter
## 57: to see 266 twitter
## 58: will be 252 twitter
## 59: i think 244 twitter
## 60: want to 243 twitter
## total_tokens unique_words words_for_50pct words_for_90pct
## 1: 1001404 51915 142 7004
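The coverage figures above (`words_for_50pct`, `words_for_90pct`) can be reproduced from a sorted unigram frequency table with a cumulative sum. A minimal sketch, assuming a data.table `uni` with columns `ngram` and `N` as in the tables above (the function name is illustrative):

```r
library(data.table)

# How many distinct words are needed to cover proportion p of all tokens?
words_for_coverage <- function(uni, p) {
  setorder(uni, -N)                      # most frequent words first
  cum <- cumsum(uni$N) / sum(uni$N)      # cumulative share of all tokens
  which(cum >= p)[1]                     # first index reaching coverage p
}

# e.g. words_for_coverage(uni, 0.5) and words_for_coverage(uni, 0.9)
# correspond to the words_for_50pct and words_for_90pct columns above.
```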
# Plan for Tasks 3–7 (brief)

Prediction model approach - The production model will use n-grams (2-grams, 3-grams, and 4-grams) with a backoff strategy (a minimal sketch of the lookup follows the list below):

- Use the last 3 words of the input phrase to query the 4-gram table.
- Back off to the last 2 words (3-grams), then to the last word (bigrams).
- Fall back to the most common unigrams when no match exists.
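A minimal sketch of this backoff lookup, assuming pre-built data.table frequency tables `ng2`, `ng3`, `ng4` with columns `prefix`, `next_word`, and `N`, each keyed on `prefix` (all names here are illustrative, not the final implementation):

```r
library(data.table)

# ng2, ng3, ng4: tables whose prefix column holds 1, 2, or 3 preceding words;
# keyed via setkey(ng2, prefix), etc., for fast lookup.
# top_unigrams: character vector of the most frequent words, used as a last resort.
predict_next <- function(phrase, ng2, ng3, ng4, top_unigrams, k = 3L) {
  # The real app would apply the same cleaning as the training text;
  # lowercasing and splitting on whitespace is enough for this sketch.
  words  <- strsplit(tolower(phrase), "\\s+")[[1]]
  tables <- list(ng2, ng3, ng4)                 # indexed by prefix length
  for (n in 3:1) {                              # longest prefix first, then back off
    if (length(words) < n) next
    pfx  <- paste(tail(words, n), collapse = " ")
    hits <- tables[[n]][.(pfx), nomatch = 0L]   # keyed join on the prefix
    if (nrow(hits) > 0L) return(head(hits[order(-N), next_word], k))
  }
  head(top_unigrams, k)                         # no prefix matched anywhere
}
```

For example, `predict_next("thanks for the", ng2, ng3, ng4, top_unigrams)` would first look up the 3-word prefix "thanks for the" in the 4-gram table and only back off to shorter prefixes if no match is found.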
Efficiency considerations - To support deployment on shinyapps.io / Posit Connect, we will:

- sample data for training (or prune rare n-grams),
- store compact lookup tables, and
- ensure prediction runs quickly (low latency).
Profanity filtering (optional enhancement) - A profanity list can be used to remove offensive tokens from candidate predictions without deleting them from the training text entirely.
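A minimal sketch of filtering at prediction time; `bad_words` is an assumed character vector whose source is left to the implementer, and the function name is illustrative:

```r
# Drop any candidate prediction that appears in the profanity list.
filter_predictions <- function(candidates, bad_words) {
  candidates[!tolower(candidates) %in% tolower(bad_words)]
}

# e.g. filter_predictions(c("love", "damn", "time"), bad_words = c("damn"))
# [1] "love" "time"
```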
sessionInfo()