2025-05-07

1. πŸ“„ Project Overview

Goal: Predict the next word using a backoff n-gram language model

Dataset: English corpora from Twitter, Blogs, and News

Approach:

Clean and tokenize input

Generate n-gram frequency tables

Use backoff algorithm to predict next word

Tools: R, Shiny, tidytext, DT

2. πŸ“‚ Data Loading & Sampling

read_sample <- function(path, n = 5000, seed = 123) {
  # Read the whole file, then keep a reproducible random sample of n lines
  set.seed(seed)
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = n)
}

twitter_lines <- read_sample("data/en_US.twitter.txt")
blogs_lines   <- read_sample("data/en_US.blogs.txt")
news_lines    <- read_sample("data/en_US.news.txt")

Each source file contains millions of lines (a quick line-count check is sketched below)

We sample 5,000 lines from each source for efficiency
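
As a sanity check, the full line counts can be verified before sampling. A minimal sketch, assuming the same file paths as above (reading the full files takes a little while):

sapply(
  c(twitter = "data/en_US.twitter.txt",
    blogs   = "data/en_US.blogs.txt",
    news    = "data/en_US.news.txt"),
  function(path) length(readLines(path, encoding = "UTF-8", skipNul = TRUE))
)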

3. πŸ€– Tokenization & Unigrams

library(tidyverse)
library(tidytext)
library(stringr)

twitter_df <- tibble(text = twitter_lines)

tokens <- twitter_df %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "^[a-z']+$"))

unigram_model <- tokens %>%
  count(word, sort = TRUE)

Text is converted to lowercase by unnest_tokens()

Only tokens made of letters and apostrophes are kept (matching ^[a-z']+$)

Word counts form the unigram frequency table (inspected below)
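
To inspect the table, the counts can be turned into relative frequencies; a small sketch (the exact top words depend on the random sample):

unigram_model %>%
  mutate(prop = n / sum(n)) %>%   # relative frequency of each word
  head(10)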

4. πŸ”„ Bigrams and Trigrams

bigram_model <- twitter_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%                 # drop lines too short to form a bigram
  count(bigram, sort = TRUE) %>%
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  group_by(w1) %>%
  slice_max(n, n = 5) %>%                    # keep top 5 completions per first word
  ungroup()

trigram_model <- twitter_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%                # drop lines too short to form a trigram
  count(trigram, sort = TRUE) %>%
  separate(trigram, into = c("w1", "w2", "w3"), sep = " ") %>%
  group_by(w1, w2) %>%
  slice_max(n, n = 3) %>%                    # keep top 3 completions per bigram context
  ungroup()

Bigrams = 2-word phrases; trigrams = 3-word phrases

Only the most frequent completions per context are kept, so the lookup tables stay small

Used for context-based prediction (see the lookup example below)
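
For illustration, the stored completions for any first word can be looked up directly; "happy" is just an example word here, not necessarily present in the sampled data:

bigram_model %>%
  filter(w1 == "happy") %>%   # top completions observed after "happy"
  arrange(desc(n))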

5. 🎯 Prediction Logic (Backoff)

predict_next_word <- function(input_text) {
  # Normalise the input the same way the corpus was tokenized
  words <- str_split(str_squish(tolower(input_text)), " ")[[1]]
  words <- words[words != ""]
  len <- length(words)

  # 1. Try the trigram table using the last two words as context
  if (len >= 2) {
    pred <- trigram_model %>%
      filter(w1 == words[len - 1], w2 == words[len]) %>%
      arrange(desc(n)) %>%
      pull(w3)
    if (length(pred) > 0) return(pred[1])
  }

  # 2. Back off to the bigram table using only the last word
  if (len >= 1) {
    pred <- bigram_model %>%
      filter(w1 == words[len]) %>%
      arrange(desc(n)) %>%
      pull(w2)
    if (length(pred) > 0) return(pred[1])
  }

  # 3. Fall back to the most frequent unigram
  return(unigram_model$word[1])
}

Tries the trigram table first, then the bigram table, then the unigram table

Falls back to the most frequent unigram when no match is found

6. Sample Predictions

test_phrases <- c(
  "I love", "You are the", "This is so",
  "I believe", "When I was", "The most important",
  "The president said", "According to the", "In the event"
)

results <- map_chr(test_phrases, predict_next_word)
data.frame(Phrase = test_phrases, Prediction = results)
##               Phrase Prediction
## 1             I love        you
## 2        You are the       best
## 3         This is so    excited
## 4          I believe         in
## 5         When I was          a
## 6 The most important      great
## 7 The president said          i
## 8   According to the     movies
## 9       In the event          i

7. πŸ“ˆ Creative Enhancements

TF-IDF used to compare word importance across the three sources (sketched below)

Word2Vec embeddings explored for handling rare n-grams

Lookup tables compressed to keep the deployed Shiny app lightweight

Optional: Good-Turing smoothing to assign probability to unseen n-grams
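
A minimal sketch of the TF-IDF analysis, treating each source (Twitter, blogs, news) as a "document"; the object names are illustrative, and the heavy lifting is done by tidytext::bind_tf_idf():

source_words <- bind_rows(
  tibble(source = "twitter", text = twitter_lines),
  tibble(source = "blogs",   text = blogs_lines),
  tibble(source = "news",    text = news_lines)
) %>%
  unnest_tokens(word, text) %>%
  count(source, word, sort = TRUE)

source_tfidf <- source_words %>%
  bind_tf_idf(word, source, n) %>%   # highlights words that characterise one source
  arrange(desc(tf_idf))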

8. πŸ“Š Performance & Coverage

library(tictoc)
tic("Prediction")
predict_next_word("I can't wait")
## [1] "to"
toc()
## Prediction: 0.006 sec elapsed
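
A single call is a noisy benchmark; one rough way to average over many calls, reusing the test phrases from above purely as an illustration:

tic("100 predictions")
invisible(map_chr(rep(test_phrases, length.out = 100), predict_next_word))
toc()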

9. Vocabulary Coverage Analysis

total_words <- sum(unigram_model$n)

# Distinct words whose cumulative count stays under 50% of all word instances
# (unigram_model is already sorted by descending frequency)
coverage_50 <- unigram_model %>%
  mutate(cum_sum = cumsum(n)) %>%
  filter(cum_sum < total_words * 0.5) %>%
  nrow()

# Same idea for 90% coverage
coverage_90 <- unigram_model %>%
  mutate(cum_sum = cumsum(n)) %>%
  filter(cum_sum < total_words * 0.9) %>%
  nrow()

data.frame(coverage_50, coverage_90)
##   coverage_50 coverage_90
## 1         120        3674
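
In this sample, roughly 120 unique words already cover half of all word instances, and about 3,700 cover 90%. The full cumulative coverage curve can be visualised with ggplot2; a small sketch (column names are illustrative):

coverage_curve <- unigram_model %>%
  mutate(rank = row_number(),
         cum_prop = cumsum(n) / total_words)

ggplot(coverage_curve, aes(rank, cum_prop)) +
  geom_line() +
  scale_x_log10() +
  labs(x = "Unique words (log scale)",
       y = "Proportion of word instances covered")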

10. πŸ™ Thank You!

Full presentation: RPubs link

Built with R, tidyverse, Shiny, DT, and πŸ’‘

Connect with me on LinkedIn!