Goal: Predict the next word using a backoff n-gram language model
Dataset: English corpora from Twitter, Blogs, and News
Approach:
Clean and tokenize input
Generate n-gram frequency tables
Use backoff algorithm to predict next word
2025-05-07
# Read a corpus file and return a random sample of n lines
# (seed fixed so the sample is reproducible across runs)
read_sample <- function(path, n = 5000, seed = 123) {
  set.seed(seed)
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = n)
}
twitter_lines <- read_sample("data/en_US.twitter.txt")
blogs_lines <- read_sample("data/en_US.blogs.txt")
news_lines <- read_sample("data/en_US.news.txt")
Each source file contains millions of lines, so a fixed random sample of 5,000 lines is taken from each
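As a quick sanity check of the "millions of lines" point (an optional step, not part of the original script), the full files can be counted before sampling; the paths match the ones used above:

# Count total lines in each full corpus file (slow on large files)
corpus_files <- c("data/en_US.twitter.txt", "data/en_US.blogs.txt", "data/en_US.news.txt")
sapply(corpus_files, function(f) length(readLines(f, skipNul = TRUE)))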
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(stringr)

twitter_df <- tibble(text = twitter_lines)

tokens <- twitter_df %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "^[a-z']+$"))

unigram_model <- tokens %>%
  count(word, sort = TRUE)
Text converted to lowercase by unnest_tokens()
Only tokens made of lowercase letters and apostrophes are kept
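For instance, the "^[a-z']+$" filter keeps contraction-style tokens but drops anything containing digits or punctuation (a quick illustration, not part of the original script):

str_detect(c("don't", "hello", "covid19", "u.s."), "^[a-z']+$")
# expected: TRUE TRUE FALSE FALSE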
bigram_model <- twitter_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   # drop lines too short to form a bigram
  count(bigram, sort = TRUE) %>%
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  group_by(w1) %>%
  slice_max(n, n = 5) %>%      # keep the 5 most frequent continuations per w1
  ungroup()
trigram_model <- twitter_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%  # drop lines too short to form a trigram
  count(trigram, sort = TRUE) %>%
  separate(trigram, into = c("w1", "w2", "w3"), sep = " ") %>%
  group_by(w1, w2) %>%
  slice_max(n, n = 3) %>%      # keep the 3 most frequent continuations per (w1, w2)
  ungroup()
Bigrams = 2-word sequences; trigrams = 3-word sequences
Only the most frequent continuations per context are kept (5 per one-word prefix, 3 per two-word prefix) to limit table size
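As a toy illustration (hypothetical sentence, not from the corpus), unnest_tokens() with token = "ngrams" produces overlapping, lowercased word sequences:

tibble(text = "Thanks for the follow") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# expected bigrams: "thanks for", "for the", "the follow"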
# Backoff prediction: try the trigram table first, then bigrams, then
# fall back to the single most frequent unigram
predict_next_word <- function(input_text) {
  words <- tolower(str_split(str_trim(input_text), "\\s+")[[1]])  # split on any whitespace
  len <- length(words)
  # 1. Trigram: condition on the last two words
  if (len >= 2) {
    pred <- trigram_model %>%
      filter(w1 == words[len - 1], w2 == words[len]) %>%
      arrange(desc(n)) %>%
      pull(w3)
    if (length(pred) > 0) return(pred[1])
  }
  # 2. Bigram: condition on the last word
  if (len >= 1) {
    pred <- bigram_model %>%
      filter(w1 == words[len]) %>%
      arrange(desc(n)) %>%
      pull(w2)
    if (length(pred) > 0) return(pred[1])
  }
  # 3. Unigram: most frequent word overall
  return(unigram_model$word[1])
}
Tries trigram, then bigram, then unigram
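A hypothetical extension (not part of the original app) that returns several ranked candidates instead of one, using the same backoff order; predict_top_k and k are names introduced here only for illustration:

predict_top_k <- function(input_text, k = 3) {
  words <- tolower(str_split(str_trim(input_text), "\\s+")[[1]])
  len <- length(words)
  preds <- character(0)
  # Trigram candidates first
  if (len >= 2) {
    preds <- trigram_model %>%
      filter(w1 == words[len - 1], w2 == words[len]) %>%
      arrange(desc(n)) %>%
      pull(w3)
  }
  # Back off to bigram candidates if needed
  if (length(preds) < k && len >= 1) {
    more <- bigram_model %>%
      filter(w1 == words[len]) %>%
      arrange(desc(n)) %>%
      pull(w2)
    preds <- c(preds, setdiff(more, preds))
  }
  # Finally pad with the most frequent unigrams
  if (length(preds) < k) {
    preds <- c(preds, setdiff(unigram_model$word, preds))
  }
  head(preds, k)
}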
test_phrases <- c(
  "I love", "You are the", "This is so",
  "I believe", "When I was", "The most important",
  "The president said", "According to the", "In the event"
)

results <- map_chr(test_phrases, predict_next_word)
data.frame(Phrase = test_phrases, Prediction = results)
##                Phrase Prediction
## 1             I love         you
## 2        You are the        best
## 3         This is so     excited
## 4          I believe          in
## 5         When I was           a
## 6 The most important       great
## 7 The president said           i
## 8   According to the      movies
## 9       In the event           i
TF-IDF used for word importance analysis
Word2Vec explored for rare n-grams
Lookup tables compressed for deployment
Optional: Good-Turing smoothing (minimal sketch below)
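A minimal sketch of how Good-Turing adjusted counts could be derived from the unigram table, assuming the standard formula c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of distinct words seen exactly c times; this is an illustration, not the code used in the app:

# Frequency of frequencies: how many words occur exactly c times
freq_of_freq <- unigram_model %>%
  count(n, name = "N_c")

# Good-Turing adjusted counts; NA where N(c+1) is not observed
gt_counts <- freq_of_freq %>%
  left_join(freq_of_freq %>% transmute(n = n - 1, N_c_plus_1 = N_c), by = "n") %>%
  mutate(c_star = (n + 1) * N_c_plus_1 / N_c)
head(gt_counts)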
library(tictoc)
tic("Prediction")
predict_next_word("I can't wait")
## [1] "to"
toc()
## Prediction: 0.006 sec elapsed
# How many of the most frequent words are needed to cover 50% / 90%
# of all word occurrences in the sample?
total_words <- sum(unigram_model$n)

coverage_50 <- unigram_model %>%
  mutate(cum_sum = cumsum(n)) %>%
  filter(cum_sum < total_words * 0.5) %>%
  nrow()

coverage_90 <- unigram_model %>%
  mutate(cum_sum = cumsum(n)) %>%
  filter(cum_sum < total_words * 0.9) %>%
  nrow()

data.frame(coverage_50, coverage_90)
##   coverage_50 coverage_90
## 1         120        3674
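As an illustration of the "lookup tables compressed for deployment" point (the 90% threshold here is an assumption, not taken from the app), the vocabulary could be cut to the words needed for ~90% coverage and the n-gram tables pruned accordingly:

# Keep only the high-coverage vocabulary and the n-grams built from it
top_words <- unigram_model %>%
  slice_head(n = coverage_90) %>%   # coverage_90 computed above
  pull(word)

pruned_bigrams  <- bigram_model  %>% filter(w1 %in% top_words, w2 %in% top_words)
pruned_trigrams <- trigram_model %>% filter(w1 %in% top_words, w2 %in% top_words, w3 %in% top_words)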
Full presentation: RPubs link
Built with R, tidyverse, Shiny, and DT
Connect with me on LinkedIn!