In this document, we compare two approaches to handling n-grams in a natural language processing task: an exploratory frequency analysis and a simple backoff prediction model.
All analysis is based on a random sample of the Twitter dataset
"final/en_US/en_US.twitter.txt".
# Packages: tidyverse for data wrangling and plotting, tidytext for tokenization
library(tidyverse)
library(tidytext)

# Load the Twitter file and draw a reproducible 10,000-line sample
set.seed(123)
file_path <- "final/en_US/en_US.twitter.txt"
lines <- readLines(file_path, encoding = "UTF-8", skipNul = TRUE)
sample_lines <- sample(lines, size = 10000)
text_df <- tibble(line = seq_along(sample_lines), text = sample_lines)
# Tokenize the sample into bigrams and count their frequencies
bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigram_freq <- bigrams %>%
  count(bigram, sort = TRUE)

bigram_freq %>% head(10)
## # A tibble: 10 × 2
##    bigram         n
##    <chr>      <int>
##  1 in the       365
##  2 for the      307
##  3 of the       254
##  4 on the       205
##  5 to be        188
##  6 at the       185
##  7 to the       180
##  8 thanks for   166
##  9 i love       150
## 10 for a        148
# Visualize the 20 most frequent bigrams
bigram_freq %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(title = "Top 20 Bigrams (Exploratory)", x = "Bigram", y = "Frequency")
# Tokenize the sample into trigrams and count their frequencies
trigrams <- text_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

trigram_freq <- trigrams %>%
  count(trigram, sort = TRUE)

trigram_freq %>% head(10)
## # A tibble: 10 × 2
##    trigram                n
##    <chr>              <int>
##  1 <NA>                 278
##  2 thanks for the       106
##  3 looking forward to    39
##  4 i love you            37
##  5 i want to             35
##  6 for the follow        33
##  7 going to be           32
##  8 i need to             32
##  9 thank you for         29
## 10 is going to           27
The <NA> entry at the top of the trigram table comes from tweets that are too short to form a trigram; unnest_tokens() returns NA for those lines. In this section, we convert the n-gram frequency tables into predictive lookup models with backoff from trigram → bigram → unigram, dropping the NA rows along the way.
# Unigram model: overall word frequencies (used as the final fallback)
tokens <- text_df %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "^[a-z']+$"))

unigram_model <- tokens %>%
  count(word, sort = TRUE)
# Bigram model: for each leading word, keep its 5 most frequent followers
bigram_model <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE) %>%
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  group_by(w1) %>%
  slice_max(n, n = 5) %>%
  ungroup()
# Trigram model: for each two-word context, keep its 3 most frequent followers
trigram_model <- text_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE) %>%
  separate(trigram, into = c("w1", "w2", "w3"), sep = " ") %>%
  group_by(w1, w2) %>%
  slice_max(n, n = 3) %>%
  ungroup()
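As a quick sanity check, we can look at which followers were retained for a context that already showed up in the exploratory counts; the exact rows depend on the sample, so the output is not shown here.

# Spot-check the pruned trigram table for the context "thanks for"
trigram_model %>%
  filter(w1 == "thanks", w2 == "for") %>%
  arrange(desc(n))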
# Backoff prediction: try the trigram context, then the bigram context,
# then fall back to the most frequent unigram
predict_next_word <- function(input_text) {
  words <- tolower(str_split(str_squish(input_text), " ")[[1]])
  len <- length(words)

  # 1. Trigram lookup on the last two words
  if (len >= 2) {
    pred <- trigram_model %>%
      filter(w1 == words[len - 1], w2 == words[len]) %>%
      arrange(desc(n)) %>%
      pull(w3)
    if (length(pred) > 0) return(pred[1])
  }

  # 2. Back off to a bigram lookup on the last word
  if (len >= 1) {
    pred <- bigram_model %>%
      filter(w1 == words[len]) %>%
      arrange(desc(n)) %>%
      pull(w2)
    if (length(pred) > 0) return(pred[1])
  }

  # 3. Final fallback: the most frequent unigram
  return(unigram_model$word[1])
}
cat("Input: I love →", predict_next_word("I love"), "\n")
## Input: I love → you
cat("Input: How are →", predict_next_word("How are"), "\n")
## Input: How are → you
cat("Input: Thanks for →", predict_next_word("Thanks for"), "\n")
## Input: Thanks for → the
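When neither the trigram nor the bigram table contains the context, the function falls back to the most frequent unigram. A hypothetical out-of-vocabulary input illustrates this final backoff step (the nonsense token below is made up, so no bigram context will match):

# Hypothetical unseen context: no trigram or bigram match,
# so the most frequent unigram in the sample is returned
predict_next_word("xqzzv")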
This document compares two approaches:

- The exploratory analysis helps understand the structure and frequency of natural text.
- The backoff model transforms this structure into a usable algorithm that mimics how predictive text works in practice.
Next steps include packaging this prediction function into a Shiny interface.
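A minimal sketch of what such an interface could look like is shown below. The layout and widget names are placeholders, and it assumes that predict_next_word() and the three model tables from above are available in the app session.

# Minimal Shiny sketch (placeholder layout; assumes the models and
# predict_next_word() defined above are loaded in the app environment)
library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("user_text", "Type a phrase:", value = ""),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$user_text)) == 0) {
      "Waiting for input..."
    } else {
      paste("Predicted next word:", predict_next_word(input$user_text))
    }
  })
}

shinyApp(ui = ui, server = server)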