As noted earlier, a corpus is a body of text from which we build and test LMs.
rm(list = ls())
library(quanteda)
## Package version: 2.1.2
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
blogs <- "final/en_US/en_US.blogs.txt"
blog_line <- readLines(blogs,encoding="UTF-8", skipNul = TRUE)
news <- "final/en_US/en_US.news.txt"
news_line <- readLines(news,encoding="UTF-8", skipNul = TRUE)
twitter <- "final/en_US/en_US.twitter.txt"
twitter_line <- readLines(twitter,encoding="UTF-8", skipNul = TRUE)
.txt filesset.seed(42)
samplecorpus <- c(blog_line[sample(1:length(blog_line),length(blog_line)*0.25)],news_line[sample(1:length(news_line),length(news_line)*0.25)],twitter_line[sample(1:length(twitter_line),length(twitter_line)*0.25)])
blog_line,news_line and twitter_line files to free up memory since we don’t need them anymore.rm(list=c("blog_line","news_line","twitter_line"))
We will use the quanteda package to construct the n-gram tables. Many data scientists say it performs much faster than tm and RWeka for these types of tasks.
quanteda package.sample_corpus <- corpus(samplecorpus)
sample_corpus_tokens <- tokens(sample_corpus,what = "word",
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE,
split_hyphens = FALSE,
include_docvars = TRUE,
padding = FALSE)
sample_corpus_tokens <- tokens_tolower(sample_corpus_tokens)
sample_corpus_tokens <- tokens_wordstem(sample_corpus_tokens,
language = quanteda_options("language_stemmer"))
# sample_corpus_tokens <- tokens_select(sample_corpus_tokens, pattern = stopwords("en"), selection = "remove")
samplecorpusand sample_corpus files to free up memory since we don’t need them anymore.rm(list=c("samplecorpus","sample_corpus"))
dfm_sample_corpus_tokens <- dfm(sample_corpus_tokens)
unigrams_freq <- textstat_frequency(dfm_sample_corpus_tokens) # unigram frequency
unigrs <- subset(unigrams_freq,select=c(feature,frequency))
names(unigrs) <- c("ngram","freq")
unigrs <- as.data.table(unigrs)
bigrams <- dfm(tokens_ngrams(sample_corpus_tokens, n = 2))
bigrams_freq <- textstat_frequency(bigrams) # bigram frequency
bigrs <- subset(bigrams_freq,select=c(feature,frequency))
names(bigrs) <- c("ngram","freq")
bigrs <- as.data.table(bigrs)
trigrams <- dfm(tokens_ngrams(sample_corpus_tokens, n = 3))
trigrams_freq <- textstat_frequency(trigrams) # trigram frequency
trigrs <- subset(trigrams_freq,select=c(feature,frequency))
names(trigrs) <- c("ngram","freq")
trigrs <- as.data.table(trigrs)
head(unigrs);head(bigrs);head(trigrs)
## ngram freq
## 1: the 1191073
## 2: to 688894
## 3: and 603651
## 4: a 594822
## 5: of 501085
## 6: i 414593
## ngram freq
## 1: of_the 107170
## 2: in_the 103034
## 3: to_the 53450
## 4: for_the 50521
## 5: on_the 48929
## 6: to_be 41102
## ngram freq
## 1: one_of_the 8614
## 2: a_lot_of 7532
## 3: thank_for_the 6078
## 4: i_want_to 5372
## 5: to_be_a 4673
## 6: go_to_be 4413
dfm_sample_corpus_tokens, sample_corpus_tokens, and the n-gram frequency files to free up memory since we don’t need them anymore.rm(list=c("dfm_sample_corpus_tokens","sample_corpus_tokens","unigrams_freq","bigrams_freq","trigrams_freq","bigrams","trigrams"))
For this example, we’ll use \(\gamma_2 = \gamma_3 = 0.5\) for the purpose of illustration. In practice, these values would be obtained by cross-validation. A great treatment of cross-validation can be found in Chapter 5 of this (free) book: “An Introduction to Statistical Learning” by Gareth James, et al.
For this example, we’ll select the bigram: case of
The code below finds the observed trigrams starting with the selected bigram prefix and calculates their probabilities.
Recall that if we define \(\gamma_2\) to be the amount of discount taken from observed bigram counts, and \(\gamma_3\) the amount of discount taken from observed trigram counts, and \(c^∗\) to be the new discounted counts for observed bigrams and trigrams after applying the discount, then the backed off probability estimates would be written as:
\(q_{BO}(w_i|w_{i−1})=c^∗(w_{i−1},w)c(w_{i−1})\) for observed bigrams,
where \(c^∗(w_{i−1},w)=c(w_{i−1},w)−\gamma_2\),
and
\(q_{BO}(w_i|w_{i−2},w_{i−1})=c^∗(w_{i−2},w_{i−1},w)c(w_{i−2},w_{i−1})\) for observed trigrams,
where \(c^∗(w_{i−2},w_{i−1},w)=c(w_{i−2},w_{i−1},w)−\gamma_3\).
## Returns a two column data.frame of observed trigrams that start with the
## bigram prefix (bigPre) in the first column named ngram and
## frequencies/counts in the second column named freq. If no observed trigrams
## that start with bigPre exist, an empty data.frame is returned.
##
## bigPre - single-element char array of the form w2_w1 which are the first
## two words of the trigram we are predicting the tail word of
## trigrams - 2 column data.frame or data.table. The first column: ngram,
## contains all the trigrams in the corpus. The second column:
## freq, contains the frequency/count of each trigram.
getObsTrigs <- function(bigPre, trigrams) {
trigs.winA <- data.frame(ngrams=vector(mode = 'character', length = 0),
freq=vector(mode = 'integer', length = 0))
regex <- sprintf("%s%s%s", "^", bigPre, "_")
trigram_indices <- grep(regex, trigrams$ngram)
if(length(trigram_indices) > 0) {
trigs.winA <- trigrams[trigram_indices, ]
}
return(trigs.winA)
}
## Returns a two column data.frame of observed trigrams that start with bigram
## prefix bigPre in the first column named ngram and the probabilities
## q_bo(w_i | w_i-2, w_i-1) in the second column named prob calculated from
## eqn 12. If no observed trigrams starting with bigPre exist, NULL is returned.
##
## obsTrigs - 2 column data.frame or data.table. The first column: ngram,
## contains all the observed trigrams that start with the bigram
## prefix bigPre which we are attempting to the predict the next
## word of in a give phrase. The second column: freq, contains the
## frequency/count of each trigram.
## bigrs - 2 column data.frame or data.table. The first column: ngram,
## contains all the bigrams in the corpus. The second column:
## freq, contains the frequency/count of each bigram.
## bigPre - single-element char array of the form w2_w1 which are first two
## words of the trigram we are predicting the tail word of
## triDisc - amount to discount observed trigrams
getObsTriProbs <- function(obsTrigs, bigrs, bigPre, triDisc=0.5) {
if(nrow(obsTrigs) < 1) return(NULL)
obsCount <- subset(bigrs, ngram == bigPre)$freq[1]
obsTrigProbs <- mutate(obsTrigs, freq=((freq - triDisc) / obsCount))
colnames(obsTrigProbs) <- c("ngram", "prob")
return(obsTrigProbs)
}
gamma2 <- 0.5 # bigram discount
gamma3 <- 0.5 # trigram discount
bigPre <- 'case_of'
obs_trigrs <- getObsTrigs(bigPre, trigrs) # get trigrams and counts
# convert counts to probabilities
qbo_obs_trigrams <- getObsTriProbs(obs_trigrs, bigrs, bigPre, gamma3)
head(qbo_obs_trigrams)
## ngram prob
## 1: case_of_the 0.14897260
## 2: case_of_a 0.06906393
## 3: case_of_this 0.01541096
## 4: case_of_an 0.01541096
## 5: case_of_emerg 0.01426941
## 6: case_of_beer 0.01084475
This is the the most complex step as it involves backing off to the bigram level. Here is a breakdown of the sub-steps for these calculations:
## Returns a character vector which are the tail words of unobserved trigrams
## that start with the first two words of obsTrigs (aka the bigram prefix).
## These are the words w in the set B(w_i-2, w_i-1) as defined in the section
## describing the details of equation 17.
##
## obsTrigs - character vector of observed trigrams delimited by _ of the form:
## w3_w2_w1 where w3_w2 is the bigram prefix
## unigs - 2 column data.frame of all the unigrams in the corpus:
## ngram = unigram
## freq = frequency/count of each unigram
getUnobsTrigTails <- function(obsTrigs, unigs) {
obs_trig_tails <- str_split_fixed(obsTrigs, "_", 3)[, 3]
unobs_trig_tails <- unigs[!(unigs$ngram %in% obs_trig_tails), ]$ngram
return(unobs_trig_tails)
}
unobs_trig_tails <- getUnobsTrigTails(obs_trigrs$ngram, unigrs)
head(unobs_trig_tails)
## [1] "to" "and" "of" "i" "in" "is"
## Returns the total probability mass discounted from all observed bigrams
## calculated from equation 14. This is the amount of probability mass which
## is redistributed to UNOBSERVED bigrams. If no bigrams starting with
## unigram$ngram[1] exist, 0 is returned.
##
## unigram - single row, 2 column frequency table. The first column: ngram,
## contains the w_i-1 unigram (2nd word of the bigram prefix). The
## second column: freq, contains the frequency/count of this unigram.
## bigrams - 2 column data.frame or data.table. The first column: ngram,
## contains all the bigrams in the corpus. The second column:
## freq, contains the frequency or count of each bigram.
## bigDisc - amount to discount observed bigrams
getAlphaBigram <- function(unigram, bigrams, bigDisc=0.5) {
# get all bigrams that start with unigram
regex <- sprintf("%s%s%s", "^", unigram$ngram[1], "_")
bigsThatStartWithUnig <- bigrams[grep(regex, bigrams$ngram),]
if(nrow(bigsThatStartWithUnig) < 1) return(0)
alphaBi <- 1 - (sum(bigsThatStartWithUnig$freq - bigDisc) / unigram$freq)
return(alphaBi)
}
unig <- str_split(bigPre, "_")[[1]][2]
unig <- unigrs[unigrs$ngram == unig,]
alpha_big <- getAlphaBigram(unig, bigrs, gamma2)
alpha_big
## [1] 0.03618548
The code below calculates \(q_{BO}(w_i|w_{i−1})\) for observed and unobserved bigrams:
## Returns a character vector of backed off bigrams of the form w2_w1. These
## are all the (w_i-1, w) bigrams where w_i-1 is the tail word of the bigram
## prefix bigPre and w are the tail words of unobserved bigrams that start with
## w_i-1.
##
## bigPre - single-element char array of the form w2_w1 which are first two
## words of the trigram we are predicting the tail word of
## unobsTrigTails - character vector that are tail words of unobserved trigrams
getBOBigrams <- function(bigPre, unobsTrigTails) {
w_i_minus1 <- str_split(bigPre, "_")[[1]][2]
boBigrams <- paste(w_i_minus1, unobsTrigTails, sep = "_")
return(boBigrams)
}
## Returns a two column data.frame of backed-off bigrams in the first column
## named ngram and their frequency/counts in the second column named freq.
##
## bigPre - single-element char array of the form w2_w1 which are first two
## words of the trigram we are predicting the tail word of
## unobsTrigTails - character vector that are tail words of unobserved trigrams
## bigrs - 2 column data.frame or data.table. The first column: ngram,
## contains all the bigrams in the corpus. The second column:
## freq, contains the frequency/count of each bigram.
getObsBOBigrams <- function(bigPre, unobsTrigTails, bigrs) {
boBigrams <- getBOBigrams(bigPre, unobsTrigTails)
obs_bo_bigrams <- bigrs[bigrs$ngram %in% boBigrams, ]
return(obs_bo_bigrams)
}
## Returns a character vector of backed-off bigrams which are unobserved.
##
## bigPre - single-element char array of the form w2_w1 which are first two
## words of the trigram we are predicting the tail word of
## unobsTrigTails - character vector that are tail words of unobserved trigrams
## obsBOBigram - data.frame which contains the observed bigrams in a column
## named ngram
getUnobsBOBigrams <- function(bigPre, unobsTrigTails, obsBOBigram) {
boBigrams <- getBOBigrams(bigPre, unobsTrigTails)
unobs_bigs <- boBigrams[!(boBigrams %in% obsBOBigram$ngram)]
return(unobs_bigs)
}
## Returns a dataframe of 2 columns: ngram and probs. Values in the ngram
## column are bigrams of the form: w2_w1 which are observed as the last
## two words in unobserved trigrams. The values in the prob column are
## q_bo(w1 | w2) calculated from from equation 10.
##
## obsBOBigrams - a dataframe with 2 columns: ngram and freq. The ngram column
## contains bigrams of the form w1_w2 which are observed bigrams
## that are the last 2 words of unobserved trigrams (i.e. "backed
## off" bigrams). The freq column contains integers that are
## the counts of these observed bigrams in the corpus.
## unigs - 2 column data.frame of all the unigrams in the corpus:
## ngram = unigram
## freq = frequency/count of each unigram
## bigDisc - amount to discount observed bigrams
getObsBigProbs <- function(obsBOBigrams, unigs, bigDisc=0.5) {
first_words <- str_split_fixed(obsBOBigrams$ngram, "_", 2)[, 1]
first_word_freqs <- unigs[unigs$ngram %in% first_words, ]
obsBigProbs <- (obsBOBigrams$freq - bigDisc) / first_word_freqs$freq
obsBigProbs <- data.frame(ngram=obsBOBigrams$ngram, prob=obsBigProbs)
return(obsBigProbs)
}
## Returns a dataframe of 2 columns: ngram and prob. Values in the ngram
## column are unobserved bigrams of the form: w2_w1. The values in the prob
## column are the backed off probability estimates q_bo(w1 | w2) calculated
## from from equation 16.
##
## unobsBOBigrams - character vector of unobserved backed off bigrams
## unigs - 2 column data.frame of all the unigrams in the corpus:
## ngram = unigram
## freq = frequency/count of each unigram
## alphaBig - total discounted probability mass at the bigram level
getQboUnobsBigrams <- function(unobsBOBigrams, unigs, alphaBig) {
# get the unobserved bigram tails
qboUnobsBigs <- str_split_fixed(unobsBOBigrams, "_", 2)[, 2]
w_in_Aw_iminus1 <- unigs[!(unigs$ngram %in% qboUnobsBigs), ]
# convert to data.frame with counts
qboUnobsBigs <- unigs[unigs$ngram %in% qboUnobsBigs, ]
denom <- sum(qboUnobsBigs$freq)
# converts counts to probabilities
qboUnobsBigs <- data.frame(ngram=unobsBOBigrams,
prob=(alphaBig * qboUnobsBigs$freq / denom))
return(qboUnobsBigs)
}
# Get backed off bigrams
bo_bigrams <- getBOBigrams(bigPre, unobs_trig_tails)
# Separate bigrams into observed and unobserved using the appropriate equations
obs_bo_bigrams <- getObsBOBigrams(bigPre, unobs_trig_tails, bigrs)
unobs_bo_bigrams <- getUnobsBOBigrams(bigPre, unobs_trig_tails, obs_bo_bigrams)
# Calculate observed bigram probabilities
qbo_obs_bigrams <- getObsBigProbs(obs_bo_bigrams, unigrs, gamma2) #ngram probs
# Calculate alpha_big and unobserved bigram probabilities
unig <- str_split(bigPre, "_")[[1]][2]
unig <- unigrs[unigrs$ngram == unig,]
# Distribute discounted bigram probability mass to unobserved bigrams in proportion to unigram ML
qbo_unobs_bigrams <- getQboUnobsBigrams(unobs_bo_bigrams, unigrs, alpha_big)
qbo_bigrams <- rbind(qbo_obs_bigrams, qbo_unobs_bigrams)
head(qbo_bigrams)
## ngram prob
## 1 of_their 0.009506371
## 2 of_your 0.008853787
## 3 of_all 0.007331092
## 4 of_you 0.007175429
## 5 of_peopl 0.005381323
## 6 of_time 0.004443358
Before doing the final calculations for the unobserved trigrams, let’s do a simple check on our calculations at the bigram level. In the previous table, all the bigrams are unobserved which means that if we sum all the unobserved bigram probabilities, we should get the total bigram discount which is \(\alpha(w_{i−1})\). As we see below, this looks like it checks out.
unobs <- qbo_bigrams[-1,]
sum(unobs$prob)
## [1] 0.5111907
The getAlphaTrigram function shown below implements the following equation:
to compute the trigram discount. Here we use it to compute the trigram discount for \(q_{BO}(house|sell,the)\):
## Returns the total probability mass discounted from all observed trigrams.
## This is the amount of probability mass which is
## redistributed to UNOBSERVED trigrams. If no trigrams starting with
## bigram$ngram[1] exist, 1 is returned.
##
## obsTrigs - 2 column data.frame or data.table. The first column: ngram,
## contains all the observed trigrams that start with the bigram
## prefix we are attempting to the predict the next word of. The
## second column: freq, contains the frequency/count of each trigram.
## bigram - single row frequency table where the first col: ngram, is the bigram
## which are the first two words of unobserved trigrams we want to
## estimate probabilities of (same as bigPre in other functions listed
## prior) delimited with '_'. The second column: freq, is the
## frequency/count of the bigram listed in the ngram column.
## triDisc - amount to discount observed trigrams
getAlphaTrigram <- function(obsTrigs, bigram, triDisc=0.5) {
if(nrow(obsTrigs) < 1) return(1)
alphaTri <- 1 - sum((obsTrigs$freq - triDisc) / bigram$freq[1])
return(alphaTri)
}
bigram <- bigrs[bigrs$ngram %in% bigPre, ]
alpha_trig <- getAlphaTrigram(obs_trigrs, bigram, gamma3)
alpha_trig
## [1] 0.298516
## Returns a dataframe of 2 columns: ngram and prob. Values in the ngram
## column are unobserved trigrams of the form: w3_w2_w1. The values in the prob
## column are q_bo(w1 | w3, w2) calculated from equation 17.
##
## bigPre - single-element char array of the form w2_w1 which are first two
## words of the trigram we are predicting the tail word of
## qboObsBigrams - 2 column data.frame with the following columns -
## ngram: observed bigrams of the form w2_w1
## probs: the probability estimate for observed bigrams:
## qbo(w1 | w2) calc'd from equation 10.
## qboUnobsBigrams - 2 column data.frame with the following columns -
## ngram: unobserved bigrams of the form w2_w1
## probs: the probability estimate for unobserved bigrams
## qbo(w1 | w2) calc'd from equation 16.
## alphaTrig - total discounted probability mass at the trigram level
getUnobsTriProbs <- function(bigPre, qboObsBigrams,
qboUnobsBigrams, alphaTrig) {
qboBigrams <- rbind(qboObsBigrams, qboUnobsBigrams)
qboBigrams <- qboBigrams[order(-qboBigrams$prob), ]
sumQboBigs <- sum(qboBigrams$prob)
first_bigPre_word <- str_split(bigPre, "_")[[1]][1]
unobsTrigNgrams <- paste(first_bigPre_word, qboBigrams$ngram, sep="_")
unobsTrigProbs <- alphaTrig * qboBigrams$prob / sumQboBigs
unobsTrigDf <- data.frame(ngram=unobsTrigNgrams, prob=unobsTrigProbs)
return(unobsTrigDf)
}
qbo_unobs_trigrams <- getUnobsTriProbs(bigPre, qbo_obs_bigrams,
qbo_unobs_bigrams, alpha_trig)
head(qbo_unobs_trigrams)
## ngram prob
## 1 case_of_their 0.005450009
## 2 case_of_your 0.005075882
## 3 case_of_all 0.004202920
## 4 case_of_you 0.004113678
## 5 case_of_peopl 0.003085116
## 6 case_of_time 0.002547380
We’ve done all the calculations required to make our prediction. These are summarized in the table below:
getPredictionMsg <- function(qbo_trigs) {
# pull off tail word of highest prob trigram
prediction <- str_split(qbo_trigs$ngram[1], "_")[[1]][3]
result <- sprintf("%s%s%s%.4f", "highest prob prediction is >>> ", prediction," <<< which has probability = ", qbo_trigs$prob[1])
return(result)
}
qbo_trigrams <- rbind(qbo_obs_trigrams, qbo_unobs_trigrams)
qbo_trigrams <- qbo_trigrams[order(-qbo_trigrams$prob), ] # sort by desc prob
out_msg <- getPredictionMsg(qbo_trigrams)
out_msg
## [1] "highest prob prediction is >>> the <<< which has probability = 0.1490"
A good data scientist would investigate his/her results further, especially if they have not worked with a particular algorithm before. A simple first check might be to test whether all the \(q_{BO}(w_i|w_{i−2},w_{i−1})\) values sum to 1:
sum(qbo_trigrams$prob)
## [1] 1
That looks O.K., so let’s explore something more interesting to see if we can deepen our understanding.
Let’s redo the above calculations with increased discount rates at both bigram and trigram levels. If we increase our discount rates from 0.5 to 0.7, what happens? If we set \(\gamma_2 = \gamma_3 = 0.7\), these are the results we get:
gamma2=0.7; gamma3=0.7 # initialize new discount rates
obs_trigs <- getObsTrigs(bigPre, trigrs)
unobs_trig_tails <- getUnobsTrigTails(obs_trigs$ngram, unigrs)
bo_bigrams <- getBOBigrams(bigPre, unobs_trig_tails)
# Separate bigrams into observed and unobserved using the appropriate equations
obs_bo_bigrams <- getObsBOBigrams(bigPre, unobs_trig_tails, bigrs)
unobs_bo_bigrams <- getUnobsBOBigrams(bigPre, unobs_trig_tails, obs_bo_bigrams)
# Calculate observed bigram probabilites
qbo_obs_bigrams <- getObsBigProbs(obs_bo_bigrams, unigrs, gamma2)
# Calculate alpha_big and unobserved bigram probabilities
unig <- str_split(bigPre, "_")[[1]][2]
unig <- unigrs[unigrs$ngram == unig,]
alpha_big <- getAlphaBigram(unig, bigrs, gamma2)
# Distribute discounted bigram probability mass to unobserved bigrams in proportion to unigram ML
qbo_unobs_bigrams <- getQboUnobsBigrams(unobs_bo_bigrams, unigrs, alpha_big)
# Calculate observed trigram probabilities...
qbo_obs_trigrams <- getObsTriProbs(obs_trigs, bigrs, bigPre, gamma3)
# Finally, calculate unobserved trigram probabilities...
bigram <- bigrs[bigrs$ngram %in% bigPre, ]
alpha_trig <- getAlphaTrigram(obs_trigs, bigram, gamma3)
qbo_unobs_trigrams <- getUnobsTriProbs(bigPre, qbo_obs_bigrams,
qbo_unobs_bigrams, alpha_trig)
qbo_trigrams <- rbind(qbo_obs_trigrams, qbo_unobs_trigrams)
qbo_trigrams <- qbo_trigrams[order(-qbo_trigrams$prob), ]
getPredictionMsg(qbo_trigrams)
## [1] "highest prob prediction is >>> the <<< which has probability = 0.1487"