Example of Applying the Algorithm: The Little Corpus That Could

As noted earlier, a corpus is a body of text from which we build and test LMs. To illustrate how the mathematical formulation of the KBO Trigram model works, it’s helpful to look at a simple corpus that is small enough to easily keep track of the n-gram counts, but large enough to illustrate the impact of unobserved n-grams on the calculations.

rm(list = ls())
library(quanteda)
## Package version: 2.1.2
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

ltcorpus <- readLines("little_test_corpus1.txt")
## Warning in readLines("little_test_corpus1.txt"): incomplete final line found on
## 'little_test_corpus1.txt'
ltcorpus
## [1] "SOS buy the book EOS"    "SOS buy the book EOS"   
## [3] "SOS buy the book EOS"    "SOS buy the book EOS"   
## [5] "SOS sell the book EOS"   "SOS buy the house EOS"  
## [7] "SOS buy the house EOS"   "SOS paint the house EOS"

In this corpus, SOS and EOS are tokens used to denote start of sentence and end-of-sentence.

Step 1. i. Unigram, Bigram and Trigram counts

This work used the quanteda package written by Ken Benoit and Paul Nulty to construct the n-gram tables. Many data scientists say it performs much faster than tm and RWeka for these types of tasks and I tend to agree.

Get corpus of words and frequency of n-grams from text file…

lt_corpus <-  corpus(ltcorpus)
lt_corpus_tokens <- tokens(lt_corpus)

dfm_lt_corpus_tokens <- dfm(lt_corpus_tokens)
unigrams_freq <- textstat_frequency(dfm_lt_corpus_tokens)
unigrs <- subset(unigrams_freq,select=c(feature,frequency))
names(unigrs) <- c("ngram","freq")
unigrs <- as.data.table(unigrs)

bigrams <- dfm(tokens_ngrams(lt_corpus_tokens, n = 2))
bigrams_freq <- textstat_frequency(bigrams)
bigrs <- subset(bigrams_freq,select=c(feature,frequency))
names(bigrs) <- c("ngram","freq")
bigrs <- as.data.table(bigrs)

trigrams <- dfm(tokens_ngrams(lt_corpus_tokens, n = 3))
trigrams_freq <- textstat_frequency(trigrams)
trigrs <- subset(trigrams_freq,select=c(feature,frequency))
names(trigrs) <- c("ngram","freq")
trigrs <- as.data.table(trigrs)

unigrs;bigrs;trigrs
##    ngram freq
## 1:   sos    8
## 2:   the    8
## 3:   eos    8
## 4:   buy    6
## 5:  book    5
## 6: house    3
## 7:  sell    1
## 8: paint    1
##         ngram freq
##  1:   sos_buy    6
##  2:   buy_the    6
##  3:  the_book    5
##  4:  book_eos    5
##  5: the_house    3
##  6: house_eos    3
##  7:  sos_sell    1
##  8:  sell_the    1
##  9: sos_paint    1
## 10: paint_the    1
##              ngram freq
## 1:     sos_buy_the    6
## 2:    the_book_eos    5
## 3:    buy_the_book    4
## 4:   the_house_eos    3
## 5:   buy_the_house    2
## 6:    sos_sell_the    1
## 7:   sos_paint_the    1
## 8: paint_the_house    1
## 9:   sell_the_book    1

Step 1. ii. Selecting bigram and trigram discounts

For this example, we’ll use \(\gamma_2 = \gamma_3 = 0.5\) for the purpose of illustration. In practice, these values would be obtained by cross-validation. A great treatment of cross-validation can be found in Chapter 5 of this (free) book: “An Introduction to Statistical Learning” by Gareth James, et al.

Step 2. Select Bigram Prefix of Word to be Predicted

For this example, we’ll select the bigram: sell the

Step 3. Calculate Probabilities of Words Completing Observed Trigrams

The code below finds the observed trigrams starting with the selected bigram prefix and calculates their probabilities. In our simple example, we can look at the table of trigrams above and see that there is only one trigram that starts with sell the which is sell the book.

Recall that if we define \(\gamma_2\) to be the amount of discount taken from observed bigram counts, and \(\gamma_3\) the amount of discount taken from observed trigram counts, and \(c^∗\) to be the new discounted counts for observed bigrams and trigrams after applying the discount, then the backed off probability estimates would be written as:

\(q_{BO}(w_i|w_{i−1})=c^∗(w_{i−1},w)c(w_{i−1})\) for observed bigrams,

where \(c^∗(w_{i−1},w)=c(w_{i−1},w)−\gamma_2\),

and

\(q_{BO}(w_i|w_{i−2},w_{i−1})=c^∗(w_{i−2},w_{i−1},w)c(w_{i−2},w_{i−1})\) for observed trigrams,

where \(c^∗(w_{i−2},w_{i−1},w)=c(w_{i−2},w_{i−1},w)−\gamma_3\).

Applying these equations, we get \(q_{BO}(book|sell,the)=(1−0.5)/1=0.5\) which is also the result provided from the code below.

## Returns a two column data.frame of observed trigrams that start with the
## bigram prefix (bigPre) in the first column named ngram and
## frequencies/counts in the second column named freq. If no observed trigrams
## that start with bigPre exist, an empty data.frame is returned.
##
## bigPre -  single-element char array of the form w2_w1 which are the first 
##           two words of the trigram we are predicting the tail word of
## trigrams - 2 column data.frame or data.table. The first column: ngram,
##            contains all the trigrams in the corpus. The second column:
##            freq, contains the frequency/count of each trigram.
getObsTrigs <- function(bigPre, trigrams) {
    trigs.winA <- data.frame(ngrams=vector(mode = 'character', length = 0),
                             freq=vector(mode = 'integer', length = 0))
    regex <- sprintf("%s%s%s", "^", bigPre, "_")
    trigram_indices <- grep(regex, trigrams$ngram)
    if(length(trigram_indices) > 0) {
        trigs.winA <- trigrams[trigram_indices, ]
    }
    
    return(trigs.winA)
}
## Returns a two column data.frame of observed trigrams that start with bigram
## prefix bigPre in the first column named ngram and the probabilities
## q_bo(w_i | w_i-2, w_i-1) in the second column named prob calculated from
## eqn 12. If no observed trigrams starting with bigPre exist, NULL is returned.
##
## obsTrigs - 2 column data.frame or data.table. The first column: ngram,
##            contains all the observed trigrams that start with the bigram
##            prefix bigPre which we are attempting to the predict the next
##            word of in a give phrase. The second column: freq, contains the
##            frequency/count of each trigram.
## bigrs - 2 column data.frame or data.table. The first column: ngram,
##         contains all the bigrams in the corpus. The second column:
##         freq, contains the frequency/count of each bigram.
## bigPre -  single-element char array of the form w2_w1 which are first two
##           words of the trigram we are predicting the tail word of
## triDisc - amount to discount observed trigrams
getObsTriProbs <- function(obsTrigs, bigrs, bigPre, triDisc=0.5) {
    if(nrow(obsTrigs) < 1) return(NULL)
    obsCount <- subset(bigrs, ngram == bigPre)$freq[1]
    obsTrigProbs <- mutate(obsTrigs, freq=((freq - triDisc) / obsCount))
    colnames(obsTrigProbs) <- c("ngram", "prob")
    
    return(obsTrigProbs)
}
gamma2 <- 0.5  # bigram discount
gamma3 <- 0.5  # trigram discount
bigPre <- 'sell_the'

obs_trigrs <- getObsTrigs(bigPre, trigrs)  # get trigrams and counts
# convert counts to probabilities
qbo_obs_trigrams <- getObsTriProbs(obs_trigrs, bigrs, bigPre, gamma3)
qbo_obs_trigrams
##            ngram prob
## 1: sell_the_book  0.5

Step 4. Calculate Probabilities of Words Completing Unobserved Trigrams

This is the the most complex step as it involves backing off to the bigram level. Here is a breakdown of the sub-steps for these calculations:

  1. Find all the words that complete unobserved trigrams. These are the words in the set \(w\:\in\:\mathcal{B}(w_{i-2},\:w_{i-1})\) described earlier.  

 

  1. Calculate \(\alpha(w_{i-1}) = \left[1 - \sum\limits_{w\:\in\:\mathcal{A}(w_{i-1})} \frac{c^*(w_{i-1},\:w)}{c(w_{i-1})}\:\:\:\:\right]\)  

 

  1. Calculate \(q_{BO}\) for each bigram in the denominator of the following equation:
          \(q_{BO}(w_i\:|\:w_{i-2},\:w_{i-1}) = \alpha(w_{i-2},\:w_{i-1})\frac{q_{BO}(w_i\:|\:w_{i-1})}{\sum\limits_{w\:\in\:\mathcal{B}(w_{i-2},\:w_{i-1})}q_{BO}(w\:|\:w_{i-1})}\)

 

 

          using \(q_{BO}(w_i\:|\:w_{i-1}) = \frac{c^*(w_{i-1},\:w)}{c(w_{i-1})}\:\:\:\:\) if the bigram is observed, or

 

 

          \(q_{BO}(w_i\:|\:w_{i-1}) = \alpha(w_{i-1})\frac{q_{ML}(w_i)}{\sum\limits_{w\:\in\:\mathcal{B}(w_{i-1})}q_{ML}(w)} = \alpha(w_{i-1})\frac{c(w_i)}{\sum\limits_{w\:\in\:\mathcal{B}(w_{i-1})}c(w)}\) if it is unobserved.

 

 

  1. Calculate \(\alpha(w_{i−2},w_{i−1}) = \left [ 1 - \sum\limits_{w\:\in\:\mathcal{A}(w_{i-2},\:w_{i-1})} \frac{c^*(w_{i-2},\:w_{i-1},\:w)}{c(w_{i-2},\:w_{i-1})}\:\:\:\:\right]\)  

 

  1. Calculate \(q_{BO}(w_i\:|\:w_{i-2},\:w_{i-1}) = \alpha(w_{i-2},\:w_{i-1})\frac{q_{BO}(w_i\:|\:w_{i-1})}{\sum\limits_{w\:\in\:\mathcal{B}(w_{i-2},\:w_{i-1})}q_{BO}(w\:|\:w_{i-1})}\) for each \(w_i\)  

Step 4. i. Find unobserved trigram tail words:

## Returns a character vector which are the tail words of unobserved trigrams
## that start with the first two words of obsTrigs (aka the bigram prefix).
## These are the words w in the set B(w_i-2, w_i-1) as defined in the section
## describing the details of equation 17.
##
## obsTrigs - character vector of observed trigrams delimited by _ of the form:
##            w3_w2_w1 where w3_w2 is the bigram prefix
## unigs - 2 column data.frame of all the unigrams in the corpus:
##         ngram = unigram
##         freq = frequency/count of each unigram

getUnobsTrigTails <- function(obsTrigs, unigs) {
    obs_trig_tails <- str_split_fixed(obsTrigs, "_", 3)[, 3]
    unobs_trig_tails <- unigs[!(unigs$ngram %in% obs_trig_tails), ]$ngram
    return(unobs_trig_tails)
}
unobs_trig_tails <- getUnobsTrigTails(obs_trigrs$ngram, unigrs)
unobs_trig_tails
## [1] "sos"   "the"   "eos"   "buy"   "house" "sell"  "paint"

Step 4. ii. Calculate discounted probability mass at the bigram level \(\alpha(w_{i−1})\):

## Returns the total probability mass discounted from all observed bigrams
## calculated from equation 14.  This is the amount of probability mass which
## is redistributed to UNOBSERVED bigrams. If no bigrams starting with
## unigram$ngram[1] exist, 0 is returned.
##
## unigram - single row, 2 column frequency table. The first column: ngram,
##           contains the w_i-1 unigram (2nd word of the bigram prefix). The
##           second column: freq, contains the frequency/count of this unigram.
## bigrams - 2 column data.frame or data.table. The first column: ngram,
##           contains all the bigrams in the corpus. The second column:
##           freq, contains the frequency or count of each bigram.
## bigDisc - amount to discount observed bigrams

getAlphaBigram <- function(unigram, bigrams, bigDisc=0.5) {
    # get all bigrams that start with unigram
    regex <- sprintf("%s%s%s", "^", unigram$ngram[1], "_")
    bigsThatStartWithUnig <- bigrams[grep(regex, bigrams$ngram),]
    if(nrow(bigsThatStartWithUnig) < 1) return(0)
    alphaBi <- 1 - (sum(bigsThatStartWithUnig$freq - bigDisc) / unigram$freq)
    
    return(alphaBi)
}
unig <- str_split(bigPre, "_")[[1]][2]
unig <- unigrs[unigrs$ngram == unig,]
alpha_big <- getAlphaBigram(unig, bigrs, gamma2)
alpha_big
## [1] 0.125

Step 4. iii. Calculate backed off probabilities \(q_{BO}\) for bigrams

The code below calculates \(q_{BO}(w_i|w_{i−1})\) for observed and unobserved bigrams:

## Returns a character vector of backed off bigrams of the form w2_w1. These 
## are all the (w_i-1, w) bigrams where w_i-1 is the tail word of the bigram
## prefix bigPre and w are the tail words of unobserved bigrams that start with
## w_i-1.
##
## bigPre - single-element char array of the form w2_w1 which are first two
##          words of the trigram we are predicting the tail word of
## unobsTrigTails - character vector that are tail words of unobserved trigrams
getBOBigrams <- function(bigPre, unobsTrigTails) {
    w_i_minus1 <- str_split(bigPre, "_")[[1]][2]
    boBigrams <- paste(w_i_minus1, unobsTrigTails, sep = "_")
    return(boBigrams)
}
## Returns a two column data.frame of backed-off bigrams in the first column
## named ngram and their frequency/counts in the second column named freq.
## 
## bigPre -  single-element char array of the form w2_w1 which are first two
##           words of the trigram we are predicting the tail word of
## unobsTrigTails - character vector that are tail words of unobserved trigrams
## bigrs - 2 column data.frame or data.table. The first column: ngram,
##         contains all the bigrams in the corpus. The second column:
##         freq, contains the frequency/count of each bigram.
getObsBOBigrams <- function(bigPre, unobsTrigTails, bigrs) {
    boBigrams <- getBOBigrams(bigPre, unobsTrigTails)
    obs_bo_bigrams <- bigrs[bigrs$ngram %in% boBigrams, ]
    return(obs_bo_bigrams)
}
## Returns a character vector of backed-off bigrams which are unobserved.
##
## bigPre -  single-element char array of the form w2_w1 which are first two
##           words of the trigram we are predicting the tail word of
## unobsTrigTails - character vector that are tail words of unobserved trigrams
## obsBOBigram - data.frame which contains the observed bigrams in a column
##               named ngram
getUnobsBOBigrams <- function(bigPre, unobsTrigTails, obsBOBigram) {
    boBigrams <- getBOBigrams(bigPre, unobsTrigTails)
    unobs_bigs <- boBigrams[!(boBigrams %in% obsBOBigram$ngram)]
    return(unobs_bigs)
}
## Returns a dataframe of 2 columns: ngram and probs.  Values in the ngram
## column are bigrams of the form: w2_w1 which are observed as the last
## two words in unobserved trigrams.  The values in the prob column are
## q_bo(w1 | w2) calculated from from equation 10.
##
## obsBOBigrams - a dataframe with 2 columns: ngram and freq. The ngram column
##                contains bigrams of the form w1_w2 which are observed bigrams
##                that are the last 2 words of unobserved trigrams (i.e. "backed
##                off" bigrams). The freq column contains integers that are
##                the counts of these observed bigrams in the corpus.
## unigs - 2 column data.frame of all the unigrams in the corpus:
##         ngram = unigram
##         freq = frequency/count of each unigram
## bigDisc - amount to discount observed bigrams
getObsBigProbs <- function(obsBOBigrams, unigs, bigDisc=0.5) {
    first_words <- str_split_fixed(obsBOBigrams$ngram, "_", 2)[, 1]
    first_word_freqs <- unigs[unigs$ngram %in% first_words, ]
    obsBigProbs <- (obsBOBigrams$freq - bigDisc) / first_word_freqs$freq
    obsBigProbs <- data.frame(ngram=obsBOBigrams$ngram, prob=obsBigProbs)
    
    return(obsBigProbs)
}
## Returns a dataframe of 2 columns: ngram and prob.  Values in the ngram
## column are unobserved bigrams of the form: w2_w1.  The values in the prob
## column are the backed off probability estimates q_bo(w1 | w2) calculated
## from from equation 16.
##
## unobsBOBigrams - character vector of unobserved backed off bigrams
## unigs - 2 column data.frame of all the unigrams in the corpus:
##         ngram = unigram
##         freq = frequency/count of each unigram
## alphaBig - total discounted probability mass at the bigram level
getQboUnobsBigrams <- function(unobsBOBigrams, unigs, alphaBig) {
    # get the unobserved bigram tails
    qboUnobsBigs <- str_split_fixed(unobsBOBigrams, "_", 2)[, 2]
    w_in_Aw_iminus1 <- unigs[!(unigs$ngram %in% qboUnobsBigs), ]
    # convert to data.frame with counts
    qboUnobsBigs <- unigs[unigs$ngram %in% qboUnobsBigs, ]
    denom <- sum(qboUnobsBigs$freq)
    # converts counts to probabilities
    qboUnobsBigs <- data.frame(ngram=unobsBOBigrams,
                               prob=(alphaBig * qboUnobsBigs$freq / denom))
    
    return(qboUnobsBigs)
}
# get backed off bigrams
bo_bigrams <- getBOBigrams(bigPre, unobs_trig_tails)  
# separate bigrams which use eqn 10 and those that use 16
obs_bo_bigrams <- getObsBOBigrams(bigPre, unobs_trig_tails, bigrs)
unobs_bo_bigrams <- getUnobsBOBigrams(bigPre, unobs_trig_tails, obs_bo_bigrams)
# unobs_bo_bigrams = c("the_buy", "the_EOS", "the_paint", "the_sell", "the_the")
# calc obs'd bigram prob's from eqn 10
qbo_obs_bigrams <- getObsBigProbs(obs_bo_bigrams, unigrs, gamma2) #ngram     probs
# calc alpha_big & unobs'd bigram prob's from eqn 16             #the_house 0.3125
unig <- str_split(bigPre, "_")[[1]][2]
unig <- unigrs[unigrs$ngram == unig,]
# distrib discounted bigram prob mass to unobs bigrams in prop to unigram ML
qbo_unobs_bigrams <- getQboUnobsBigrams(unobs_bo_bigrams, unigrs, alpha_big)
qbo_bigrams <- rbind(qbo_obs_bigrams, qbo_unobs_bigrams)
qbo_bigrams
##       ngram       prob
## 1 the_house 0.31250000
## 2   the_sos 0.03125000
## 3   the_the 0.03125000
## 4   the_eos 0.03125000
## 5   the_buy 0.02343750
## 6  the_sell 0.00390625
## 7 the_paint 0.00390625

Checking the Bigram Calculations

Before doing the final calculations for the unobserved trigrams, let’s do a simple check on our calculations at the bigram level. In the previous table, all the bigrams except the_house are unobserved which means that if we sum all the unobserved bigram probabilities, we should get the total bigram discount which is \(\alpha(w_{i−1})\). As we see below, this looks like it checks out.

unobs <- qbo_bigrams[-1,]
sum(unobs$prob)
## [1] 0.125

Step 4. iv. Calculate discounted probability mass at the trigram level \(\alpha(w_{i−2},w_{i−1})\)

The getAlphaTrigram function shown below implements the following equation:  

           \(\alpha(w_{i-2},\:w_{i-1}) =\left[ 1 - \sum\limits_{w\:\in\:\mathcal{A}(w_{i-2},\:w_{i-1})} \frac{c^*(w_{i-2},\:w_{i-1},\:w)}{c(w_{i-2},\:w_{i-1})}\:\:\:\:\right]\)

 

to compute the trigram discount. Here we use it to compute the trigram discount for \(q_{BO}(house|sell,the)\):

## Returns the total probability mass discounted from all observed trigrams.
## This is the amount of probability mass which is
## redistributed to UNOBSERVED trigrams. If no trigrams starting with
## bigram$ngram[1] exist, 1 is returned.
##
## obsTrigs - 2 column data.frame or data.table. The first column: ngram,
##            contains all the observed trigrams that start with the bigram
##            prefix we are attempting to the predict the next word of. The 
##            second column: freq, contains the frequency/count of each trigram.
## bigram - single row frequency table where the first col: ngram, is the bigram
##          which are the first two words of unobserved trigrams we want to
##          estimate probabilities of (same as bigPre in other functions listed
##          prior) delimited with '_'. The second column: freq, is the
##          frequency/count of the bigram listed in the ngram column.
## triDisc - amount to discount observed trigrams
getAlphaTrigram <- function(obsTrigs, bigram, triDisc=0.5) {
    if(nrow(obsTrigs) < 1) return(1)
    alphaTri <- 1 - sum((obsTrigs$freq - triDisc) / bigram$freq[1])
    
    return(alphaTri)
}
bigram <- bigrs[bigrs$ngram %in% bigPre, ]
alpha_trig <- getAlphaTrigram(obs_trigrs, bigram, gamma3)
alpha_trig
## [1] 0.5

Step 4. v. Calculate unobserved trigram probabilities \(q_{BO}(w_i|w_{i−2},w_{i−1})\):

## Returns a dataframe of 2 columns: ngram and prob.  Values in the ngram
## column are unobserved trigrams of the form: w3_w2_w1.  The values in the prob
## column are q_bo(w1 | w3, w2) calculated from equation 17.
##
## bigPre -  single-element char array of the form w2_w1 which are first two
##           words of the trigram we are predicting the tail word of
## qboObsBigrams - 2 column data.frame with the following columns -
##                 ngram: observed bigrams of the form w2_w1
##                 probs: the probability estimate for observed bigrams:
##                        qbo(w1 | w2) calc'd from equation 10.
## qboUnobsBigrams - 2 column data.frame with the following columns -
##                   ngram: unobserved bigrams of the form w2_w1
##                   probs: the probability estimate for unobserved bigrams
##                          qbo(w1 | w2) calc'd from equation 16.
## alphaTrig - total discounted probability mass at the trigram level
getUnobsTriProbs <- function(bigPre, qboObsBigrams,
                             qboUnobsBigrams, alphaTrig) {
    qboBigrams <- rbind(qboObsBigrams, qboUnobsBigrams)
    qboBigrams <- qboBigrams[order(-qboBigrams$prob), ]
    sumQboBigs <- sum(qboBigrams$prob)
    first_bigPre_word <- str_split(bigPre, "_")[[1]][1]
    unobsTrigNgrams <- paste(first_bigPre_word, qboBigrams$ngram, sep="_")
    unobsTrigProbs <- alphaTrig * qboBigrams$prob / sumQboBigs
    unobsTrigDf <- data.frame(ngram=unobsTrigNgrams, prob=unobsTrigProbs)
    
    return(unobsTrigDf)
}
qbo_unobs_trigrams <- getUnobsTriProbs(bigPre, qbo_obs_bigrams,
                                       qbo_unobs_bigrams, alpha_trig)
qbo_unobs_trigrams
##            ngram        prob
## 1 sell_the_house 0.357142857
## 2   sell_the_sos 0.035714286
## 3   sell_the_the 0.035714286
## 4   sell_the_eos 0.035714286
## 5   sell_the_buy 0.026785714
## 6  sell_the_sell 0.004464286
## 7 sell_the_paint 0.004464286

Step 5. Select \(w_i\) with the highest \(q_{BO}(w_i|w_{i−2},w_{i−1})\)

We’ve done all the calculations required to make our prediction. These are summarized in the table below:

getPredictionMsg <- function(qbo_trigs) {
    # pull off tail word of highest prob trigram
    prediction <- str_split(qbo_trigs$ngram[1], "_")[[1]][3]
    result <- sprintf("%s%s%s%.4f", "highest prob prediction is >>> ", prediction," <<< which has probability = ", qbo_trigs$prob[1])
    return(result)
}
qbo_trigrams <- rbind(qbo_obs_trigrams, qbo_unobs_trigrams)
qbo_trigrams <- qbo_trigrams[order(-qbo_trigrams$prob), ]  # sort by desc prob
out_msg <- getPredictionMsg(qbo_trigrams)
out_msg
## [1] "highest prob prediction is >>> book <<< which has probability = 0.5000"

Going a little deeper: Exploring an interesting question

Any good data scientist at this point would be asking themselves some questions about their results, especially if they have not worked with a particular algorithm before. A simple first check might be to test whether all the \(q_{BO}(w_i|w_{i−2},w_{i−1})\) values sum to 1:

sum(qbo_trigrams$prob)
## [1] 1

That looks O.K., so let’s explore something more interesting to see if we can deepen our understanding.

In the example we just completed, our prediction of book was based on the fact that \(q_{BO}(book|sell,the)\) was higher than any other \(q_{BO}(wi|sell,the)\). But this wasn’t really very interesting because sell the book was an observed trigram and the next closest probability \(q_{BO}(house|sell,the)\) was based on an unobserved trigram sell the house. This leads us to wonder if observed trigrams always trump unobserved trigrams.

We can prove to ourselves that this is not the case with a simple experiment. Let’s redo the above calculations with increased discount rates at both bigram and trigram levels. If we increase our discount rates from 0.5 to 0.7, what happens? If we set \(\gamma_2 = \gamma_3 = 0.7\), these are the results we get:

gamma2=0.7; gamma3=0.7  # initialize new discount rates

obs_trigs <- getObsTrigs(bigPre, trigrs)
unobs_trig_tails <- getUnobsTrigTails(obs_trigs$ngram, unigrs)
bo_bigrams <- getBOBigrams(bigPre, unobs_trig_tails)
# separate bigrams which use eqn 10 and those that use 16
obs_bo_bigrams <- getObsBOBigrams(bigPre, unobs_trig_tails, bigrs)
unobs_bo_bigrams <- getUnobsBOBigrams(bigPre, unobs_trig_tails, obs_bo_bigrams)
# calc obs'd bigram prob's from eqn 10
qbo_obs_bigrams <- getObsBigProbs(obs_bo_bigrams, unigrs, gamma2)
# calc alpha_big & unobs'd bigram prob's from eqn 16
unig <- str_split(bigPre, "_")[[1]][2]
unig <- unigrs[unigrs$ngram == unig,]
alpha_big <- getAlphaBigram(unig, bigrs, gamma2)
# distrib discounted bigram prob mass to unobs bigrams in prop to unigram ML
qbo_unobs_bigrams <- getQboUnobsBigrams(unobs_bo_bigrams, unigrs, alpha_big)
# calc trigram probabilities - start with observed trigrams: eqn 12
qbo_obs_trigrams <- getObsTriProbs(obs_trigs, bigrs, bigPre, gamma3)
# finally, calc trigram unobserved probabilities: eqn 17
bigram <- bigrs[bigrs$ngram %in% bigPre, ]
alpha_trig <- getAlphaTrigram(obs_trigs, bigram, gamma3)
qbo_unobs_trigrams <- getUnobsTriProbs(bigPre, qbo_obs_bigrams,
                                       qbo_unobs_bigrams, alpha_trig)
qbo_trigrams <- rbind(qbo_obs_trigrams, qbo_unobs_trigrams)
qbo_trigrams <- qbo_trigrams[order(-qbo_trigrams$prob), ]
getPredictionMsg(qbo_trigrams)
## [1] "highest prob prediction is >>> house <<< which has probability = 0.4351"