Data Science in R Specialization

Applying the Katz Backoff Algorithm:

As noted earlier, a corpus is a body of text from which we build and test LMs.

rm(list = ls())
library(quanteda)

## Package version: 2.1.2

## Parallel computing: 2 of 4 threads used.

## See https://quanteda.io for tutorials and examples.

## 
## Attaching package: 'quanteda'

## The following object is masked from 'package:utils':
## 
##     View

library(data.table)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)

First, we read in the text files…

blogs <- "final/en_US/en_US.blogs.txt"
blog_line <- readLines(blogs,encoding="UTF-8", skipNul = TRUE)

news <- "final/en_US/en_US.news.txt"
news_line <- readLines(news,encoding="UTF-8", skipNul = TRUE)

twitter <- "final/en_US/en_US.twitter.txt"
twitter_line <- readLines(twitter,encoding="UTF-8", skipNul = TRUE)

Create a sample dataset using 25% each of the `.txt` files

set.seed(42)

samplecorpus <- c(blog_line[sample(1:length(blog_line),length(blog_line)*0.25)],news_line[sample(1:length(news_line),length(news_line)*0.25)],twitter_line[sample(1:length(twitter_line),length(twitter_line)*0.25)])

Delete `blog_line`,`news_line` and `twitter_line` files to free up memory since we don’t need them anymore.

rm(list=c("blog_line","news_line","twitter_line"))

Step 1. i. Unigram, Bigram and Trigram counts

We will use the quanteda package to construct the n-gram tables. Many data scientists say it performs much faster than tm and RWeka for these types of tasks.

Create a corpus of words and then clean it using the `quanteda` package.

sample_corpus <-  corpus(samplecorpus)
sample_corpus_tokens <- tokens(sample_corpus,what = "word",
                        remove_punct = TRUE,
                        remove_symbols = TRUE,
                        remove_numbers = TRUE,
                        remove_url = TRUE,
                        remove_separators = TRUE,
                        split_hyphens = FALSE,
                        include_docvars = TRUE,
                        padding = FALSE)
sample_corpus_tokens <- tokens_tolower(sample_corpus_tokens)
sample_corpus_tokens <- tokens_wordstem(sample_corpus_tokens, 
                                 language = quanteda_options("language_stemmer"))
# sample_corpus_tokens <- tokens_select(sample_corpus_tokens, pattern = stopwords("en"), selection = "remove")

Delete `samplecorpus`and `sample_corpus` files to free up memory since we don’t need them anymore.

rm(list=c("samplecorpus","sample_corpus"))

Get frequencies of n-grams from the sample corpus…

dfm_sample_corpus_tokens <- dfm(sample_corpus_tokens)
unigrams_freq <- textstat_frequency(dfm_sample_corpus_tokens)  # unigram frequency
unigrs <- subset(unigrams_freq,select=c(feature,frequency))
names(unigrs) <- c("ngram","freq")
unigrs <- as.data.table(unigrs)

bigrams <- dfm(tokens_ngrams(sample_corpus_tokens, n = 2))
bigrams_freq <- textstat_frequency(bigrams)                    # bigram frequency
bigrs <- subset(bigrams_freq,select=c(feature,frequency))
names(bigrs) <- c("ngram","freq")
bigrs <- as.data.table(bigrs)

trigrams <- dfm(tokens_ngrams(sample_corpus_tokens, n = 3))
trigrams_freq <- textstat_frequency(trigrams)                  # trigram frequency
trigrs <- subset(trigrams_freq,select=c(feature,frequency))
names(trigrs) <- c("ngram","freq")
trigrs <- as.data.table(trigrs)

head(unigrs);head(bigrs);head(trigrs)

##    ngram    freq
## 1:   the 1191073
## 2:    to  688894
## 3:   and  603651
## 4:     a  594822
## 5:    of  501085
## 6:     i  414593

##      ngram   freq
## 1:  of_the 107170
## 2:  in_the 103034
## 3:  to_the  53450
## 4: for_the  50521
## 5:  on_the  48929
## 6:   to_be  41102

##            ngram freq
## 1:    one_of_the 8614
## 2:      a_lot_of 7532
## 3: thank_for_the 6078
## 4:     i_want_to 5372
## 5:       to_be_a 4673
## 6:      go_to_be 4413

Delete `dfm_sample_corpus_tokens`, `sample_corpus_tokens`, and the n-gram frequency files to free up memory since we don’t need them anymore.

rm(list=c("dfm_sample_corpus_tokens","sample_corpus_tokens","unigrams_freq","bigrams_freq","trigrams_freq","bigrams","trigrams"))

Step 1. ii. Selecting bigram and trigram discounts

For this example, we’ll use \(\gamma_2 = \gamma_3 = 0.5\) for the purpose of illustration. In practice, these values would be obtained by cross-validation. A great treatment of cross-validation can be found in Chapter 5 of this (free) book: “An Introduction to Statistical Learning” by Gareth James, et al.

Step 2. Select Bigram Prefix of Word to be Predicted

For this example, we’ll select the bigram: case of

Step 3. Calculate Probabilities of Words Completing Observed Trigrams

The code below finds the observed trigrams starting with the selected bigram prefix and calculates their probabilities.

Recall that if we define \(\gamma_2\) to be the amount of discount taken from observed bigram counts, and \(\gamma_3\) the amount of discount taken from observed trigram counts, and \(c^∗\) to be the new discounted counts for observed bigrams and trigrams after applying the discount, then the backed off probability estimates would be written as:

\(q_{BO}(w_i|w_{i−1})=c^∗(w_{i−1},w)c(w_{i−1})\) for observed bigrams,

where \(c^∗(w_{i−1},w)=c(w_{i−1},w)−\gamma_2\),

and

\(q_{BO}(w_i|w_{i−2},w_{i−1})=c^∗(w_{i−2},w_{i−1},w)c(w_{i−2},w_{i−1})\) for observed trigrams,

where \(c^∗(w_{i−2},w_{i−1},w)=c(w_{i−2},w_{i−1},w)−\gamma_3\).

## Returns a two column data.frame of observed trigrams that start with the
## bigram prefix (bigPre) in the first column named ngram and
## frequencies/counts in the second column named freq. If no observed trigrams
## that start with bigPre exist, an empty data.frame is returned.
##
## bigPre -  single-element char array of the form w2_w1 which are the first 
##           two words of the trigram we are predicting the tail word of
## trigrams - 2 column data.frame or data.table. The first column: ngram,
##            contains all the trigrams in the corpus. The second column:
##            freq, contains the frequency/count of each trigram.
getObsTrigs <- function(bigPre, trigrams) {
    trigs.winA <- data.frame(ngrams=vector(mode = 'character', length = 0),
                             freq=vector(mode = 'integer', length = 0))
    regex <- sprintf("%s%s%s", "^", bigPre, "_")
    trigram_indices <- grep(regex, trigrams$ngram)
    if(length(trigram_indices) > 0) {
        trigs.winA <- trigrams[trigram_indices, ]
    }
    
    return(trigs.winA)
}

## Returns a two column data.frame of observed trigrams that start with bigram
## prefix bigPre in the first column named ngram and the probabilities
## q_bo(w_i | w_i-2, w_i-1) in the second column named prob calculated from
## eqn 12. If no observed trigrams starting with bigPre exist, NULL is returned.
##
## obsTrigs - 2 column data.frame or data.table. The first column: ngram,
##            contains all the observed trigrams that start with the bigram
##            prefix bigPre which we are attempting to the predict the next
##            word of in a give phrase. The second column: freq, contains the
##            frequency/count of each trigram.
## bigrs - 2 column data.frame or data.table. The first column: ngram,
##         contains all the bigrams in the corpus. The second column:
##         freq, contains the frequency/count of each bigram.
## bigPre -  single-element char array of the form w2_w1 which are first two
##           words of the trigram we are predicting the tail word of
## triDisc - amount to discount observed trigrams
getObsTriProbs <- function(obsTrigs, bigrs, bigPre, triDisc=0.5) {
    if(nrow(obsTrigs) < 1) return(NULL)
    obsCount <- subset(bigrs, ngram == bigPre)$freq[1]
    obsTrigProbs <- mutate(obsTrigs, freq=((freq - triDisc) / obsCount))
    colnames(obsTrigProbs) <- c("ngram", "prob")
    
    return(obsTrigProbs)
}

gamma2 <- 0.5  # bigram discount
gamma3 <- 0.5  # trigram discount
bigPre <- 'case_of'

obs_trigrs <- getObsTrigs(bigPre, trigrs)  # get trigrams and counts
# convert counts to probabilities
qbo_obs_trigrams <- getObsTriProbs(obs_trigrs, bigrs, bigPre, gamma3)
head(qbo_obs_trigrams)

##            ngram       prob
## 1:   case_of_the 0.14897260
## 2:     case_of_a 0.06906393
## 3:  case_of_this 0.01541096
## 4:    case_of_an 0.01541096
## 5: case_of_emerg 0.01426941
## 6:  case_of_beer 0.01084475

Step 4. Calculate Probabilities of Words Completing Unobserved Trigrams

This is the the most complex step as it involves backing off to the bigram level. Here is a breakdown of the sub-steps for these calculations:

Find all the words that complete unobserved trigrams. These are the words in the set \(w\:\in\:\mathcal{B}(w_{i-2},\:w_{i-1})\) described earlier.

Calculate \(\alpha(w_{i-1}) = \left[1 - \sum\limits_{w\:\in\:\mathcal{A}(w_{i-1})} \frac{c^*(w_{i-1},\:w)}{c(w_{i-1})}\:\:\:\:\right]\)

Calculate \(q_{BO}\) for each bigram in the denominator of the following equation:

\(q_{BO}(w_i\:|\:w_{i-2},\:w_{i-1}) = \alpha(w_{i-2},\:w_{i-1})\frac{q_{BO}(w_i\:|\:w_{i-1})}{\sum\limits_{w\:\in\:\mathcal{B}(w_{i-2},\:w_{i-1})}q_{BO}(w\:|\:w_{i-1})}\)

using \(q_{BO}(w_i\:|\:w_{i-1}) = \frac{c^*(w_{i-1},\:w)}{c(w_{i-1})}\:\:\:\:\) if the bigram is observed, or

\(q_{BO}(w_i\:|\:w_{i-1}) = \alpha(w_{i-1})\frac{q_{ML}(w_i)}{\sum\limits_{w\:\in\:\mathcal{B}(w_{i-1})}q_{ML}(w)} = \alpha(w_{i-1})\frac{c(w_i)}{\sum\limits_{w\:\in\:\mathcal{B}(w_{i-1})}c(w)}\) if it is unobserved.

Calculate \(\alpha(w_{i−2},w_{i−1}) = \left [ 1 - \sum\limits_{w\:\in\:\mathcal{A}(w_{i-2},\:w_{i-1})} \frac{c^*(w_{i-2},\:w_{i-1},\:w)}{c(w_{i-2},\:w_{i-1})}\:\:\:\:\right]\)

Calculate \(q_{BO}(w_i\:|\:w_{i-2},\:w_{i-1}) = \alpha(w_{i-2},\:w_{i-1})\frac{q_{BO}(w_i\:|\:w_{i-1})}{\sum\limits_{w\:\in\:\mathcal{B}(w_{i-2},\:w_{i-1})}q_{BO}(w\:|\:w_{i-1})}\) for each \(w_i\)

Step 4. i. Find unobserved trigram tail words:

## Returns a character vector which are the tail words of unobserved trigrams
## that start with the first two words of obsTrigs (aka the bigram prefix).
## These are the words w in the set B(w_i-2, w_i-1) as defined in the section
## describing the details of equation 17.
##
## obsTrigs - character vector of observed trigrams delimited by _ of the form:
##            w3_w2_w1 where w3_w2 is the bigram prefix
## unigs - 2 column data.frame of all the unigrams in the corpus:
##         ngram = unigram
##         freq = frequency/count of each unigram

getUnobsTrigTails <- function(obsTrigs, unigs) {
    obs_trig_tails <- str_split_fixed(obsTrigs, "_", 3)[, 3]
    unobs_trig_tails <- unigs[!(unigs$ngram %in% obs_trig_tails), ]$ngram
    return(unobs_trig_tails)
}

unobs_trig_tails <- getUnobsTrigTails(obs_trigrs$ngram, unigrs)
head(unobs_trig_tails)

## [1] "to"  "and" "of"  "i"   "in"  "is"

Step 4. ii. Calculate discounted probability mass at the bigram level \(\alpha(w_{i−1})\):

## Returns the total probability mass discounted from all observed bigrams
## calculated from equation 14.  This is the amount of probability mass which
## is redistributed to UNOBSERVED bigrams. If no bigrams starting with
## unigram$ngram[1] exist, 0 is returned.
##
## unigram - single row, 2 column frequency table. The first column: ngram,
##           contains the w_i-1 unigram (2nd word of the bigram prefix). The
##           second column: freq, contains the frequency/count of this unigram.
## bigrams - 2 column data.frame or data.table. The first column: ngram,
##           contains all the bigrams in the corpus. The second column:
##           freq, contains the frequency or count of each bigram.
## bigDisc - amount to discount observed bigrams

getAlphaBigram <- function(unigram, bigrams, bigDisc=0.5) {
    # get all bigrams that start with unigram
    regex <- sprintf("%s%s%s", "^", unigram$ngram[1], "_")
    bigsThatStartWithUnig <- bigrams[grep(regex, bigrams$ngram),]
    if(nrow(bigsThatStartWithUnig) < 1) return(0)
    alphaBi <- 1 - (sum(bigsThatStartWithUnig$freq - bigDisc) / unigram$freq)
    
    return(alphaBi)
}

unig <- str_split(bigPre, "_")[[1]][2]
unig <- unigrs[unigrs$ngram == unig,]
alpha_big <- getAlphaBigram(unig, bigrs, gamma2)
alpha_big

## [1] 0.03618548

Step 4. iii. Calculate backed off probabilities \(q_{BO}\) for bigrams

The code below calculates \(q_{BO}(w_i|w_{i−1})\) for observed and unobserved bigrams:

## Returns a character vector of backed off bigrams of the form w2_w1. These 
## are all the (w_i-1, w) bigrams where w_i-1 is the tail word of the bigram
## prefix bigPre and w are the tail words of unobserved bigrams that start with
## w_i-1.
##
## bigPre - single-element char array of the form w2_w1 which are first two
##          words of the trigram we are predicting the tail word of
## unobsTrigTails - character vector that are tail words of unobserved trigrams
getBOBigrams <- function(bigPre, unobsTrigTails) {
    w_i_minus1 <- str_split(bigPre, "_")[[1]][2]
    boBigrams <- paste(w_i_minus1, unobsTrigTails, sep = "_")
    return(boBigrams)
}

## Returns a two column data.frame of backed-off bigrams in the first column
## named ngram and their frequency/counts in the second column named freq.
## 
## bigPre -  single-element char array of the form w2_w1 which are first two
##           words of the trigram we are predicting the tail word of
## unobsTrigTails - character vector that are tail words of unobserved trigrams
## bigrs - 2 column data.frame or data.table. The first column: ngram,
##         contains all the bigrams in the corpus. The second column:
##         freq, contains the frequency/count of each bigram.
getObsBOBigrams <- function(bigPre, unobsTrigTails, bigrs) {
    boBigrams <- getBOBigrams(bigPre, unobsTrigTails)
    obs_bo_bigrams <- bigrs[bigrs$ngram %in% boBigrams, ]
    return(obs_bo_bigrams)
}

## Returns a character vector of backed-off bigrams which are unobserved.
##
## bigPre -  single-element char array of the form w2_w1 which are first two
##           words of the trigram we are predicting the tail word of
## unobsTrigTails - character vector that are tail words of unobserved trigrams
## obsBOBigram - data.frame which contains the observed bigrams in a column
##               named ngram
getUnobsBOBigrams <- function(bigPre, unobsTrigTails, obsBOBigram) {
    boBigrams <- getBOBigrams(bigPre, unobsTrigTails)
    unobs_bigs <- boBigrams[!(boBigrams %in% obsBOBigram$ngram)]
    return(unobs_bigs)
}

## Returns a dataframe of 2 columns: ngram and probs.  Values in the ngram
## column are bigrams of the form: w2_w1 which are observed as the last
## two words in unobserved trigrams.  The values in the prob column are
## q_bo(w1 | w2) calculated from from equation 10.
##
## obsBOBigrams - a dataframe with 2 columns: ngram and freq. The ngram column
##                contains bigrams of the form w1_w2 which are observed bigrams
##                that are the last 2 words of unobserved trigrams (i.e. "backed
##                off" bigrams). The freq column contains integers that are
##                the counts of these observed bigrams in the corpus.
## unigs - 2 column data.frame of all the unigrams in the corpus:
##         ngram = unigram
##         freq = frequency/count of each unigram
## bigDisc - amount to discount observed bigrams
getObsBigProbs <- function(obsBOBigrams, unigs, bigDisc=0.5) {
    first_words <- str_split_fixed(obsBOBigrams$ngram, "_", 2)[, 1]
    first_word_freqs <- unigs[unigs$ngram %in% first_words, ]
    obsBigProbs <- (obsBOBigrams$freq - bigDisc) / first_word_freqs$freq
    obsBigProbs <- data.frame(ngram=obsBOBigrams$ngram, prob=obsBigProbs)
    
    return(obsBigProbs)
}

## Returns a dataframe of 2 columns: ngram and prob.  Values in the ngram
## column are unobserved bigrams of the form: w2_w1.  The values in the prob
## column are the backed off probability estimates q_bo(w1 | w2) calculated
## from from equation 16.
##
## unobsBOBigrams - character vector of unobserved backed off bigrams
## unigs - 2 column data.frame of all the unigrams in the corpus:
##         ngram = unigram
##         freq = frequency/count of each unigram
## alphaBig - total discounted probability mass at the bigram level
getQboUnobsBigrams <- function(unobsBOBigrams, unigs, alphaBig) {
    # get the unobserved bigram tails
    qboUnobsBigs <- str_split_fixed(unobsBOBigrams, "_", 2)[, 2]
    w_in_Aw_iminus1 <- unigs[!(unigs$ngram %in% qboUnobsBigs), ]
    # convert to data.frame with counts
    qboUnobsBigs <- unigs[unigs$ngram %in% qboUnobsBigs, ]
    denom <- sum(qboUnobsBigs$freq)
    # converts counts to probabilities
    qboUnobsBigs <- data.frame(ngram=unobsBOBigrams,
                               prob=(alphaBig * qboUnobsBigs$freq / denom))
    
    return(qboUnobsBigs)
}

# Get backed off bigrams
bo_bigrams <- getBOBigrams(bigPre, unobs_trig_tails) 

# Separate bigrams into observed and unobserved using the appropriate equations
obs_bo_bigrams <- getObsBOBigrams(bigPre, unobs_trig_tails, bigrs)
unobs_bo_bigrams <- getUnobsBOBigrams(bigPre, unobs_trig_tails, obs_bo_bigrams)

# Calculate observed bigram probabilities
qbo_obs_bigrams <- getObsBigProbs(obs_bo_bigrams, unigrs, gamma2) #ngram probs

# Calculate alpha_big and unobserved bigram probabilities        
unig <- str_split(bigPre, "_")[[1]][2]
unig <- unigrs[unigrs$ngram == unig,]

# Distribute discounted bigram probability mass to unobserved bigrams in proportion to unigram ML
qbo_unobs_bigrams <- getQboUnobsBigrams(unobs_bo_bigrams, unigrs, alpha_big)
qbo_bigrams <- rbind(qbo_obs_bigrams, qbo_unobs_bigrams)
head(qbo_bigrams)

##      ngram        prob
## 1 of_their 0.009506371
## 2  of_your 0.008853787
## 3   of_all 0.007331092
## 4   of_you 0.007175429
## 5 of_peopl 0.005381323
## 6  of_time 0.004443358

Let’s Check the Bigram Calculations

Before doing the final calculations for the unobserved trigrams, let’s do a simple check on our calculations at the bigram level. In the previous table, all the bigrams are unobserved which means that if we sum all the unobserved bigram probabilities, we should get the total bigram discount which is \(\alpha(w_{i−1})\). As we see below, this looks like it checks out.

unobs <- qbo_bigrams[-1,]
sum(unobs$prob)

## [1] 0.5111907

Step 4. iv. Calculate discounted probability mass at the trigram level \(\alpha(w_{i−2},w_{i−1})\)

The getAlphaTrigram function shown below implements the following equation:

\(\alpha(w_{i-2},\:w_{i-1}) =\left[ 1 - \sum\limits_{w\:\in\:\mathcal{A}(w_{i-2},\:w_{i-1})} \frac{c^*(w_{i-2},\:w_{i-1},\:w)}{c(w_{i-2},\:w_{i-1})}\:\:\:\:\right]\)

to compute the trigram discount. Here we use it to compute the trigram discount for \(q_{BO}(house|sell,the)\):

## Returns the total probability mass discounted from all observed trigrams.
## This is the amount of probability mass which is
## redistributed to UNOBSERVED trigrams. If no trigrams starting with
## bigram$ngram[1] exist, 1 is returned.
##
## obsTrigs - 2 column data.frame or data.table. The first column: ngram,
##            contains all the observed trigrams that start with the bigram
##            prefix we are attempting to the predict the next word of. The 
##            second column: freq, contains the frequency/count of each trigram.
## bigram - single row frequency table where the first col: ngram, is the bigram
##          which are the first two words of unobserved trigrams we want to
##          estimate probabilities of (same as bigPre in other functions listed
##          prior) delimited with '_'. The second column: freq, is the
##          frequency/count of the bigram listed in the ngram column.
## triDisc - amount to discount observed trigrams
getAlphaTrigram <- function(obsTrigs, bigram, triDisc=0.5) {
    if(nrow(obsTrigs) < 1) return(1)
    alphaTri <- 1 - sum((obsTrigs$freq - triDisc) / bigram$freq[1])
    
    return(alphaTri)
}

bigram <- bigrs[bigrs$ngram %in% bigPre, ]
alpha_trig <- getAlphaTrigram(obs_trigrs, bigram, gamma3)
alpha_trig

## [1] 0.298516

Step 4. v. Calculate unobserved trigram probabilities \(q_{BO}(w_i|w_{i−2},w_{i−1})\):

## Returns a dataframe of 2 columns: ngram and prob.  Values in the ngram
## column are unobserved trigrams of the form: w3_w2_w1.  The values in the prob
## column are q_bo(w1 | w3, w2) calculated from equation 17.
##
## bigPre -  single-element char array of the form w2_w1 which are first two
##           words of the trigram we are predicting the tail word of
## qboObsBigrams - 2 column data.frame with the following columns -
##                 ngram: observed bigrams of the form w2_w1
##                 probs: the probability estimate for observed bigrams:
##                        qbo(w1 | w2) calc'd from equation 10.
## qboUnobsBigrams - 2 column data.frame with the following columns -
##                   ngram: unobserved bigrams of the form w2_w1
##                   probs: the probability estimate for unobserved bigrams
##                          qbo(w1 | w2) calc'd from equation 16.
## alphaTrig - total discounted probability mass at the trigram level
getUnobsTriProbs <- function(bigPre, qboObsBigrams,
                             qboUnobsBigrams, alphaTrig) {
    qboBigrams <- rbind(qboObsBigrams, qboUnobsBigrams)
    qboBigrams <- qboBigrams[order(-qboBigrams$prob), ]
    sumQboBigs <- sum(qboBigrams$prob)
    first_bigPre_word <- str_split(bigPre, "_")[[1]][1]
    unobsTrigNgrams <- paste(first_bigPre_word, qboBigrams$ngram, sep="_")
    unobsTrigProbs <- alphaTrig * qboBigrams$prob / sumQboBigs
    unobsTrigDf <- data.frame(ngram=unobsTrigNgrams, prob=unobsTrigProbs)
    
    return(unobsTrigDf)
}

qbo_unobs_trigrams <- getUnobsTriProbs(bigPre, qbo_obs_bigrams,
                                       qbo_unobs_bigrams, alpha_trig)
head(qbo_unobs_trigrams)

##           ngram        prob
## 1 case_of_their 0.005450009
## 2  case_of_your 0.005075882
## 3   case_of_all 0.004202920
## 4   case_of_you 0.004113678
## 5 case_of_peopl 0.003085116
## 6  case_of_time 0.002547380

Step 5. Select \(w_i\) with the highest \(q_{BO}(w_i|w_{i−2},w_{i−1})\)

We’ve done all the calculations required to make our prediction. These are summarized in the table below:

getPredictionMsg <- function(qbo_trigs) {
    # pull off tail word of highest prob trigram
    prediction <- str_split(qbo_trigs$ngram[1], "_")[[1]][3]
    result <- sprintf("%s%s%s%.4f", "highest prob prediction is >>> ", prediction," <<< which has probability = ", qbo_trigs$prob[1])
    return(result)
}

qbo_trigrams <- rbind(qbo_obs_trigrams, qbo_unobs_trigrams)
qbo_trigrams <- qbo_trigrams[order(-qbo_trigrams$prob), ]  # sort by desc prob
out_msg <- getPredictionMsg(qbo_trigrams)
out_msg

## [1] "highest prob prediction is >>> the <<< which has probability = 0.1490"

Exploring an interesting question

A good data scientist would investigate his/her results further, especially if they have not worked with a particular algorithm before. A simple first check might be to test whether all the \(q_{BO}(w_i|w_{i−2},w_{i−1})\) values sum to 1:

sum(qbo_trigrams$prob)

## [1] 1

That looks O.K., so let’s explore something more interesting to see if we can deepen our understanding.

Let’s redo the above calculations with increased discount rates at both bigram and trigram levels. If we increase our discount rates from 0.5 to 0.7, what happens? If we set \(\gamma_2 = \gamma_3 = 0.7\), these are the results we get:

gamma2=0.7; gamma3=0.7  # initialize new discount rates

obs_trigs <- getObsTrigs(bigPre, trigrs)
unobs_trig_tails <- getUnobsTrigTails(obs_trigs$ngram, unigrs)
bo_bigrams <- getBOBigrams(bigPre, unobs_trig_tails)

# Separate bigrams into observed and unobserved using the appropriate equations
obs_bo_bigrams <- getObsBOBigrams(bigPre, unobs_trig_tails, bigrs)
unobs_bo_bigrams <- getUnobsBOBigrams(bigPre, unobs_trig_tails, obs_bo_bigrams)

# Calculate observed bigram probabilites
qbo_obs_bigrams <- getObsBigProbs(obs_bo_bigrams, unigrs, gamma2)

# Calculate alpha_big and unobserved bigram probabilities
unig <- str_split(bigPre, "_")[[1]][2]
unig <- unigrs[unigrs$ngram == unig,]
alpha_big <- getAlphaBigram(unig, bigrs, gamma2)

# Distribute discounted bigram probability mass to unobserved bigrams in proportion to unigram ML
qbo_unobs_bigrams <- getQboUnobsBigrams(unobs_bo_bigrams, unigrs, alpha_big)

# Calculate observed trigram probabilities...
qbo_obs_trigrams <- getObsTriProbs(obs_trigs, bigrs, bigPre, gamma3)

# Finally, calculate unobserved trigram probabilities...
bigram <- bigrs[bigrs$ngram %in% bigPre, ]
alpha_trig <- getAlphaTrigram(obs_trigs, bigram, gamma3)
qbo_unobs_trigrams <- getUnobsTriProbs(bigPre, qbo_obs_bigrams,
                                       qbo_unobs_bigrams, alpha_trig)
qbo_trigrams <- rbind(qbo_obs_trigrams, qbo_unobs_trigrams)
qbo_trigrams <- qbo_trigrams[order(-qbo_trigrams$prob), ]
getPredictionMsg(qbo_trigrams)

## [1] "highest prob prediction is >>> the <<< which has probability = 0.1487"

Data Science in R Specialization - Capstone Project

Katz Backoff Prediction Model - Week 3

Ken Wood

10/17/2020

Applying the Katz Backoff Algorithm:

First, we read in the text files…

Create a sample dataset using 25% each of the `.txt` files

Delete `blog_line`,`news_line` and `twitter_line` files to free up memory since we don’t need them anymore.

Step 1. i. Unigram, Bigram and Trigram counts

Create a corpus of words and then clean it using the `quanteda` package.

Delete `samplecorpus`and `sample_corpus` files to free up memory since we don’t need them anymore.

Get frequencies of n-grams from the sample corpus…

Delete `dfm_sample_corpus_tokens`, `sample_corpus_tokens`, and the n-gram frequency files to free up memory since we don’t need them anymore.

Step 1. ii. Selecting bigram and trigram discounts

Step 2. Select Bigram Prefix of Word to be Predicted

Step 3. Calculate Probabilities of Words Completing Observed Trigrams

Step 4. Calculate Probabilities of Words Completing Unobserved Trigrams

Step 4. i. Find unobserved trigram tail words:

Step 4. ii. Calculate discounted probability mass at the bigram level \(\alpha(w_{i−1})\):

Step 4. iii. Calculate backed off probabilities \(q_{BO}\) for bigrams

Let’s Check the Bigram Calculations

Step 4. iv. Calculate discounted probability mass at the trigram level \(\alpha(w_{i−2},w_{i−1})\)

Step 4. v. Calculate unobserved trigram probabilities \(q_{BO}(w_i|w_{i−2},w_{i−1})\):

Step 5. Select \(w_i\) with the highest \(q_{BO}(w_i|w_{i−2},w_{i−1})\)

Exploring an interesting question

Data Science in R Specialization - Capstone Project

Katz Backoff Prediction Model - Week 3

Ken Wood

10/17/2020

Applying the Katz Backoff Algorithm:

First, we read in the text files…

Create a sample dataset using 25% each of the .txt files

Delete blog_line,news_line and twitter_line files to free up memory since we don’t need them anymore.

Step 1. i. Unigram, Bigram and Trigram counts

Create a corpus of words and then clean it using the quanteda package.

Delete samplecorpusand sample_corpus files to free up memory since we don’t need them anymore.

Get frequencies of n-grams from the sample corpus…

Delete dfm_sample_corpus_tokens, sample_corpus_tokens, and the n-gram frequency files to free up memory since we don’t need them anymore.

Step 1. ii. Selecting bigram and trigram discounts

Step 2. Select Bigram Prefix of Word to be Predicted

Step 3. Calculate Probabilities of Words Completing Observed Trigrams

Step 4. Calculate Probabilities of Words Completing Unobserved Trigrams

Step 4. i. Find unobserved trigram tail words:

Step 4. ii. Calculate discounted probability mass at the bigram level \(\alpha(w_{i−1})\):

Step 4. iii. Calculate backed off probabilities \(q_{BO}\) for bigrams

Let’s Check the Bigram Calculations

Step 4. iv. Calculate discounted probability mass at the trigram level \(\alpha(w_{i−2},w_{i−1})\)

Step 4. v. Calculate unobserved trigram probabilities \(q_{BO}(w_i|w_{i−2},w_{i−1})\):

Step 5. Select \(w_i\) with the highest \(q_{BO}(w_i|w_{i−2},w_{i−1})\)

Exploring an interesting question

Create a sample dataset using 25% each of the `.txt` files

Delete `blog_line`,`news_line` and `twitter_line` files to free up memory since we don’t need them anymore.

Create a corpus of words and then clean it using the `quanteda` package.

Delete `samplecorpus`and `sample_corpus` files to free up memory since we don’t need them anymore.

Delete `dfm_sample_corpus_tokens`, `sample_corpus_tokens`, and the n-gram frequency files to free up memory since we don’t need them anymore.