Introduction

The goal here is to build a first simple model for the relationship between words. This is the first step in building a predictive text mining application.

Questions to consider:

  1. How to efficiently store an n-gram model (perhaps as a Markov chain)?
  2. How to use knowledge about word frequencies to make the model smaller and more efficient?
  3. How many parameters are really needed (i.e. how big should n be in the n-gram model)?
  4. What are simple ways to “smooth” the probabilities (for example, giving all n-grams a non-zero probability even if they aren’t observed in the data)?
  5. How to evaluate whether the model is any good?
  6. How to use backoff models to estimate the probability of unobserved n-grams?

Exploratory Analysis Results Overview

The exploratory analysis report is located here.

  1. A significant part of the dictionary consists of very rare words, so the dictionary can be reduced.
  2. Corpus coverage has a logarithmic dependency on dictionary size.
  3. The most efficient model can cover only ~70% of the language.
  4. Stemming can help to increase coverage, but it is a more complicated technique.

Let’s load the n-gram frequency dictionary (for training) and clean it of very rare n-grams. For example, rare unigrams and bigrams (frequency equal to one) can be discarded; 3-grams and 4-grams are kept, because the same cut-off would remove most of them.

ngram.freq.files.groups <- list(
    list.files("../exploratory-analysis/output/freq-table",
               pattern="twitter\\.txt\\.\\d+\\.rds$", full.names=T),
    list.files("../exploratory-analysis/output/freq-table",
               pattern="blogs\\.txt\\.\\d+\\.rds$", full.names=T),
    list.files("../exploratory-analysis/output/freq-table",
               pattern="news\\.txt\\.\\d+\\.rds$", full.names=T)
)
# Merge per-source frequency tables into one table, dropping rare short n-grams.
CombineNgramFreqTablesGroups <- function(files.groups) {
    ngram.freq.table <- ComputeNgramFreqTablesGroup(files.groups[[1L]])
    for (files.group in files.groups[-1L]) {
        ngram.freq.table <- MergeNGramFreqTables(
                ngram.freq.table,
                ComputeNgramFreqTablesGroup(files.group))
    }
    return(ngram.freq.table)
}
ComputeNgramFreqTablesGroup <- function(files.group) {
    ngram.freq.table <- CombinePartitions(files.group,
                                          reduce.f=MergeNGramFreqTables)
    # Keep all 3-grams and 4-grams; keep unigrams and bigrams only if freq > 1.
    return(ngram.freq.table[(n >= 3L) | ((n < 3L) & (freq > 1L))])
}
if (!file.exists("./output/train.freq.table.rds")) {
    train.freq.table <- CombineNgramFreqTablesGroups(ngram.freq.files.groups)
    saveRDS(train.freq.table, "./output/train.freq.table.rds")
} else {
    train.freq.table <- readRDS("./output/train.freq.table.rds")
}
setkey(train.freq.table, first.words, last.word, n)
summary(train.freq.table)
##  first.words         last.word               n              freq          
##  Length:3364350     Length:3364350     Min.   :1.000   Min.   :     2.00  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:     2.00  
##  Mode  :character   Mode  :character   Median :2.000   Median :     2.00  
##                                        Mean   :2.291   Mean   :    17.63  
##                                        3rd Qu.:3.000   3rd Qu.:     5.00  
##                                        Max.   :4.000   Max.   :255534.00

A test set should also be loaded to estimate the algorithm’s accuracy.

if (!file.exists("./output/test.freq.table.rds")) {
    set.seed(123456L)
    test.files.group <- sapply(ngram.freq.files.groups,
                               function(files.group) sample(files.group, 1L))
    test.freq.table <- ComputeNgramFreqTablesGroup(test.files.group)
    test.freq.table <- test.freq.table[n > 1L, .SD[which.max(freq)],
                                       by=first.words]
    saveRDS(test.freq.table, "./output/test.freq.table.rds")
} else {
    test.freq.table <- readRDS("./output/test.freq.table.rds")
}
setkey(test.freq.table, first.words, last.word, n)
summary(test.freq.table)
##  first.words         last.word               n              freq       
##  Length:7565        Length:7565        Min.   :2.000   Min.   :  2.00  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:  2.00  
##  Mode  :character   Mode  :character   Median :2.000   Median :  2.00  
##                                        Mean   :2.499   Mean   :  3.82  
##                                        3rd Qu.:3.000   3rd Qu.:  3.00  
##                                        Max.   :4.000   Max.   :285.00

Model Effectiveness Criteria

The model is expected to work in a mobile environment, so memory usage and latency are the most important criteria. Let’s fix the following constraints:

  1. Accuracy > 70%.
  2. Memory usage < 300 MB.
  3. Latency < 1 sec.

Accuracy can be estimated by means of the holdout method: the data is split into a training corpus and a test corpus, and the trained algorithm is evaluated against the test corpus.

Model Common Parameters

With n-gram models it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. This means that a trigram model (i.e. triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram model is often used with smaller ones.

From this point of view, a 5-gram dictionary would be overkill for the mobile domain.
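
To see how the table grows with n-gram order, one can count distinct n-gram types per n directly on the frequency table loaded above (a quick inspection sketch; output omitted here):

train.freq.table[, .(ngram.types = .N, total.freq = sum(freq)), by=n][order(n)]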

Handling Unseen N-grams

If something entered by the user can’t be found in the dictionary, the following cases are possible:

  1. It is a genuinely unseen n-gram.
  2. Some words contain misprints/typos (but can be found in their normal form).
  3. The last entered word is incomplete, and the user wants the application to propose appropriate completions.

The first question is how to distinguish between these cases.

The only reliable way (as far as I can see) is based on “help” from the user: if the last character is whitespace, the third case (incomplete last word) can be discarded. But this is not user-friendly.

Is there an automatic approach? In most cases it should be enough to look up the last entered word in the dictionary: if such a unigram does not exist, the entered text can be considered without the last word, as in the sketch below.
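
A minimal sketch of this check, assuming unigrams are stored in the same table with n == 1 (the helper name is hypothetical, not part of the project code):

# Drop the last word of the query if it is not a known unigram.
DropIncompleteLastWord <- function(ngram.freq, query.text) {
    words <- strsplit(tolower(trimws(query.text)), "\\s+")[[1]]
    last.entered <- words[length(words)]
    known <- nrow(ngram.freq[n == 1L & last.word == last.entered]) > 0L
    if (!known && length(words) > 1L) {
        # Treat the unknown last word as incomplete and predict from the rest.
        return(paste(head(words, -1L), collapse=" "))
    }
    return(query.text)
}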

Another question is related to misprints/typos. Different string metrics can be used, for example Q-gram, Jaccard, or cosine distance. Unfortunately, this approach can’t easily be combined with Kneser-Ney smoothing (at least it is a very hard task), because all probabilities would have to be recomputed according to the chosen metric.
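
As an illustration only (not part of the project code), the stringdist package provides these metrics; a hedged sketch of picking the dictionary unigram closest to a possibly misspelled word:

# Find the dictionary word nearest to `typed.word` by Jaccard distance over
# character 2-grams ("qgram" and "cosine" methods behave similarly).
library(stringdist)

NearestDictionaryWord <- function(dictionary.words, typed.word, max.dist=0.5) {
    dists <- stringdist(typed.word, dictionary.words, method="jaccard", q=2)
    best <- which.min(dists)
    if (dists[best] <= max.dist) dictionary.words[best] else typed.word
}

# NearestDictionaryWord(c("wait", "want", "water"), "waitt")  # -> "wait"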

Probabilities Smoothing

There is a problem of balancing weight between infrequent n-grams (for example, if a proper name appeared in the training data) and frequent n-grams. Also, without smoothing, items not seen in the training data are given a probability of 0.0. For unseen but plausible data, one can introduce pseudocounts; pseudocounts are generally motivated on Bayesian grounds.

In practice it is necessary to smooth the probability distributions by also assigning non-zero probabilities to unseen words or n-grams. The reason is that models derived directly from the n-gram frequency counts have severe problems when confronted with any n-grams that have not explicitly been seen before – the zero-frequency problem.
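
As a simple illustration of pseudocounts, an add-alpha (Laplace) estimate for unigram probabilities might look like the sketch below (assuming a unigram table with last.word and freq columns; the project uses Kneser-Ney instead):

# Every word in the vocabulary, seen or not, gets a non-zero probability.
LaplaceUnigramProb <- function(unigram.freq.table, word, vocabulary.size, alpha=1) {
    observed <- unigram.freq.table[last.word == word, sum(freq)]
    total <- unigram.freq.table[, sum(freq)]
    return((observed + alpha) / (total + alpha * vocabulary.size))
}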

Different smoothing approaches exist. The most effective and universal one (according to different sources) is Kneser-Ney; it will be considered in this project.

Simple Prediction Model

Let’s start with a simple model.

Model Overview

Algorithm steps (a sketch in R follows the list):

  1. Clean up the provided query (the same way the corpora have been cleaned up).
  2. Tokenize it and compute the number of words.
  3. Choose n (for the n-gram lookup) equal to the number of words plus one.
  4. Search for matching n-grams and order the results from common (high frequency or high probability) to rare.
  5. If nothing is found (or fewer results than some threshold), remove the first word from the query and go to step 3.
  6. The last words of the found n-grams represent the prediction result.
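
A hedged sketch of these steps, assuming the (first.words, last.word, n, freq) table layout used above (the project’s PredictNextWords also cleans the query and prepares the result):

PredictNextWordsSketch <- function(ngram.freq, query.text, max.ngram.count=10L) {
    words <- strsplit(tolower(trimws(query.text)), "\\s+")[[1]]
    while (length(words) > 0L) {
        context <- paste(words, collapse=" ")
        # n-grams one word longer than the remaining context, most frequent first.
        found <- ngram.freq[first.words == context & n == length(words) + 1L]
        if (nrow(found) > 0L) {
            return(head(found[order(-freq), last.word], max.ngram.count))
        }
        words <- words[-1L]  # back off: drop the first word and retry
    }
    return(character(0))
}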

Simple Model Smoke Testing

Testing phrase: I cant wait

Predicted words:

PredictNextWords(train.freq.table, "I cant wait",
                 result.prepare.strategy=function(result) result$last.word)
##  [1] "see"   "get"   "till"  "hear"  "go"    "til"   "next"  "read" 
##  [9] "meet"  "watch"

Simple Model Profiling

# Profile the predictor on each query and return Rprof sampling times (seconds).
ProfilePredictor <- function(predictor, ngram.freq, queries, max.ngram.count=10L) {
    tmp.file <- tempfile()
    sampling.time <- sapply(queries, FUN=function(query) {
        Rprof(tmp.file)
        do.call(predictor, list(ngram.freq=ngram.freq, query.text=query,
                                max.ngram.count=max.ngram.count))
        Rprof(NULL)
        return(summaryRprof(tmp.file)$sampling.time)
    })
    unlink(tmp.file)
    return(sampling.time)
}

Memory Usage

format(object.size(train.freq.table), units="MB")
## [1] "134 Mb"

Memory usage is small enough.

Performance

ngram.testing.queries <- c("first", "just", "cant", "will", "love",
                           "good luck", "one day", "look like", "good morning", 
                           "can get",
                           "cant wait see", "dont even know",
                           "new york city", "happy new year",
                           "dont feel like")
ngram.sampling.time <- ProfilePredictor(PredictNextWords,
                                        train.freq.table,
                                        queries=ngram.testing.queries)
print(paste(round(mean(ngram.sampling.time), 3), "seconds"))
## [1] "0.54 seconds"

So average latency is acceptable in most cases.

Simple Model Accuracy

The n-gram approach has a fundamental disadvantage in the context of unseen terms. So accuracy is estimated by a weak rule: if the expected word is contained in the top ten predictions, the prediction is considered valid.

# For each test row, check whether the expected word appears among the predictions.
EstimatePredictorAccuracy <- function(predictor,
                                      train.freq.table,
                                      test.freq.table,
                                      max.ngram.count=10L) {
    assert <- function(first.words, expected.word) {
        next.words.freq.table <- predictor(ngram.freq=train.freq.table,
                                           query.text=first.words,
                                           max.ngram.count=max.ngram.count)
        return(any(expected.word == next.words.freq.table$last.word))
    }
    assertion.result <- test.freq.table[, assert(first.words, last.word),
                                        by=1:nrow(test.freq.table)]$V1
    return(sum(assertion.result) / length(assertion.result))
}

Accuracy (percent):

format(EstimatePredictorAccuracy(PredictNextWords,
                                 train.freq.table,
                                 test.freq.table[sample(1:nrow(test.freq.table),
                                                        size=100)]) * 100,
       digits=3L)
## [1] "82"

Simple Model Quiz

The quiz has been taken from Coursera.

test.quiz.1 <- list("The guy in front of me just bought a pound of bacon, a bouquet, and a case of" = c("pretzels", "soda", "beer", "cheese"),
                    "You're the reason why I smile everyday. Can you follow me please? It would mean the" = c("most", "world", "universe", "best"),
                    "Hey sunshine, can you follow me and make me the" = c("bluest", "saddest", "happiest", "smelliest"),
                    "Very early observations on the Bills game: Offense still struggling but the" = c("referees", "defense", "players", "crowd"),
                    "Go on a romantic date at the" = c("movies", "grocery", "beach", "mall"),
                    "Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my" = c("phone", "horse", "way", "motorcycle"),
                    "Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some" = c("years", "weeks", "time", "thing"),
                    "After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little" = c("ears", "fingers", "toes", "eyes"),
                    "Be grateful for the good times and keep the faith during the" = c("hard", "worse", "sad", "bad"),
                    "If this isn't the cutest thing you've ever seen, then you must be" = c("insensitive", "asleep", "callous", "insane"))
TestPredictorAgainstQuiz <- function(predictor, train.freq.table, test.quiz,
                                     max.ngram.count=500L) {
    res <- lapply(names(test.quiz), FUN=function(query) {
        expected.next.words <- test.quiz[[query]]
        next.words.freq.table <- predictor(ngram.freq=train.freq.table,
                                           query.text=query,
                                           max.ngram.count=max.ngram.count)
        rating.position <- next.words.freq.table[, which(last.word %in% expected.next.words)]
        return(data.table(next.words=next.words.freq.table[rating.position, last.word],
                          rating.position=rating.position))
    })
    names(res) <- names(test.quiz)
    return(res)
}
simple.predict.quiz.result <- TestPredictorAgainstQuiz(PredictNextWords,
                                                       train.freq.table,
                                                       test.quiz.1)
print(simple.predict.quiz.result)
## $`The guy in front of me just bought a pound of bacon, a bouquet, and a case of`
##    next.words rating.position
## 1:       beer              52
## 
## $`You're the reason why I smile everyday. Can you follow me please? It would mean the`
##    next.words rating.position
## 1:      world               1
## 2:      world               2
## 3:      world              10
## 4:       best             121
## 5:   universe             466
## 
## $`Hey sunshine, can you follow me and make me the`
##    next.words rating.position
## 1:   happiest               7
## 2:   happiest             390
## 
## $`Very early observations on the Bills game: Offense still struggling but the`
## Empty data.table (0 rows) of 2 cols: next.words,rating.position
## 
## $`Go on a romantic date at the`
## Empty data.table (0 rows) of 2 cols: next.words,rating.position
## 
## $`Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my`
##    next.words rating.position
## 1:        way             155
## 
## $`Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some`
##    next.words rating.position
## 1:       time               1
## 2:       time               7
## 3:      years              82
## 4:      thing             240
## 
## $`After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little`
##    next.words rating.position
## 1:    fingers             253
## 2:       eyes             330
## 
## $`Be grateful for the good times and keep the faith during the`
##    next.words rating.position
## 1:       hard             166
## 
## $`If this isn't the cutest thing you've ever seen, then you must be`
##    next.words rating.position
## 1:     insane             437

Kneser-Ney Based Model

Let’s try Kneser-Ney smoothing.

Kneser-Ney evolved from absolute-discounting interpolation, which makes use of both higher-order (i.e., higher-n) and lower-order language models, reallocating some probability mass from 4-grams or 3-grams to simpler unigram models.
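
For reference, interpolated Kneser-Ney for bigrams combines a discounted count with a continuation probability. A minimal data.table sketch (assuming a bigram table with first.words, last.word and freq columns; the project’s SmoothNgramFreqTableByKneserNey is more general):

library(data.table)

KneserNeyBigram <- function(bigrams, d=0.75) {
    total.bigram.types <- nrow(bigrams)
    # Continuation probability: in how many distinct contexts does a word appear?
    cont <- bigrams[, .(p.cont = .N / total.bigram.types), by=last.word]
    # Context statistics: total count and number of distinct followers.
    ctx <- bigrams[, .(ctx.count = sum(freq), distinct.followers = .N),
                   by=first.words]
    res <- merge(bigrams, ctx, by="first.words")
    res <- merge(res, cont, by="last.word")
    res[, lambda := d * distinct.followers / ctx.count]
    res[, p.kn := pmax(freq - d, 0) / ctx.count + lambda * p.cont]
    return(res[])
}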

source("../common/kneser-ney.R", chdir=T)
if (!file.exists("./output/kn.train.freq.table.rds")) {
    kn.train.freq.table <- SmoothNgramFreqTableByKneserNey(train.freq.table)
    saveRDS(kn.train.freq.table, "./output/kn.train.freq.table.rds")
} else {
    kn.train.freq.table <- readRDS("./output/kn.train.freq.table.rds")
}

Kneser-Ney Model Smoke Testing

Testing phrase: I cant wait

Predicted words:

PredictNextWordsByKneserNey(kn.train.freq.table, "I cant wait",
                            result.prepare.strategy=function(result)
                                result$last.word)
##  [1] "see"   "get"   "till"  "hear"  "go"    "til"   "next"  "read" 
##  [9] "meet"  "watch"

Kneser-Ney Model Profiling

Memory Usage

format(object.size(kn.train.freq.table), units="MB")
## [1] "159.6 Mb"

Memory usage is small enough.

Performance

ngram.sampling.time <- ProfilePredictor(PredictNextWordsByKneserNey,
                                        kn.train.freq.table,
                                        queries=ngram.testing.queries)
print(paste(round(mean(ngram.sampling.time), 3), "seconds"))
## [1] "0.335 seconds"

So average latency is acceptable in most cases.

Kneser-Ney Model Quiz

kn.predict.quiz.result <- TestPredictorAgainstQuiz(PredictNextWordsByKneserNey,
                                                   kn.train.freq.table,
                                                   test.quiz.1)
print(kn.predict.quiz.result)
## $`The guy in front of me just bought a pound of bacon, a bouquet, and a case of`
##    next.words rating.position
## 1:       beer              54
## 
## $`You're the reason why I smile everyday. Can you follow me please? It would mean the`
##    next.words rating.position
## 1:      world               1
## 2:      world               2
## 3:      world              10
## 4:       best             121
## 5:   universe             458
## 
## $`Hey sunshine, can you follow me and make me the`
##    next.words rating.position
## 1:   happiest               8
## 2:   happiest             405
## 
## $`Very early observations on the Bills game: Offense still struggling but the`
## Empty data.table (0 rows) of 2 cols: next.words,rating.position
## 
## $`Go on a romantic date at the`
## Empty data.table (0 rows) of 2 cols: next.words,rating.position
## 
## $`Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my`
##    next.words rating.position
## 1:        way              82
## 
## $`Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some`
##    next.words rating.position
## 1:       time               1
## 2:       time               7
## 3:      years              80
## 4:      thing             227
## 
## $`After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little`
##    next.words rating.position
## 1:    fingers             261
## 2:       eyes             330
## 
## $`Be grateful for the good times and keep the faith during the`
##    next.words rating.position
## 1:       hard             163
## 
## $`If this isn't the cutest thing you've ever seen, then you must be`
##    next.words rating.position
## 1:     insane             453

Compare Kneser-Ney to Simple Model by Quiz

ComparePredictorsByQuiz <- function(quiz.result.a, quiz.result.b) {
    res <- lapply(names(quiz.result.a), function(query) {
        a <- quiz.result.a[[query]][1]
        b <- quiz.result.b[[query]][1]
        return(data.frame(a=a$next.words, b=b$next.words,
                          rating.delta=a$rating.position - b$rating.position))
    })
    names(res) <- names(quiz.result.a)
    return(res)
}
ComparePredictorsByQuiz(simple.predict.quiz.result, kn.predict.quiz.result)
## $`The guy in front of me just bought a pound of bacon, a bouquet, and a case of`
##      a    b rating.delta
## 1 beer beer           -2
## 
## $`You're the reason why I smile everyday. Can you follow me please? It would mean the`
##       a     b rating.delta
## 1 world world            0
## 
## $`Hey sunshine, can you follow me and make me the`
##          a        b rating.delta
## 1 happiest happiest           -1
## 
## $`Very early observations on the Bills game: Offense still struggling but the`
##      a    b rating.delta
## 1 <NA> <NA>           NA
## 
## $`Go on a romantic date at the`
##      a    b rating.delta
## 1 <NA> <NA>           NA
## 
## $`Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my`
##     a   b rating.delta
## 1 way way           73
## 
## $`Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some`
##      a    b rating.delta
## 1 time time            0
## 
## $`After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little`
##         a       b rating.delta
## 1 fingers fingers           -8
## 
## $`Be grateful for the good times and keep the faith during the`
##      a    b rating.delta
## 1 hard hard            3
## 
## $`If this isn't the cutest thing you've ever seen, then you must be`
##        a      b rating.delta
## 1 insane insane          -16

The rating delta field shows the difference in prediction position between the simple predictor and the Kneser-Ney based predictor. The closer the position to the beginning of the list, the better the algorithm.

According to the comparison results, Kneser-Ney smoothing has only a minor effect in most cases.

Conclusions

  1. In case of misprints/typos, different string metrics can be used (Q-gram, Jaccard, cosine).
  2. An unseen word entered by the user at the end of the text can be considered incomplete and ignored by the algorithm.
  3. Sometimes Kneser-Ney smoothing is better than nothing.