The goal here is to build a first, simple model of the relationship between words. This is the first step towards a predictive text mining application.
Tasks to accomplish:
Questions to consider:
The exploratory analysis report is located here.
Let's load the n-gram frequency dictionary (for training) and clean it of very rare n-grams. Unigrams and bigrams with a frequency of one can be discarded; rare 3-grams and 4-grams are kept, because after such filtering too few of them would remain.
ngram.freq.files.groups <- list(
list.files("../exploratory-analysis/output/freq-table",
pattern="twitter\\.txt\\.\\d+\\.rds$", full.names=T),
list.files("../exploratory-analysis/output/freq-table",
pattern="blogs\\.txt\\.\\d+\\.rds$", full.names=T),
list.files("../exploratory-analysis/output/freq-table",
pattern="news\\.txt\\.\\d+\\.rds$", full.names=T)
)
CombineNgramFreqTablesGroups <- function(files.groups) {
ngram.freq.table <- ComputeNgramFreqTablesGroup(files.groups[[1L]])
for (files.group in files.groups[-1L]) {
ngram.freq.table <- MergeNGramFreqTables(
ngram.freq.table,
ComputeNgramFreqTablesGroup(files.group))
}
return(ngram.freq.table)
}
ComputeNgramFreqTablesGroup <- function(files.group) {
ngram.freq.table <- CombinePartitions(files.group,
reduce.f=MergeNGramFreqTables)
  # keep all 3-grams and 4-grams; drop unigrams and bigrams that occur only once
  return(ngram.freq.table[(n >= 3L) | ((n < 3L) & (freq > 1L))])
}
if (!file.exists("./output/train.freq.table.rds")) {
train.freq.table <- CombineNgramFreqTablesGroups(ngram.freq.files.groups)
saveRDS(train.freq.table, "./output/train.freq.table.rds")
} else {
train.freq.table <- readRDS("./output/train.freq.table.rds")
}
setkey(train.freq.table, first.words, last.word, n)
summary(train.freq.table)
## first.words last.word n freq
## Length:3364350 Length:3364350 Min. :1.000 Min. : 2.00
## Class :character Class :character 1st Qu.:2.000 1st Qu.: 2.00
## Mode :character Mode :character Median :2.000 Median : 2.00
## Mean :2.291 Mean : 17.63
## 3rd Qu.:3.000 3rd Qu.: 5.00
## Max. :4.000 Max. :255534.00
A test set should also be built in order to estimate the algorithm's accuracy.
if (!file.exists("./output/test.freq.table.rds")) {
set.seed(123456L)
test.files.group <- sapply(ngram.freq.files.groups,
function(files.group) sample(files.group, 1L))
test.freq.table <- ComputeNgramFreqTablesGroup(test.files.group)
  # for each prefix, keep only its single most frequent continuation (bigrams and above)
  test.freq.table <- test.freq.table[n > 1L, .SD[which.max(freq)],
                                     by=first.words]
saveRDS(test.freq.table, "./output/test.freq.table.rds")
} else {
test.freq.table <- readRDS("./output/test.freq.table.rds")
}
setkey(test.freq.table, first.words, last.word, n)
summary(test.freq.table)
## first.words last.word n freq
## Length:7565 Length:7565 Min. :2.000 Min. : 2.00
## Class :character Class :character 1st Qu.:2.000 1st Qu.: 2.00
## Mode :character Mode :character Median :2.000 Median : 2.00
## Mean :2.499 Mean : 3.82
## 3rd Qu.:3.000 3rd Qu.: 3.00
## Max. :4.000 Max. :285.00
The model is expected to work in a mobile environment, so memory usage and latency are the most important criteria. Let's fix the following constraints:
Accuracy can be estimated by means of the holdout method: the data is split into a training corpus and a test corpus, and the trained algorithm is then evaluated against the test corpus.
With n-gram models it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. In practice, a trigram model (triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram model is often used with smaller ones.
Following this logic, a 5-gram dictionary would be overkill for the mobile domain.
If something entered by the user can't be found in the dictionary, the following exceptional cases are possible:
The first question is how to distinguish between these cases.
The only reliable way (as far as I can see) relies on a “hint” from the user: if the last character is whitespace, the third case (an incomplete last word) can be ruled out. But this is not user-friendly.
Is there an automatic approach? I think in most cases it is enough to check the last entered word against the dictionary: if no such unigram exists, the entered text should be considered without its last word. A minimal sketch of such a check is shown below.
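A minimal sketch of this check (a hypothetical helper, assuming unigrams are stored in the frequency table with n == 1 and the word itself in the last.word column):

IsLastWordComplete <- function(ngram.freq, query.text) {
  # split the query into lower-case tokens
  tokens <- strsplit(trimws(tolower(query.text)), "\\s+")[[1]]
  last.token <- tail(tokens, 1L)
  # if the last token exists as a unigram, treat it as a complete word;
  # otherwise the query should be considered without its last word
  nrow(ngram.freq[n == 1L & last.word == last.token]) > 0L
}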
Another question is related to misprints and typos. Different string metrics can be used here, for example q-gram, Jaccard, or cosine distance. Unfortunately this approach cannot easily be combined with Kneser-Ney smoothing (at least it is a very hard task), because all probabilities would have to be recomputed according to the chosen metric.
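For illustration only, correction candidates for a misspelled token could be ranked by such a metric, e.g. with the stringdist package (not used elsewhere in this project); the dictionary and typo below are made up:

library(stringdist)
dictionary <- c("wait", "want", "water", "watch", "what")
typo <- "wiat"
# 2-gram Jaccard distance between the typo and every dictionary word
distances <- stringdist(typo, dictionary, method="jaccard", q=2)
dictionary[order(distances)][1:3]  # closest correction candidates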
There is also the problem of balancing the weight given to infrequent n-grams (for example, a proper name that appeared in the training data) against frequent ones. Moreover, without smoothing, items not seen in the training data are assigned a probability of 0.0. For unseen but plausible data, one can introduce pseudocounts; pseudocounts are generally motivated on Bayesian grounds.
In practice it is necessary to smooth the probability distributions by also assigning non-zero probabilities to unseen words or n-grams. The reason is that models derived directly from the n-gram frequency counts have severe problems when confronted with any n-grams that have not explicitly been seen before – the zero-frequency problem.
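As a toy illustration of pseudocounts (additive/Laplace smoothing, which is not the smoothing used later in this report; the counts and vocabulary size below are made up):

# made-up counts of words observed after the context "cant wait"
counts <- c(see = 120, to = 80, till = 30)
vocab.size <- 1000                                # assumed vocabulary size
p.unsmoothed <- counts / sum(counts)              # unseen words get probability 0
p.smoothed <- (counts + 1) / (sum(counts) + vocab.size)
p.unseen <- 1 / (sum(counts) + vocab.size)        # any unseen word now gets a small non-zero probability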
Different smoothing approaches exist:
The most effective and universal approach (according to various sources) is Kneser-Ney smoothing. It will be considered in this project.
Let's start with a simple model.
Algorithm steps:
Testing phrase: I cant wait
Predicted words:
PredictNextWords(train.freq.table, "I cant wait",
result.prepare.strategy=function(result) result$last.word)
## [1] "see" "get" "till" "hear" "go" "til" "next" "read"
## [9] "meet" "watch"
# Measure per-query latency of the predictor with R's sampling profiler (Rprof).
ProfilePredictor <- function(predictor, ngram.freq, queries, max.ngram.count=10L) {
tmp.file <- tempfile()
sampling.time <- sapply(queries, FUN=function(query) {
Rprof(tmp.file)
do.call(predictor, list(ngram.freq=ngram.freq, query.text=query,
max.ngram.count=max.ngram.count))
Rprof(NULL)
return(summaryRprof(tmp.file)$sampling.time)
})
unlink(tmp.file)
return(sampling.time)
}
format(object.size(train.freq.table), units="MB")
## [1] "134 Mb"
Memory usage is small enough.
ngram.testing.queries <- c("first", "just", "cant", "will", "love",
"good luck", "one day", "look like", "good morning",
"can get",
"cant wait see", "dont even know",
"new york city", "happy new year",
"dont feel like")
ngram.sampling.time <- ProfilePredictor(PredictNextWords,
train.freq.table,
queries=ngram.testing.queries)
print(paste(round(mean(ngram.sampling.time), 3), "seconds"))
## [1] "0.54 seconds"
So average latency is acceptable in most cases.
The n-gram approach has a fundamental disadvantage when it comes to unseen terms. Accuracy is therefore estimated by a weak rule: if the expected word is contained in the top ten predictions, the prediction is considered valid.
# Fraction of test prefixes whose expected next word appears among the predictor's results.
EstimatePredictorAccuracy <- function(predictor,
train.freq.table,
test.freq.table,
max.ngram.count=10L) {
assert <- function(first.words, expected.word) {
next.words.freq.table <- predictor(ngram.freq=train.freq.table,
query.text=first.words,
max.ngram.count=max.ngram.count)
return(any(expected.word == next.words.freq.table$last.word))
}
assertion.result <- test.freq.table[, assert(first.words, last.word),
by=1:nrow(test.freq.table)]$V1
return(sum(assertion.result) / length(assertion.result))
}
Accuracy (percent):
format(EstimatePredictorAccuracy(PredictNextWords,
train.freq.table,
test.freq.table[sample(1:nrow(test.freq.table),
size=100)]) * 100,
digits=3L)
## [1] "82"
The quiz below has been taken from Coursera.
test.quiz.1 <- list("The guy in front of me just bought a pound of bacon, a bouquet, and a case of" = c("pretzels", "soda", "beer", "cheese"),
"You're the reason why I smile everyday. Can you follow me please? It would mean the" = c("most", "world", "universe", "best"),
"Hey sunshine, can you follow me and make me the" = c("bluest", "saddest", "happiest", "smelliest"),
"Very early observations on the Bills game: Offense still struggling but the" = c("referees", "defense", "players", "crowd"),
"Go on a romantic date at the" = c("movies", "grocery", "beach", "mall"),
"Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my" = c("phone", "horse", "way", "motorcycle"),
"Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some" = c("years", "weeks", "time", "thing"),
"After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little" = c("ears", "fingers", "toes", "eyes"),
"Be grateful for the good times and keep the faith during the" = c("hard", "worse", "sad", "bad"),
"If this isn't the cutest thing you've ever seen, then you must be" = c("insensitive", "asleep", "callous", "insane"))
# For each quiz question, record at which positions the answer options appear in the predictor's ranked output.
TestPredictorAgainstQuiz <- function(predictor, train.freq.table,
test.quiz,
max.ngram.count=500L) {
res <- lapply(names(test.quiz), FUN=function(query) {
expected.next.words <- test.quiz[[query]]
next.words.freq.table <- predictor(ngram.freq=train.freq.table,
query.text=query,
max.ngram.count=max.ngram.count)
rating.position <- next.words.freq.table[, which(last.word %in% expected.next.words)]
return(data.table(next.words=next.words.freq.table[rating.position, last.word],
rating.position=rating.position))
})
names(res) <- names(test.quiz)
return(res)
}
simple.predict.quiz.result <- TestPredictorAgainstQuiz(PredictNextWords,
train.freq.table,
test.quiz.1)
print(simple.predict.quiz.result)
## $`The guy in front of me just bought a pound of bacon, a bouquet, and a case of`
## next.words rating.position
## 1: beer 52
##
## $`You're the reason why I smile everyday. Can you follow me please? It would mean the`
## next.words rating.position
## 1: world 1
## 2: world 2
## 3: world 10
## 4: best 121
## 5: universe 466
##
## $`Hey sunshine, can you follow me and make me the`
## next.words rating.position
## 1: happiest 7
## 2: happiest 390
##
## $`Very early observations on the Bills game: Offense still struggling but the`
## Empty data.table (0 rows) of 2 cols: next.words,rating.position
##
## $`Go on a romantic date at the`
## Empty data.table (0 rows) of 2 cols: next.words,rating.position
##
## $`Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my`
## next.words rating.position
## 1: way 155
##
## $`Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some`
## next.words rating.position
## 1: time 1
## 2: time 7
## 3: years 82
## 4: thing 240
##
## $`After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little`
## next.words rating.position
## 1: fingers 253
## 2: eyes 330
##
## $`Be grateful for the good times and keep the faith during the`
## next.words rating.position
## 1: hard 166
##
## $`If this isn't the cutest thing you've ever seen, then you must be`
## next.words rating.position
## 1: insane 437
Let's try Kneser-Ney smoothing.
Kneser-Ney evolved from absolute-discounting interpolation, which makes use of both higher-order (i.e., higher-n) and lower-order language models, reallocating some probability mass from 4-grams or 3-grams to simpler unigram models.
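For reference, the standard interpolated Kneser-Ney estimate in the bigram case is

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\left(c(w_{i-1} w_i) - \delta,\; 0\right)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i)$$

where $\delta$ is the discount, $\lambda(w_{i-1}) = \frac{\delta}{c(w_{i-1})}\,\left|\{ w : c(w_{i-1} w) > 0 \}\right|$ is the weight of the probability mass reallocated to the lower-order model, and the continuation probability $P_{cont}(w_i) = \frac{\left|\{ w' : c(w' w_i) > 0 \}\right|}{\left|\{ (w', w'') : c(w' w'') > 0 \}\right|}$ counts how many distinct contexts a word follows rather than how often it occurs. The same scheme is applied recursively for the higher-order n-grams.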
source("../common/kneser-ney.R", chdir=T)
if (!file.exists("./output/kn.train.freq.table.rds")) {
kn.train.freq.table <- SmoothNgramFreqTableByKneserNey(train.freq.table)
saveRDS(kn.train.freq.table, "./output/kn.train.freq.table.rds")
} else {
kn.train.freq.table <- readRDS("./output/kn.train.freq.table.rds")
}
Testing phrase: I cant wait
Predicted words:
PredictNextWordsByKneserNey(kn.train.freq.table, "I cant wait",
result.prepare.strategy=function(result)
result$last.word)
## [1] "see" "get" "till" "hear" "go" "til" "next" "read"
## [9] "meet" "watch"
format(object.size(kn.train.freq.table), units="MB")
## [1] "159.6 Mb"
Memory usage is small enough.
ngram.sampling.time <- ProfilePredictor(PredictNextWordsByKneserNey,
kn.train.freq.table,
queries=ngram.testing.queries)
print(paste(round(mean(ngram.sampling.time), 3), "seconds"))
## [1] "0.335 seconds"
So average latency is acceptable in most cases.
kn.predict.quiz.result <- TestPredictorAgainstQuiz(PredictNextWordsByKneserNey,
kn.train.freq.table,
test.quiz.1)
print(kn.predict.quiz.result)
## $`The guy in front of me just bought a pound of bacon, a bouquet, and a case of`
## next.words rating.position
## 1: beer 54
##
## $`You're the reason why I smile everyday. Can you follow me please? It would mean the`
## next.words rating.position
## 1: world 1
## 2: world 2
## 3: world 10
## 4: best 121
## 5: universe 458
##
## $`Hey sunshine, can you follow me and make me the`
## next.words rating.position
## 1: happiest 8
## 2: happiest 405
##
## $`Very early observations on the Bills game: Offense still struggling but the`
## Empty data.table (0 rows) of 2 cols: next.words,rating.position
##
## $`Go on a romantic date at the`
## Empty data.table (0 rows) of 2 cols: next.words,rating.position
##
## $`Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my`
## next.words rating.position
## 1: way 82
##
## $`Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some`
## next.words rating.position
## 1: time 1
## 2: time 7
## 3: years 80
## 4: thing 227
##
## $`After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little`
## next.words rating.position
## 1: fingers 261
## 2: eyes 330
##
## $`Be grateful for the good times and keep the faith during the`
## next.words rating.position
## 1: hard 163
##
## $`If this isn't the cutest thing you've ever seen, then you must be`
## next.words rating.position
## 1: insane 453
ComparePredictorsByQuiz <- function(quiz.result.a, quiz.result.b) {
res <- lapply(names(quiz.result.a), function(query) {
a <- quiz.result.a[[query]][1]
b <- quiz.result.b[[query]][1]
return(data.frame(a=a$next.words, b=b$next.words,
rating.delta=a$rating.position - b$rating.position))
})
names(res) <- names(quiz.result.a)
return(res)
}
ComparePredictorsByQuiz(simple.predict.quiz.result, kn.predict.quiz.result)
## $`The guy in front of me just bought a pound of bacon, a bouquet, and a case of`
## a b rating.delta
## 1 beer beer -2
##
## $`You're the reason why I smile everyday. Can you follow me please? It would mean the`
## a b rating.delta
## 1 world world 0
##
## $`Hey sunshine, can you follow me and make me the`
## a b rating.delta
## 1 happiest happiest -1
##
## $`Very early observations on the Bills game: Offense still struggling but the`
## a b rating.delta
## 1 <NA> <NA> NA
##
## $`Go on a romantic date at the`
## a b rating.delta
## 1 <NA> <NA> NA
##
## $`Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my`
## a b rating.delta
## 1 way way 73
##
## $`Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some`
## a b rating.delta
## 1 time time 0
##
## $`After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little`
## a b rating.delta
## 1 fingers fingers -8
##
## $`Be grateful for the good times and keep the faith during the`
## a b rating.delta
## 1 hard hard 3
##
## $`If this isn't the cutest thing you've ever seen, then you must be`
## a b rating.delta
## 1 insane insane -16
The rating delta field shows the difference in prediction position between the simple predictor and the Kneser-Ney based predictor. The closer the position is to the beginning of the list, the better the algorithm.
According to the comparison results, Kneser-Ney smoothing has no noticeable effect in most cases.