Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking, and a whole range of other activities. But typing on mobile devices can be a serious pain. Smart keyboards make it easier for people to type on their mobile devices, and one cornerstone of a smart keyboard is its predictive text model.
When someone types:
I went to the
the keyboard presents several options for what the next word might be. For example:
gym, store, restaurant
This project is aimed at building predictive text models like those used by SwiftKey.
The first step in building a predictive model for text is understanding the distribution of and relationships between the words, tokens, and phrases in the text. The goal of this report is to understand the basic relationships observed in the data and to prepare for building the first linguistic models.
if (!file.exists("./raw-data.zip")) {
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile="./raw-data.zip", method="auto")
}
if (!file.exists("./input")) {
unzip("./raw-data.zip", files=c("final/en_US/en_US.blogs.txt", "final/en_US/en_US.news.txt", "final/en_US/en_US.twitter.txt"), exdir="input", junkpaths=T)
}
The data consists of plain text files; each line in a file represents an individual message or post.
File sizes, line counts, and word counts:
# Compute size, line count, and word count of a text file.
# Requires the stringi package for stri_count_words().
GetTextFileStat <- function(file.path) {
  size.mb <- round(file.info(file.path)$size / 1024 / 1024)
  lines <- readLines(file.path, warn=F, skipNul=T)
  word.count <- sum(stri_count_words(lines))
  return(list("Size (Mb)"=size.mb, "Line count"=length(lines),
              "Word count"=word.count))
}
Statistics for en_US.blogs.txt:
if (!file.exists("./output/en_US.blogs.txt.stat.rds")) {
en_US.blogs.txt.stat <- GetTextFileStat("./input/en_US.blogs.txt")
saveRDS(en_US.blogs.txt.stat, "./output/en_US.blogs.txt.stat.rds")
} else {
en_US.blogs.txt.stat <- readRDS("./output/en_US.blogs.txt.stat.rds")
}
print(en_US.blogs.txt.stat)
## $`Size (Mb)`
## [1] 200
##
## $`Line count`
## [1] 899288
##
## $`Word count`
## [1] 38031339
Statistics for en_US.news.txt:
if (!file.exists("./output/en_US.news.txt.stat.rds")) {
en_US.news.txt.stat <- GetTextFileStat("./input/en_US.news.txt")
saveRDS(en_US.news.txt.stat, "./output/en_US.news.txt.stat.rds")
} else {
en_US.news.txt.stat <- readRDS("./output/en_US.news.txt.stat.rds")
}
print(en_US.news.txt.stat)
## $`Size (Mb)`
## [1] 196
##
## $`Line count`
## [1] 77259
##
## $`Word count`
## [1] 2689560
Statistics for en_US.twitter.txt:
if (!file.exists("./output/en_US.twitter.txt.stat.rds")) {
en_US.twitter.txt.stat <- GetTextFileStat("./input/en_US.twitter.txt")
saveRDS(en_US.twitter.txt.stat, "./output/en_US.twitter.txt.stat.rds")
} else {
en_US.twitter.txt.stat <- readRDS("./output/en_US.twitter.txt.stat.rds")
}
print(en_US.twitter.txt.stat)
## $`Size (Mb)`
## [1] 159
##
## $`Line count`
## [1] 2360148
##
## $`Word count`
## [1] 30193150
The data should be cleaned of meaningless parts. For example, punctuation and stop words (is, are, the, a) are not relevant to the word prediction task. The following should be removed: punctuation, numbers, stop words, profanity and swear words, URLs, email addresses, and account mentions.
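The CleanCorpus helper used in the partitioning code below is not shown in this report. A minimal sketch of such a cleaning step, assuming the tm package and an externally loaded profanity word list, might look like this:

library(tm)

# A possible CleanCorpus implementation (an assumption, not the report's actual
# code): applies the cleaning steps listed above to a tm VCorpus.
CleanCorpus <- function(corpus) {
  RemovePattern <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
  corpus <- tm_map(corpus, RemovePattern, "(http|https)://[^ ]+")  # URLs
  corpus <- tm_map(corpus, RemovePattern, "[^ ]+@[^ ]+")           # email addresses
  corpus <- tm_map(corpus, RemovePattern, "@[a-zA-Z0-9_]+")        # account mentions
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # profanity.words is assumed to be loaded from an external word list:
  # corpus <- tm_map(corpus, removeWords, profanity.words)
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}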
According to the files' statistics, the data is too large to handle in memory, so it should be sampled before any analysis.
Following the holdout method, the combined data should be split into two groups: training and test. More complicated techniques, such as k-fold cross-validation, require much more time and memory; to pave the way for them, the data is split into small chunks while cleaning.
chunk.line.count <- 8L*1024L
# chunk.reader cleans each block of lines; chunk.skiper just advances the
# connection past a block without cleaning it.
PartitionCorpus(in.dir="./input/", out.dir="./output/cleaned-corpus",
                chunk.reader=function(con) {
                  lines <- readLines(con, n=chunk.line.count,
                                     warn=F, skipNul=T)
                  corpus <- CleanCorpus(VCorpus(VectorSource(lines)))
                  return(corpus)
                },
                chunk.skiper=function(con) {
                  readLines(con, n=chunk.line.count,
                            warn=F, skipNul=T)
                })
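PartitionCorpus itself is defined elsewhere in the project. A rough sketch of what such a helper could do, assuming each processed chunk is saved as an .rds file, might be:

# Hypothetical sketch of PartitionCorpus (not the project's actual code):
# reads each input file chunk by chunk and saves every cleaned chunk as an
# .rds file; chunk.skiper could be used to skip chunks (e.g. for subsampling)
# but is ignored in this simplified version.
PartitionCorpus <- function(in.dir, out.dir, chunk.reader, chunk.skiper=NULL) {
  if (!dir.exists(out.dir)) dir.create(out.dir, recursive=TRUE)
  for (file.name in list.files(in.dir, pattern=".txt$")) {
    con <- file(file.path(in.dir, file.name), open="r")
    chunk.id <- 0L
    repeat {
      chunk <- chunk.reader(con)
      if (length(chunk) == 0L) break
      chunk.id <- chunk.id + 1L
      saveRDS(chunk, file.path(out.dir,
                               sprintf("%s.%04d.rds", file.name, chunk.id)))
    }
    close(con)
  }
}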
An n-gram is a contiguous sequence of n items from a given sequence of text; in our case, an item is a word.
N-gram frequencies form the basis of the prediction model: they tell us how likely each word is to follow a given phrase.
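As a small illustration (not part of the original analysis), RWeka's NGramTokenizer can list all 1- to 4-grams of a short phrase:

library(RWeka)

# 1- to 4-grams of a short example phrase
NGramTokenizer("i went to the store", Weka_control(min=1L, max=4L))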
Let's compute a term-document matrix and an n-gram frequency table for each chunk, then combine a 5% sample of the chunks for further analysis.
tokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1L, max=4L))
MapPartitions(f=function(corpus) TermDocumentMatrix(corpus, control=list(tokenize=tokenizer)),
              in.dir="./output/cleaned-corpus",
              out.dir="./output/term-doc-matrix")
MapPartitions(f=ToNGramFreqTable,
              in.dir="./output/term-doc-matrix",
              out.dir="./output/freq-table")
ngram.freq.files <- list.files("./output/freq-table", pattern=".rds$", full.names=T)
set.seed(123L)
ngram.freq.files.index <- rbinom(length(ngram.freq.files), size=1L, prob=0.05) == 1L
sampled.ngram.freq.files <- ngram.freq.files[ngram.freq.files.index]
train.freq.table <- CombinePartitions(sampled.ngram.freq.files,
                                      reduce.f=MergeNGramFreqTables)
summary(train.freq.table)
## first.words last.word n freq
## Length:5325927 Length:5325927 Min. :1.000 Min. : 1.00
## Class :character Class :character 1st Qu.:2.000 1st Qu.: 1.00
## Mode :character Mode :character Median :3.000 Median : 1.00
## Mean :3.006 Mean : 1.56
## 3rd Qu.:4.000 3rd Qu.: 1.00
## Max. :4.000 Max. :16462.00
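The helpers ToNGramFreqTable and MergeNGramFreqTables (along with the chunk-wise MapPartitions and CombinePartitions wrappers) are defined elsewhere in the project and not shown here. A minimal sketch of the two frequency-table helpers, consistent with the column layout above and assuming the data.table, slam, and stringi packages, might look like this:

library(data.table)
library(stringi)

# Hypothetical sketch (not the project's actual code): turn a TermDocumentMatrix
# into a frequency table with columns first.words, last.word, n, freq.
ToNGramFreqTable <- function(tdm) {
  freq <- slam::row_sums(tdm)            # total count of each n-gram in the chunk
  words <- stri_split_fixed(names(freq), " ")
  data.table(first.words=vapply(words, function(w) paste(head(w, -1L), collapse=" "),
                                character(1)),
             last.word=vapply(words, function(w) tail(w, 1L), character(1)),
             n=lengths(words),
             freq=as.integer(freq))
}

# Hypothetical sketch: merge two frequency tables, summing counts of identical n-grams.
MergeNGramFreqTables <- function(a, b) {
  rbindlist(list(a, b))[, list(freq=sum(freq)), by=list(first.words, last.word, n)]
}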
A unigram is a single word.
# Plot the given n-grams as a horizontal bar chart ordered by frequency.
DrawNGrams <- function(ngram.freq) {
  ordered.ngram.freq <- ngram.freq[, list(ngram=paste(first.words, last.word),
                                          freq)]
  ordered.ngram.freq$ngram <- reorder(ordered.ngram.freq$ngram,
                                      ordered.ngram.freq$freq)
  ggplot(ordered.ngram.freq, aes(x=ngram, y=freq)) +
    geom_bar(stat="identity") + coord_flip() +
    xlab("N-gram") + ylab("Frequency")
}
unigram.freq <- train.freq.table[n == 1L][order(-freq)]
DrawNGrams(head(unigram.freq, top.word.count))
hist(log2(unigram.freq$freq), main="Unigram Frequencies Histogram", xlab="Log2 Frequency", ylab="")
A significant part of the dictionary consists of low-frequency words.
gram2.freq <- train.freq.table[n == 2L][order(-freq)]
DrawNGrams(head(gram2.freq, top.word.count))
hist(log2(gram2.freq$freq), main="2-Gram Frequencies Histogram", xlab="Log2 Frequency", ylab="")
Similar to the unigram histogram, there are a lot of rare terms. The 2-gram frequencies are roughly 10 times lower than the unigram frequencies, but the 2-gram dictionary is roughly 10 times larger.
gram3.freq <- train.freq.table[n == 3L][order(-freq)]
DrawNGrams(head(gram3.freq, top.word.count))
hist(log2(gram3.freq$freq), main="3-Gram Frequencies Histogram", xlab="Log2 Frequency", ylab="")
gram4.freq <- train.freq.table[n == 4L][order(-freq)]
DrawNGrams(head(gram4.freq, top.word.count))
hist(log2(gram4.freq$freq), main="4-Gram Frequencies Histogram", xlab="Log2 Frequency", ylab="")
Coverage is a useful metric for estimating the potential effectiveness of the prediction model.
How does dictionary size impact corpus coverage?
# Cumulative corpus coverage when adding words in decreasing order of frequency
dictionary.size.to.coverage <- cumsum(unigram.freq$freq * 100 / sum(unigram.freq$freq))
plot(x=seq_along(dictionary.size.to.coverage),
     y=dictionary.size.to.coverage,
     type="l", main="Coverage", xlab="Dictionary Size (words)", ylab="Coverage (percent)")
The plot shows that coverage grows roughly logarithmically with dictionary size. To make better predictions, the dictionary may be reduced by removing very rare or occasional words and phrases.
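As a small addition (not in the original analysis), the cumulative coverage vector computed above can be used to estimate how many of the most frequent words are needed to reach a given coverage level:

# Number of most frequent words needed to reach a given coverage percentage
WordsForCoverage <- function(coverage, target) min(which(coverage >= target))
WordsForCoverage(dictionary.size.to.coverage, 50)
WordsForCoverage(dictionary.size.to.coverage, 90)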
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form.
Stemming can help increase coverage and reduce the dictionary size. But the result of a prediction should be a word, not a word stem, so each stem would have to be mapped back to a set of words. Such a complicated technique would have to be justified by a substantial gain in efficiency.
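A quick way to gauge the potential gain (an illustration, not part of the original analysis) is to stem the unigram dictionary with the SnowballC package and compare sizes; this assumes that for unigrams the word itself is stored in the last.word column:

library(SnowballC)

# Ratio of stemmed dictionary size to original dictionary size
stemmed.words <- wordStem(unigram.freq$last.word, language="english")
length(unique(stemmed.words)) / nrow(unigram.freq)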
The n-gram frequencies computed above are biased, so the coverage estimate is biased too. To estimate language coverage, external sources, such as dictionary word counts, could be used.
Another way is to use statistical techniques such as Good-Turing frequency estimation.
Good–Turing frequency estimation is a statistical technique for estimating the probability of encountering an object of a hitherto unseen species, given a set of past observations of objects from different species.
In other words, the probability of unseen terms is estimated from the distribution of already observed frequencies (under a binomial distribution assumption).
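Concretely, the simple Good-Turing estimate of the total unseen probability mass is P0 = N1 / N, where N1 is the number of terms observed exactly once and N is the total number of observations. It can be checked directly (a small addition, not in the original analysis):

# Simple Good-Turing estimate of the unseen probability: P0 = N1 / N
sum(unigram.freq$freq == 1L) / sum(unigram.freq$freq)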
Probability of an unseen unigram:
# goodTuring() is provided by the edgeR package
paste(round(goodTuring(unigram.freq$freq)$P0 * 100, 2), "%")
## [1] "2.97 %"
This probability is very low. Another rough estimate of language coverage can be made from the size of the English vocabulary.
How many words are there in the English language?
The Second Edition of the 20-volume Oxford English Dictionary contains full entries for 171,476 words in current use…
The number of words in our dictionary equals 122,477, so language coverage can be estimated to be about 1.4 times lower than the corpus coverage (171,476 / 122,477 ≈ 1.4).
The basic modeling report is located here.