Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking, and a whole range of other activities. But typing on mobile devices can be a serious pain. Smart keyboards make it easier for people to type on their mobile devices, and one cornerstone of a smart keyboard is its predictive text model.

When someone types:

I went to the

the keyboard presents several options for what the next word might be. For example:

gym, store, restaurant

This project is aimed at building predictive text models like those used by SwiftKey.

The first step in building a predictive model for text is understanding the distribution of and relationships between the words, tokens, and phrases in the text. The goal of this report is to understand the basic relationships observed in the data and to prepare for building the first linguistic models.

Steps:

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider:

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset?
  3. How many unique words are needed in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%?
  4. How can we evaluate how many of the words come from foreign languages?
  5. How can coverage be increased - for example, by identifying words that may not be in the corpora, or by using a smaller number of dictionary words to cover the same number of phrases?

Downloading the Data

if (!file.exists("./raw-data.zip")) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile="./raw-data.zip", method="auto")
}
if (!file.exists("./input")) {
    unzip("./raw-data.zip", files=c("final/en_US/en_US.blogs.txt", "final/en_US/en_US.news.txt", "final/en_US/en_US.twitter.txt"), exdir="input", junkpaths=T)
}

The data consists of plain text files. Each line in a file represents an individual message or post.
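
A quick way to confirm this structure is to peek at the first couple of lines of a file (a minimal check, assuming the files were unzipped into ./input as above):

# Print the first two lines of the Twitter file as a sanity check
readLines("./input/en_US.twitter.txt", n=2L, warn=FALSE, skipNul=TRUE)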

Data File Statistics

Sizes, line counts, and word counts:

library(stringi)  # provides stri_count_words()

# Compute size, line count, and word count for a single text file
GetTextFileStat <- function(file.path) {
    size.mb <- round(file.info(file.path)$size / 1024 / 1024)
    lines <- readLines(file.path, warn=F, skipNul=T)
    word.count <- sum(stri_count_words(lines))
    return(list("Size (Mb)"=size.mb, "Line count"=length(lines),
                "Word count"=word.count))
}

File en_US.blogs.txt stat:

if (!file.exists("./output/en_US.blogs.txt.stat.rds")) {
    en_US.blogs.txt.stat <- GetTextFileStat("./input/en_US.blogs.txt")
    saveRDS(en_US.blogs.txt.stat, "./output/en_US.blogs.txt.stat.rds")
} else {
    en_US.blogs.txt.stat <- readRDS("./output/en_US.blogs.txt.stat.rds")
}
print(en_US.blogs.txt.stat)
## $`Size (Mb)`
## [1] 200
## 
## $`Line count`
## [1] 899288
## 
## $`Word count`
## [1] 38031339

File en_US.news.txt stat:

if (!file.exists("./output/en_US.news.txt.stat.rds")) {
    en_US.news.txt.stat <- GetTextFileStat("./input/en_US.news.txt")
    saveRDS(en_US.news.txt.stat, "./output/en_US.news.txt.stat.rds")
} else {
    en_US.news.txt.stat <- readRDS("./output/en_US.news.txt.stat.rds")
}
print(en_US.news.txt.stat)
## $`Size (Mb)`
## [1] 196
## 
## $`Line count`
## [1] 77259
## 
## $`Word count`
## [1] 2689560

File en_US.twitter.txt stat:

if (!file.exists("./output/en_US.twitter.txt.stat.rds")) {
    en_US.twitter.txt.stat <- GetTextFileStat("./input/en_US.twitter.txt")
    saveRDS(en_US.twitter.txt.stat, "./output/en_US.twitter.txt.stat.rds")
} else {
    en_US.twitter.txt.stat <- readRDS("./output/en_US.twitter.txt.stat.rds")
}
print(en_US.twitter.txt.stat)
## $`Size (Mb)`
## [1] 159
## 
## $`Line count`
## [1] 2360148
## 
## $`Word count`
## [1] 30193150

Cleaning and Sampling

The data should be cleaned of meaningless parts. For example, punctuation and stop words (is, are, the, a) are not relevant to the word prediction task. The following should be cleaned out: punctuation, numbers, stop words, profanity, URLs, email addresses, and account handles.
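
As a rough illustration, the cleaning step could look like the sketch below, built on the tm package. The actual CleanCorpus helper used later in this report may differ in its details, and the profanity list is left as a placeholder.

library(tm)

# A minimal sketch of a cleaning function (assumed shape of the CleanCorpus
# helper used below); the profanity word list is a placeholder.
CleanCorpus <- function(corpus, profanity=character(0)) {
    to.space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    corpus <- tm_map(corpus, to.space, "(f|ht)tps?://\\S+")  # URLs
    corpus <- tm_map(corpus, to.space, "\\S+@\\S+")          # email addresses
    corpus <- tm_map(corpus, to.space, "@\\w+")              # account handles
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    if (length(profanity) > 0L)
        corpus <- tm_map(corpus, removeWords, profanity)
    corpus <- tm_map(corpus, stripWhitespace)
    return(corpus)
}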

According to the file statistics, the data is too large to handle entirely in memory, so it should be sampled before any analysis.

Following the holdout method, the combined data should be split into two groups: training and test. More elaborate techniques such as k-fold cross-validation require much more time and memory, but to pave the way for them the data is split into small chunks while it is cleaned.

# PartitionCorpus is a project helper (defined outside this report) that walks
# the input files chunk by chunk and writes each cleaned chunk to out.dir.
chunk.line.count <- 8L*1024L
PartitionCorpus(in.dir="./input/", out.dir="./output/cleaned-corpus",
                chunk.reader=function(con) {
                    # Read one chunk of lines and clean it as a tm corpus
                    lines <- readLines(con, n=chunk.line.count,
                                       warn=F, skipNul=T)
                    corpus <- CleanCorpus(VCorpus(VectorSource(lines)))
                    return(corpus)
                },
                chunk.skiper=function(con) {
                    # Skip over one chunk without processing it
                    readLines(con, n=chunk.line.count,
                              warn=F, skipNul=T)
                })

N-gram Frequencies

An n-gram is a contiguous sequence of n items from a given sequence of text.

In our case, each item is a word.
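
For example, the RWeka tokenizer used below splits a sentence into 2-grams like this (a small illustration, not part of the processing pipeline):

library(RWeka)

# 2-grams of a short example sentence
NGramTokenizer("I went to the store", Weka_control(min=2L, max=2L))
# should yield: "I went" "went to" "to the" "the store"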

N-gram frequencies form the basis for predicting the next word from the words that precede it.

Let's compute a term-document matrix and an n-gram frequency table for each chunk, then combine a 5% sample of the chunks for further analysis.

# Tokenize each cleaned chunk into 1- to 4-grams (RWeka), build a term-document
# matrix (tm), and convert it to an n-gram frequency table (data.table).
# MapPartitions, ToNGramFreqTable, CombinePartitions and MergeNGramFreqTables
# are project helpers defined outside this report.
tokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1L, max=4L))
MapPartitions(f=function(corpus) TermDocumentMatrix(corpus, control=list(tokenize=tokenizer)),
              in.dir="./output/cleaned-corpus",
              out.dir="./output/term-doc-matrix")
MapPartitions(f=ToNGramFreqTable,
              in.dir="./output/term-doc-matrix",
              out.dir="./output/freq-table")
# Randomly pick roughly 5% of the chunk frequency tables and merge them into
# the training frequency table
ngram.freq.files <- list.files("./output/freq-table", pattern="\\.rds$", full.names=T)
set.seed(123L)
ngram.freq.files.index <- rbinom(length(ngram.freq.files), size=1L, prob=0.05) == 1L
sampled.ngram.freq.files <- ngram.freq.files[ngram.freq.files.index]
train.freq.table <- CombinePartitions(sampled.ngram.freq.files,
                                      reduce.f=MergeNGramFreqTables)
summary(train.freq.table)
##  first.words         last.word               n              freq         
##  Length:5325927     Length:5325927     Min.   :1.000   Min.   :    1.00  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:    1.00  
##  Mode  :character   Mode  :character   Median :3.000   Median :    1.00  
##                                        Mean   :3.006   Mean   :    1.56  
##                                        3rd Qu.:4.000   3rd Qu.:    1.00  
##                                        Max.   :4.000   Max.   :16462.00

Unigram Frequencies

A unigram is a single word.

library(ggplot2)

# Plot the top n-grams as a horizontal bar chart of their frequencies
DrawNGrams <- function(ngram.freq) {
    ordered.ngram.freq <- ngram.freq[, list(ngram=paste(first.words,
                                                        last.word),
                                            freq)]
    ordered.ngram.freq$ngram <- reorder(ordered.ngram.freq$ngram,
                                        ordered.ngram.freq$freq)
    ggplot(ordered.ngram.freq, aes(x=ngram, y=freq)) +
            geom_bar(stat="identity") + coord_flip() +
            xlab("N-gram") + ylab("Frequency")
}
top.word.count <- 30L  # number of top n-grams to display (value assumed here)
unigram.freq <- train.freq.table[n == 1L][order(-freq)]
DrawNGrams(head(unigram.freq, top.word.count))

hist(log2(unigram.freq$freq), main="Unigram Frequency Histogram", xlab="log2(Frequency)", ylab="")

A significant part of the dictionary consists of low-frequency words.
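
One way to quantify this is the share of dictionary words that occur only once in the training sample (a quick check; the resulting number is not reproduced here):

# Fraction of unigrams with frequency 1 in the sampled frequency table
mean(unigram.freq$freq == 1L)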

2-Gram Frequencies

gram2.freq <- train.freq.table[n == 2L][order(-freq)]
DrawNGrams(head(gram2.freq, top.word.count))

hist(log2(gram2.freq$freq), main="2-Gram Frequency Histogram", xlab="log2(Frequency)", ylab="")

As with the unigram histogram, there are a lot of rare terms. The 2-gram frequencies are roughly 10 times lower than the unigram frequencies, while the 2-gram dictionary is roughly 10 times larger.
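
These rough ratios can be checked directly against the sampled frequency table (a quick sanity check; outputs are not reproduced here):

# Dictionary-size and mean-frequency ratios between 2-grams and unigrams
nrow(gram2.freq) / nrow(unigram.freq)
mean(unigram.freq$freq) / mean(gram2.freq$freq)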

3-Gram Frequencies

gram3.freq <- train.freq.table[n == 3L][order(-freq)]
DrawNGrams(head(gram3.freq, top.word.count))

hist(log2(gram3.freq$freq), main="3-Gram Frequency Histogram", xlab="log2(Frequency)", ylab="")

4-Gram Frequencies

gram4.freq <- train.freq.table[n == 4L][order(-freq)]
DrawNGrams(head(gram4.freq, top.word.count))

hist(log2(gram4.freq$freq), main="4-Gram Frequency Histogram", xlab="log2(Frequency)", ylab="")

Coverage

Coverage is a useful metric for estimating the potential effectiveness of the prediction model: it measures what fraction of all word instances in the corpus is accounted for by a dictionary of a given size.

Corpora Coverage

How does dictionary size impact corpus coverage?

# Cumulative percentage of all word instances covered by the k most frequent words
dictionary.size.to.coverage <- cumsum(unigram.freq$freq) * 100 / sum(unigram.freq$freq)
plot(x=seq_along(dictionary.size.to.coverage),
     y=dictionary.size.to.coverage,
     type="l", main="Coverage", xlab="Dictionary Size (words)", ylab="Coverage (percent)")

The plot shows that coverage depends roughly logarithmically on dictionary size. To make better predictions, the dictionary may be reduced by removing very rare or occasional words and phrases.
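
The same cumulative coverage vector also answers question 3 from the introduction: how many frequency-sorted words are needed to reach a given coverage level. A small helper sketched here (outputs not reproduced):

# Number of frequency-sorted dictionary words needed for a target coverage
WordsForCoverage <- function(coverage.percent) {
    which(dictionary.size.to.coverage >= coverage.percent)[1]
}
WordsForCoverage(50)
WordsForCoverage(90)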

Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form - generally a written word form.

Stemming can help to increase coverage and reduce the dictionary size. But the result of a prediction should be an actual word, not a word stem, so each stem would have to be mapped back to a set of words. Such a complication should be justified by a clear gain in efficiency.
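
For illustration, stems can be produced with the SnowballC package (a small sketch; the choice of package is an assumption, since this report does not commit to a stemming implementation):

library(SnowballC)

# Both inflected forms reduce to the stem "run"
wordStem(c("running", "runs"), language="english")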

Language Coverage

The n-gram frequencies computed above are biased, so the coverage estimate is biased too. To estimate language coverage, external sources could be used. For example:

  • Microsoft Web N-gram
  • Google Ngram Viewer

Another way is to use statistical estimation techniques, for example Good-Turing.

Good–Turing frequency estimation is a statistical technique for estimating the probability of encountering an object of a hitherto unseen species, given a set of past observations of objects from different species.

In other words, the probability of unseen terms is estimated from the distribution of already observed frequencies (under a binomial distribution assumption).
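
The simple Good-Turing estimate of the unseen probability mass is P0 = N1 / N, where N1 is the number of terms seen exactly once and N is the total number of observed word instances. The goodTuring() call below is assumed to come from the edgeR package and returns this quantity as $P0; the same value can also be computed directly:

# Simple Good-Turing estimate of the unseen probability mass: P0 = N1 / N
n1 <- sum(unigram.freq$freq == 1L)   # number of unigrams seen exactly once
n.total <- sum(unigram.freq$freq)    # total number of observed word instances
n1 / n.total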

Probability of unseen unigram:

paste(round(goodTuring(unigram.freq$freq)$P0 * 100, 2), "%")
## [1] "2.97 %"

This probability is very low. Still, a rough estimate of overall language coverage is needed.

How many words are there in the English language?

The Second Edition of the 20-volume Oxford English Dictionary contains full entries for 171,476 words in current use…

The number of unique words in our sampled dictionary is 122477, so language coverage can be roughly estimated as about 1.4 times lower than corpus coverage (171476 / 122477 ≈ 1.4).
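
The corresponding arithmetic, using the OED figure quoted above and the dictionary size from the sample:

# Rough upper bound on language coverage: sampled dictionary vs. OED entries
oed.word.count <- 171476L
dictionary.word.count <- 122477L
dictionary.word.count / oed.word.count   # roughly 0.71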

Conclusions

  1. A significant part of the dictionary consists of very rare words, so the dictionary can be reduced.
  2. Corpus coverage depends roughly logarithmically on dictionary size.
  3. Even the most efficient model could cover only about 71% of the language.
  4. Stemming can help to increase coverage, but it is a more complicated technique.