1. Introduction

The goal of the Coursera Data Science Capstone Project is to create a Shiny application prototype of a Smart Keyboard Predictive Model that offers several options for what the next word might be.

The purpose of the present Milestone Report is to:

  1. Demonstrate that the Capstone Dataset has been successfully loaded.
  2. Report summary statistics about the Capstone Dataset.
  3. Describe the major features of the Capstone Dataset.
  4. Outline the plans for creating the predictive model.

2. Getting and Cleaning the Data

2.1. Data Downloading

The Capstone Dataset was downloaded on August 23, 2016 as a zip archive from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

2.2. Entire Data Summary

The dataset contains 4 collections of texts in English, German, Finnish, and Russian. Each collection includes 3 plain text files containing corpora gathered from blogs, news, and Twitter feeds.

Only the English collection (the files with the en_US prefix) is used throughout the present investigation and modeling.

The next table gives some information about the English part of the Capstone Dataset (the code chunk for this table is available in Appendix A):

| Corpus  | File Name         | File Size (MB) | Lines Count | Words Count (in ths.) |
|---------|-------------------|----------------|-------------|-----------------------|
| Blogs   | en_US.blogs.txt   | 200.4          | 899288      | 37546.2               |
| News    | en_US.news.txt    | 196.3          | 1010244     | 34762.4               |
| Twitter | en_US.twitter.txt | 159.4          | 2360148     | 30093.4               |

As the table shows, the three corpora are similar in file size and word count. The Blogs and News corpora also have a similar number of lines, while the Twitter corpus has more than twice as many lines as either of the other two. This is easily explained by the 140-character limit on Twitter posts.

2.3. Cleaning the Data

The Capstone Dataset is fairly large, so the subsequent analysis was performed on random 10% sub-samples of the original data.
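The sub-sampling itself is a one-liner; a minimal sketch is shown below. Appendix A performs the same sampling without fixing a random seed, so the seed value here is only an illustrative assumption added for reproducibility.

set.seed(20160823)   # illustrative seed, not used in the original analysis
lines <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
subSample <- sample(lines, round(0.1 * length(lines)))   # random 10% of the lines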

Dictionary of Profanity Words

Profanity is a class of lexicon that includes rude, vulgar, or otherwise offensive words and phrases which are inappropriate as suggestions in a predictive model because they can insult users. The collection of offensive words was composed of two publicly available lists (the files read in Appendix B).

Their combination gives a dictionary of 1008 unique profanity words and their spelling variations. The code chunk is available in Appendix B.

Cleaning Routines

The following cleaning decisions are speculative (see Appendix C for the code chunk); they can be confirmed or rejected at the stage of assessing the model’s performance:

  1. Transforming to lower case.

  2. Removing punctuation marks. On the one hand, punctuation marks are not needed in the predictive model because typing dots and commas is quick anyway. On the other hand, removing punctuation before building the n-gram models can introduce errors at sentence boundaries, because some n-grams will combine words from two different sentences (a sentence-splitting sketch that avoids this is shown after this list). Nevertheless, such n-gram tokens will not be frequent, and their presence in the model should not noticeably affect prediction quality.

  3. Removing numbers. In most cases numbers (dates, counts, phone numbers, etc.) are useless for word prediction because of their high variability within otherwise identical n-grams.

  4. Removing extra whitespace.

  5. Removing markup elements such as URLs, emails, and hashtags.
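To illustrate the alternative mentioned in item 2, the texts could be split into sentences before tokenization so that no n-gram crosses a sentence boundary. The sketch below uses a naive base-R splitter; the boundary regex is a simplification and is not part of the pipeline in Appendix C.

# naive sentence splitter: break on ".", "!" or "?" followed by whitespace
SplitSentences <- function(lines) {
    sentences <- unlist(strsplit(lines, "(?<=[.!?])\\s+", perl = TRUE))
    sentences[nchar(sentences) > 0]
}

SplitSentences("I like it. Do you? Yes!")
# [1] "I like it." "Do you?"    "Yes!"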

The following transformations were not applied at the present stage (these are, of course, also hypotheses which must be tested by a quality assessment of the final predictive model):

  1. Removing stop-words. In some tasks (such as information retrieval or text classification) stop-words carry little information and decrease model quality, so they are often removed from texts. In a text-prediction model, however, they are important: because of their high frequency in the texts, suggesting them can noticeably speed up typing.

  2. Stemming. Stemming is a method for grouping words with a similar basic meaning. It can significantly decrease the size of an n-gram model, but in the case of word prediction it can also reduce recall, since the model would suggest stems rather than full word forms (see the illustration below).
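For example, the Snowball stemmer (the SnowballC package, which tm's stemDocument relies on; loading it here is an assumption, since it is not among the packages required in Appendix A) maps several surface forms to a single stem, while a keyboard should suggest the full forms rather than the stem:

library(SnowballC)
# different surface forms collapse to a single stem
wordStem(c("running", "runs", "run"), language = "english")
# [1] "run" "run" "run"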

The data cleaning code chunk is available in Appendix C.

3. N-Gram Models and Exploratory Analysis

To answer questions about word distribution and to compare the frequencies of words, 2-grams, and 3-grams in the investigated corpora, 9 term-document matrices were built. The next table shows the number of distinct n-gram tokens in each matrix (the code chunk is in Appendix D):

| Corpus  | Words | 2-Grams | 3-Grams |
|---------|-------|---------|---------|
| Blogs   | 70280 | 666401  | 1347793 |
| News    | 69692 | 678318  | 1277994 |
| Twitter | 68529 | 525644  | 946438  |

The next plots show the Top-30 tokens from each of the 9 frequency-sorted dictionaries (Appendix E):

As we can see from the plots above, the frequency distributions of n-gram tokens in all corpora share similar properties: a small number of tokens account for a large share of all occurrences (the “beak” of the frequency-sorted dictionary), while most tokens form a long, rarely occurring “tail”.

The next tables show what share of the top of each frequency-sorted dictionary (the “beak” length, in %) is needed to cover a given fraction (Level) of all token occurrences (Appendix F):

Blogs

| Level | Words | 2-Grams | 3-Grams |
|-------|-------|---------|---------|
| 50%   | 0.3   | 3.9     | 35.5    |
| 90%   | 12.5  | 73.3    | 87.1    |
| 95%   | 27.7  | 86.6    | 93.6    |

News

| Level | Words | 2-Grams | 3-Grams |
|-------|-------|---------|---------|
| 50%   | 0.6   | 5.9     | 38.3    |
| 90%   | 14.6  | 76.0    | 87.7    |
| 95%   | 30.3  | 88.0    | 93.8    |

Twitter

| Level | Words | 2-Grams | 3-Grams |
|-------|-------|---------|---------|
| 50%   | 0.3   | 4.1     | 35.6    |
| 90%   | 11.0  | 74.6    | 87.1    |
| 95%   | 28.2  | 87.3    | 93.6    |

We can also see that as n increases, the “beak” of the frequency-sorted dictionary becomes longer and lighter while the “tail” becomes heavier. For example, the top 0.3% of the frequency-sorted dictionary of Blogs words covers 50% of all word occurrences (recall that stop-words were not filtered out at the cleaning stage). At the same time, to cover 95% of all 3-grams in the News corpus we would have to take 93.8% of the top elements of its frequency-sorted 3-gram dictionary.

4. Conclusion

For a fixed n, the distribution of terms at the top of the frequency-sorted dictionaries looks almost the same across corpora, except for several corpus-specific tokens (for example, “he said” in the News corpus, or “i love” and “thank you” in the Twitter corpus). This suggests that all 3 corpora can be merged and processed as a single corpus.

By changing the size of the frequency-sorted token dictionaries it is possible to trade corpus coverage against memory usage. At the same time, if we want to cover more n-grams, an efficient way of storing the dictionaries in memory has to be developed.
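As a hedged sketch of this trade-off (not part of the original analysis), a frequency table such as freqBlogs1Gram from Appendix D can be pruned to the smallest “beak” that still reaches a target coverage:

# keep only the smallest top part of the frequency-sorted dictionary
# that covers the requested share of all token occurrences
PruneToCoverage <- function(freqNGram, coverage = 0.9) {
    freqNGram <- freqNGram[order(-freqNGram$freq), ]
    covered <- cumsum(freqNGram$freq) / sum(freqNGram$freq)
    freqNGram[seq_len(which(covered >= coverage)[1]), ]
}

prunedWords <- PruneToCoverage(freqBlogs1Gram, coverage = 0.9)
nrow(prunedWords) / nrow(freqBlogs1Gram)   # about 0.125, in line with the Blogs table above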

5. Further Steps

To reach the goal of the Data Science Capstone Project it is necessary to:

  1. Combine the corpora of Blogs, News, and Twitter texts into one and split it into training and test datasets.
  2. Define quality criteria for the Smart Keyboard Predictive Model and a method for its evaluation.
  3. Build a time- and memory-efficient n-gram predictive model (n = 2, 3, 4) with maximum coverage on the training and test collections.
  4. Add a backoff model to estimate the probability of unobserved n-grams (a minimal sketch is given after this list).
  5. Develop and deploy a data product (a Shiny application plus a presentation) implementing the prototype of the Smart Keyboard Predictive Model.
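The sketch below illustrates item 4 in a “stupid backoff” style: if the last two words of the context have never been observed as a 3-gram prefix, the lookup falls back to 2-gram and then to unigram frequencies. The lookup tables (trigramFreq, bigramFreq, unigramFreq as named count vectors) are hypothetical placeholders, not objects built in the appendices, and regex metacharacters in the context are not escaped for simplicity.

# context: the text typed so far; *Freq: named numeric vectors of n-gram counts,
# e.g. trigramFreq["one of the"] = number of occurrences in the training corpus
PredictNext <- function(context, trigramFreq, bigramFreq, unigramFreq, k = 3) {
    TopWords <- function(hits) {
        hits <- sort(hits, decreasing = TRUE)[seq_len(min(k, length(hits)))]
        sub(".* ", "", names(hits))          # last word of each matching n-gram
    }
    words <- tail(strsplit(tolower(context), "\\s+")[[1]], 2)
    # 1) try 3-grams that start with the last two words of the context
    if (length(words) == 2) {
        hits <- trigramFreq[grep(paste0("^", paste(words, collapse = " "), " "),
                                 names(trigramFreq))]
        if (length(hits) > 0) return(TopWords(hits))
    }
    # 2) back off to 2-grams that start with the last word
    hits <- bigramFreq[grep(paste0("^", tail(words, 1), " "), names(bigramFreq))]
    if (length(hits) > 0) return(TopWords(hits))
    # 3) final fallback: the k most frequent single words
    names(sort(unigramFreq, decreasing = TRUE)[seq_len(min(k, length(unigramFreq)))])
}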

6. Appendix

Appendix A

require(knitr)
require(tm)
require(RWeka)
require(ggplot2)
require(plyr)
require(grid)
require(gridExtra)
require(stringi)
require(reshape2)
require(slam)

sample_size <- 0.1                      # fraction of each corpus used in the analysis
corpusList <- c("Blogs", "News", "Twitter")

# summary table filled in while the corpora are loaded
fileInfo <- data.frame(
    Corpus = corpusList,
    File = character(length = length(corpusList)),
    Size = numeric(length = length(corpusList)),
    Rows = numeric(length = length(corpusList)),
    Words = numeric(length = length(corpusList)),
    stringsAsFactors = FALSE
)

# reads one corpus file, records its summary statistics in fileInfo, and
# returns a tm VCorpus built from a random sub-sample of its lines
LoadCorpus <- function(corp) {
    fileName <- paste0('./final/en_US/en_US.', tolower(corp), '.txt')
    file <- file(fileName, open = "rb")
    data <- readLines(file, encoding = "UTF-8", skipNul = TRUE)
    close(file)
    fileInfo[fileInfo$Corpus == corp, "File"] <<- paste0('en_US.', tolower(corp), '.txt')
    fileInfo[fileInfo$Corpus == corp, "Size"] <<- round(file.info(fileName)$size / 1024^2, 1)
    fileInfo[fileInfo$Corpus == corp, "Rows"] <<- length(data)
    fileInfo[fileInfo$Corpus == corp, "Words"] <<- round(sum(stri_count_words(data)) / 1e3, 1)
    data <- sample(data, round(length(data) * sample_size))
    data <- iconv(data, from = "UTF-8", to = "ASCII", sub = "")
    return(VCorpus(VectorSource(data)))
}

for (corp in corpusList) {
    expr <- paste0("corpus", corp,
                   ' <- LoadCorpus("', corp, '")')
    eval(parse(text = expr))
    gc()
}

names(fileInfo) = c("Corpus", "File Name", "File Size (MB)",
                    "Lines Count", "Words Count (in ths.)")

kable(fileInfo, format = "markdown")

Appendix B

# list 1: CSV of terms to block; keep the data rows (they start with a comma),
# take the first non-empty field and drop a leftover punctuation character
profanityWords1 <- readLines("./final/profanity_words/Terms-to-Block.csv")
profanityWords1 <- profanityWords1[grep('^,.+', profanityWords1)]
profanityWords1 <- sub('^,"*(.*?),"*.*?$', '\\1', profanityWords1)
profanityWords1 <- sub("[[:punct:]]", "", profanityWords1)

# list 2: "word":count pairs; keep lines containing a colon,
# extract the quoted word and trim surrounding whitespace
profanityWords2 <- readLines("./final/profanity_words/google_twunter_lol")
profanityWords2 <- profanityWords2[grep('\\:', profanityWords2)]
profanityWords2 <- sub('^"*(.*?)"*\\:\\d+,.*?$', '\\1', profanityWords2)
profanityWords2 <- gsub('^\\s+|\\s+$', '', profanityWords2)

# combined dictionary of unique profanity words and spelling variations
profanityWords <- unique(c(profanityWords1, profanityWords2))

Appendix C

# cleans one corpus; markup (URLs, e-mails, hashtags) is removed before
# punctuation is stripped, otherwise those patterns could no longer match
CleanCorpus <- function(corp, profan) {
    removeRegExp <- content_transformer(function(x, pattern) gsub(pattern, "", x))
    corp <- tm_map(corp, PlainTextDocument)
    corp <- tm_map(corp, content_transformer(tolower))
    corp <- tm_map(corp, removeRegExp, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)")                # URLs
    corp <- tm_map(corp, removeRegExp, "[[:alnum:]]+\\@[[:alpha:]]+\\.com(\\.[a-z]{2})?")   # e-mails
    corp <- tm_map(corp, removeRegExp, "#\\w+")                                             # hashtags
    corp <- tm_map(corp, removePunctuation)
    corp <- tm_map(corp, removeNumbers)
    corp <- tm_map(corp, removeWords, profan)                                               # profanity filtering
    corp <- tm_map(corp, stripWhitespace)
    return(corp)
}

for (corp in corpusList) {
    expr <- paste0("corpus", corp,
                   " <- CleanCorpus(corpus", corp, ", ", "profanityWords)")
    eval(parse(text = expr))
    gc()
}

Appendix D

# builds a term-document matrix of n-gram tokens with the RWeka tokenizer
NGramTDMatrix <- function(corpus, n) {
    NGTokenizer <- function(x)
        NGramTokenizer(x, Weka_control(min = n, max = n))
    return(TermDocumentMatrix(corpus, control = list(tokenize = NGTokenizer)))
}
for (corp in corpusList) {
    for (n in 1:3) {
        expr <- paste0("matrix", corp, n, 
                       "Gram <- NGramTDMatrix(corpus", corp, ", ", n, ")")
        eval(parse(text = expr))
        gc()
    }
}
rm(corpusBlogs, corpusNews, corpusTwitter)

# collapses a term-document matrix into a token frequency table
NGramFreq <- function(tdm) {
    freq <- row_sums(tdm)
    return(data.frame(word = names(freq), freq = freq))
}
for (corp in corpusList) {
    for (n in 1:3) {
        expr <- paste0("freq", corp, n, 
                       "Gram <- NGramFreq(matrix", corp, n, "Gram)")
        eval(parse(text = expr))
        gc()
    }
}

tokensInfo <- expand.grid(
    Corpus = corpusList,
    nGram = 1:3,
    Amount = 0
)
for (corp in corpusList) {
    for (n in 1:3) {
        expr <- paste0("tokensNGram <- freq", corp, n, "Gram")
        eval(parse(text = expr))
        tokensInfo[tokensInfo$Corpus == corp &
                   tokensInfo$nGram == n, "Amount"] = sum(tokensNGram$freq > 0)
    }
}

tokensInfoWide <- dcast(tokensInfo, Corpus ~ nGram, value.var = "Amount")
kable(tokensInfoWide, format = "markdown",
      col.names = c("", "Words", "2-Grams", "3-Grams"))

Appendix E

# horizontal bar chart of the cnt most frequent tokens in one dictionary
NGramFreqPlot <- function(freqNG, title, cnt) {
    freqNG <- arrange(freqNG, desc(freq), word)
    freqNG <- head(freqNG, cnt)
    return(ggplot(freqNG, aes(x = reorder(word, freq), y = freq)) +
               geom_bar(stat = "identity") +
               ggtitle(title) +
               labs(x = "", y = "") +
               geom_text(aes(label = freq), position = "identity",
                         hjust = -0.3, size = 3) +
               scale_y_continuous(expand = c(0, 0),
                                  limits = c(0, 1.2 * max(freqNG$freq))) +
               coord_flip())
}

# arranges the Blogs, News, and Twitter plots side by side under a common title
NGramFreqGrid <- function(freqLeft, freqCenter, freqRight, cnt, n) {
    ggpLeft <- ggplot_gtable(ggplot_build(NGramFreqPlot(freqLeft, "Blogs", cnt)))
    ggpCenter <- ggplot_gtable(ggplot_build(NGramFreqPlot(freqCenter, "News", cnt)))
    ggpRight <- ggplot_gtable(ggplot_build(NGramFreqPlot(freqRight, "Twitter", cnt)))
    if (n == 1) {
        title <- paste0("Frequency of Top-", cnt, " Words")
    } else {
        title <- paste0("Frequency of Top-", cnt, " ", n, "-grams")
    }
    grid.arrange(ggpLeft, ggpCenter, ggpRight,
                 top = textGrob(title, gp = gpar(cex = 2)),
                 ncol = 3, widths = c(1/3, 1/3, 1/3))
}

NGramFreqGrid(freqBlogs1Gram, freqNews1Gram, freqTwitter1Gram, 30, 1)
NGramFreqGrid(freqBlogs2Gram, freqNews2Gram, freqTwitter2Gram, 30, 2)
NGramFreqGrid(freqBlogs3Gram, freqNews3Gram, freqTwitter3Gram, 30, 3)

Appendix F

# share (in %) of the top of each frequency-sorted dictionary needed to
# cover 50%, 90%, and 95% of all token occurrences
freqInfo <- expand.grid(
    Corpus = corpusList,
    nGram = 1:3,
    Level = c(50, 90, 95),
    Amount = 0
)

for (corp in corpusList) {
    for (n in 1:3) {
        expr <- paste0("freqNGram <- arrange(freq", corp, n, "Gram, desc(freq))")
        eval(parse(text = expr))
        # cumulative coverage and relative dictionary position of each token
        freqNGram$cumsum <- cumsum(freqNGram$freq) / sum(freqNGram$freq)
        freqNGram$amount <- seq_along(freqNGram$cumsum) / length(freqNGram$cumsum)
        for (level in c(50, 90, 95)) {
            freqInfo[freqInfo$Corpus == corp &
                     freqInfo$nGram == n &
                     freqInfo$Level == level, "Amount"] =
                round(100 * min(freqNGram[freqNGram$cumsum >= level / 100, "amount"]), 1)
        }
    }
}

# wide table of coverage levels for one corpus, rendered with kable
LevelTable <- function(corp) {
    freqInfoWide <- dcast(freqInfo[freqInfo$Corpus == corp, ],
                          paste0(Level, "%") ~ nGram, value.var = "Amount")
    kable(freqInfoWide, format = "markdown",
          col.names = c("Level", "Words", "2-grams", "3-grams"))
}

LevelTable("Blogs")
LevelTable("News")
LevelTable("Twitter")