Introduction

With the rapid spread of mobile devices that increasingly serve as communication hubs, it becomes imperative to develop tools that allow users to write text quickly while on the move. Predictive text has existed since the early days of mobile phones, but it initially consisted of little more than spell-checking and word completion.
Modern computing power and techniques now allow a much cleverer approach that can predict not only letters but also the next word, given the first few words of a phrase.
The Capstone Project of the JHU Data Science Specialization on Coursera consists of writing such an application, starting from three large corpora of text taken from news, blogs and Twitter.
In this milestone document, I will report my initial findings about the nature and distribution of these data. Most of the work is done using the tm package by Ingo Feinerer & Kurt Hornik.

Data Load and Corpus Construction

We are provided with corpora in English, German, Finnish and Russian. However, in the context of this project, I am only going to use the English corpora. Within these, we have text collected from blogs, online news and Twitter. After loading the data, I look at some basic statistics for each file:

##                          Blogs     News  Twitter
## Num.lines               899288  1010242  2360148
## Max.num.char.per.line    40835    11384      213
## Num.words.per.file    44729614 41076715 37108157
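As a rough sketch, these figures are obtained with simple vector operations on the raw lines (the complete code is in the appendix); blogs.rawdata here stands for the character vector produced by readLines():

# Minimal sketch of the summary statistics for one file (see appendix for all three):
num.lines <- length(blogs.rawdata)                  # one element per line
max.chars <- max(nchar(blogs.rawdata))              # longest line, in characters
num.words <- length(unlist(strsplit(blogs.rawdata,  # crude word count: split on
                                    "([[:punct:]]|\\s)")))  # punctuation or whitespace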

In further steps, all three are combined into one large corpus. Given the size of this corpus, and at least for the exploratory and prototyping phases, I am going to use only a small random sample of the data. I use 1%, which will make computation times and memory requirements much more reasonable.
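A minimal sketch of the sampling step, mirroring the appendix code: each line of the combined corpus vector is kept with probability 0.01.

# Keep each line of the combined corpus with probability 1%:
set.seed(1234)
keep <- rbinom(length(sourceVect), size = 1, prob = 0.01) == 1
sourceSample <- sourceVect[keep]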

Pre-processing

We now have a complete corpus of texts. To make it more usable, some data manipulation is required. I apply the following transformations: conversion to lower case, removal of punctuation, stripping of extra whitespace and removal of numbers.
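A minimal sketch of these tm_map() steps, assuming corpus_raw is the sampled corpus built as in the appendix (the complete version, including the substitution of punctuation by spaces, is given there):

# Standard tm transformations applied to the sampled corpus:
corpus_prep <- tm_map(corpus_raw, content_transformer(tolower))  # lower-case
corpus_prep <- tm_map(corpus_prep, removePunctuation)            # drop punctuation
corpus_prep <- tm_map(corpus_prep, stripWhitespace)              # collapse whitespace
corpus_prep <- tm_map(corpus_prep, removeNumbers)                # drop digits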

Note: I decided against stemming words and removing stop words. These two steps are useful when trying to classify texts by subject for instance, but in the present case the task is to predict the next word in a phrase so that the phrase makes sense. In many cases, the next appropriate word will be a stop word. Even if it is not, we want the application to suggest a full word, not just a stem.

The last pre-processing step is to filter out profanity. It is assumed that most users would not expect such words to be suggested by the application, especially as children could be among the users. Any word from the corpus found in a list of 1,300 profane words is replaced with <PROF>.

Note: The list used includes words that can be used as profanity in certain contexts, but would be perfectly acceptable in others. This will result in a certain loss of accuracy in the model. In later phases of the project, we may have to review this list manually or find a less stringent one.
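The replacement itself is a word-boundary substitution applied to every entry of the profanity list; a minimal sketch of the logic (the tm_map() wrapper used on the corpus is in the appendix):

# Replace each listed word with the token <PROF>, using word boundaries so
# that partial matches inside longer words are left untouched.
replace_profanity <- function(x, bad_words, repl = "<PROF>") {
    for (w in bad_words) {
        x <- gsub(paste0("\\b", w, "\\b"), repl, x)
    }
    x
}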

Here is a sample of text from our corpus before pre-processing:

## [1] "Yo Bro... That Original was sweet!!!!! Your Falsetto is Money!!!"

And after pre-processing:

## [1] "yo bro that original was sweet your falsetto is money"

Word and N-gram distribution


In order to study the distribution of words in the corpus, we take advantage of the TermDocumentMatrix() function in the tm package in R, which parses the texts into words. Combined with a custom tokenizer, it can also count N-grams, so we also build digram and trigram matrices.
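For the digram and trigram matrices, a tokenizer based on ngrams() and words() from the NLP package (attached when tm is loaded) is passed in the control list; a minimal sketch for digrams, mirroring the appendix code:

# Tokenizer that glues consecutive word pairs together, then a digram TDM:
DiGramTokenizer <- function(text) {
    unlist(lapply(ngrams(words(text), 2), paste, collapse = " "),
           use.names = FALSE)
}
term_doc_di <- TermDocumentMatrix(corpus_prep,
                                  control = list(tokenize = DiGramTokenizer))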

The data summary for each shows that they are very sparse:

# Let's look at the TDMs:
term_doc_uni
## <<TermDocumentMatrix (terms: 55379, documents: 42821)>>
## Non-/sparse entries: 692251/2370691908
## Sparsity           : 100%
## Maximal term length: 57
## Weighting          : term frequency (tf)
term_doc_di
## <<TermDocumentMatrix (terms: 435962, documents: 42821)>>
## Non-/sparse entries: 947340/18667381462
## Sparsity           : 100%
## Maximal term length: 68
## Weighting          : term frequency (tf)
term_doc_tri
## <<TermDocumentMatrix (terms: 771694, documents: 42821)>>
## Non-/sparse entries: 919833/33043788941
## Sparsity           : 100%
## Maximal term length: 84
## Weighting          : term frequency (tf)

We need to reduce their sparsity in order to be able to process them on a personal computer, but the matrix content is very sensitive to the value of the threshold parameter: the number of distinct symbols (words / N-grams) drops very rapidly as the threshold is tightened. I therefore decided to remove terms that appear in fewer than five documents.
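In practice, removeSparseTerms() expresses this cut-off as a maximum allowed proportion of documents in which a term does not appear, so a minimum document count of five translates into the following threshold (mirroring the appendix code, shown here for the unigram matrix):

# Keep only terms present in at least 'min_docs' documents; sparser terms are dropped.
min_docs <- 5
term_doc_uni <- removeSparseTerms(term_doc_uni,
                                  sparse = 1 - min_docs / ncol(term_doc_uni))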

Let us look at the summaries again:

# Let's look at the TDMs:
term_doc_uni
## <<TermDocumentMatrix (terms: 10246, documents: 42821)>>
## Non-/sparse entries: 621339/438122627
## Sparsity           : 100%
## Maximal term length: 16
## Weighting          : term frequency (tf)
term_doc_di
## <<TermDocumentMatrix (terms: 18652, documents: 42821)>>
## Non-/sparse entries: 416973/798280319
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)
term_doc_tri
## <<TermDocumentMatrix (terms: 6317, documents: 42821)>>
## Non-/sparse entries: 78344/270421913
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

This pruning had the largest effect on the trigram matrix.

I then extract the counts of each symbol in each TDM, rank them in descending order and plot them on logarithmic x and y scales. I also print out the 20 most frequent symbols:

##    the    and    for   that <prof>    you   with    was   this   have 
##  48091  24216  11132  10707   9911   9461   7161   6448   5469   5332 
##    are    but    not   from   they    all   will    its   said    his 
##   4998   4826   4260   3802   3400   3322   3125   3112   3100   3089

##     of the     in the     to the    for the     on the      to be 
##       4461       4180       2123       2039       1895       1658 
##     at the    and the       in a   with the       is a the <prof> 
##       1482       1236       1216       1083       1064       1054 
##      for a     it was   from the      i was     i have      it is 
##        961        954        903        890        866        839 
##     with a      and i 
##        836        824

##     one of the       a lot of thanks for the    going to be        to be a 
##            351            294            254            182            178 
##       it was a     the end of    some of the      i want to     as well as 
##            155            153            151            147            145 
##     out of the     be able to        i don t    part of the       i have a 
##            141            137            137            135            133 
##  of the <prof>      i have to    a couple of    is going to      this is a 
##            123            114            110            110            110

As expected, “stopwords” represent the vast majority of the 20 most frequent words/N-grams. They do not carry much information but in the context of this application, are still required to form grammatically correct sentences.
The three models display a typical “Zipfian” distribution, where the log of each symbol’s count decreases roughly linearly with the log of its rank. This means that there are a few very frequent words or N-grams and a long tail of rare ones. We can also see the effect of the profanity filter: the string <PROF> appears in an artificially high number of documents because it stands in for numerous other words.
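A quick way to quantify this shape is to fit a straight line to the log-log rank/count relationship; a minimal sketch using the unigram counts computed in the appendix:

# On a log-log scale a Zipf-like distribution is close to a straight line
# with negative slope (around -1 for the classic form).
ranks <- seq_along(unigram_count)        # counts are already sorted, so index = rank
zipf_fit <- lm(log(unigram_count) ~ log(ranks))
coef(zipf_fit)["log(ranks)"]             # estimated slope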

Performance considerations

Even with a relatively limited sample from the corpus, some operations, in particular building the term-document matrices, are computationally expensive and nearly saturate the memory of an 8 GB PC running Windows 10.

In the final product, we will want to use as much data as possible to train the model, and this will require some technique for storing the data on disk rather than in memory. Speed of training is not critical except for the developer. The end user, however, will expect immediate responses from the application, so speed of execution is vital.
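One candidate, already loaded in the appendix, is the filehash package, which keeps a key/value store on disk and fetches entries on demand; a minimal sketch (the database name "ngram_db" is only illustrative):

# Store N-gram counts in a disk-backed key/value database so the full tables
# do not need to sit in memory.
library(filehash)
dbCreate("ngram_db")
db <- dbInit("ngram_db")
dbInsert(db, "of the", 4461)   # key: N-gram, value: its count (from the digram table above)
dbFetch(db, "of the")          # retrieve a single entry without loading the rest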

Future development

The next steps will be to develop a Markov chain model and train it to make accurate predictions. To achieve this, as explained above, we will need to develop ways of using the very large dataset that is provided, or as much of it as possible, keeping in mind that part of it will be reserved for validation and test sets. To ensure quick reaction times from the application, we are considering building hash tables with the N-grams as keys and a frequency-based score as values. This score might be a raw count, a smoothed frequency (to account for unseen words) or some other number.
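As a first sketch of that idea, using the hash package already loaded in the appendix (the key layout and the predict_next helper are illustrative; the counts are taken from the trigram table above):

# Illustrative lookup structure: the observed prefix is the key and the value
# is a named vector of candidate next words with their counts.
library(hash)
ngram_table <- hash()
ngram_table[["one of"]]   <- c(the = 351)   # count of "one of the" above
ngram_table[["going to"]] <- c(be  = 182)   # count of "going to be" above

predict_next <- function(prefix, table = ngram_table) {
    if (has.key(prefix, table)) {
        names(which.max(table[[prefix]]))   # most frequent continuation
    } else {
        NA_character_                       # unseen prefix: back-off to be added later
    }
}
predict_next("going to")   # returns "be"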


Appendix: Detailed Code

# Load required libraries
require(readr); require(slam)
require(rJava); require(openNLP)
require(tm); require(stringi)
require(filehash); require(SnowballC); require(hash)
if("Rgraphviz" %in% rownames(installed.packages()) == FALSE) {
    source("http://bioconductor.org/biocLite.R")
    biocLite("Rgraphviz")
}
require(Rgraphviz)
set.seed(1234)
blogs.path <- "final/en_US/en_US.blogs.txt"
twitter.path <- "final/en_US/en_US.twitter.txt"
news.path <- "final/en_US/en_US.news.txt"
blogs.con <- file(blogs.path, "rb")
news.con <- file(news.path, "rb")
twitter.con <- file(twitter.path, "rb")
blogs.rawdata <- iconv(readLines(blogs.con, encoding = "UTF-8", 
                                 skipNul = TRUE), "UTF-8", "ascii", sub = " ")
news.rawdata <- iconv(readLines(news.con, encoding = "UTF-8", 
                                skipNul = TRUE), "UTF-8", "ascii", sub = " ")
twitter.rawdata <- iconv(readLines(twitter.con, encoding = "UTF-8", 
                                   skipNul = TRUE), "UTF-8", "ascii", sub = " ")
sourceVect <- c(blogs.rawdata, news.rawdata, twitter.rawdata)
close(blogs.con); close(news.con); close(twitter.con)

# Basic statistics about these data:
rawfiles <- list(blogs.rawdata, news.rawdata, twitter.rawdata)
files_summary <- sapply(rawfiles, function(x) {
    x <-  unname(x)
    c1 <- length(x)
    c2 <- max(nchar(x))
    c3 <- length(unlist(strsplit(x, "([[:punct:]]|\\s)")))
    cbind(c1, c2, c3)
})

files_summary <- as.data.frame(files_summary, row.names = c("Num.lines", "Max.num.char.per.line", "Num.words.per.file"))
colnames(files_summary) <- c("Blogs", "News", "Twitter")
files_summary

# Take a random sample of the data and write it to file:
N <- length(sourceVect)
sourceSample <- sourceVect[rbinom(N, 1, 0.01) == 1]
writeLines(sourceSample, con  = "sample/text_samples.txt", sep = "\n")
rm(blogs.rawdata)
rm(news.rawdata)
rm(twitter.rawdata)
rm(sourceVect)

# Build corpus:
corpus_raw <- VCorpus(VectorSource(sourceSample), 
                    readerControl = list(language = "en"))
# Lower-case, remove punctuation, strip extra-whitespace, remove numbers:
corpus_prep <- tm_map(corpus_raw, content_transformer(tolower))
corpus_prep <- tm_map(corpus_prep, removePunctuation)
corpus_prep <- tm_map(corpus_prep, 
                      content_transformer(function(x) gsub("[[:punct:]]",
                                                           " ", x)))
corpus_prep <- tm_map(corpus_prep, stripWhitespace)
corpus_prep <- tm_map(corpus_prep, removeNumbers)

# Profanity filter:
replaceProfanity <- content_transformer(function(x, bad_vect, repl) {
    for (i in 1 : length(bad_vect)) {
        x <- gsub(paste0("\\b", bad_vect[i], "\\b"), repl, x)
    }
    return(x)
})

prof.con <- file("bad-words.txt")
profanity <- readLines(prof.con)
close(prof.con)
special_word <- "<PROF>"
corpus_prep <- tm_map(corpus_prep, FUN = replaceProfanity, 
                          bad_vect = profanity, repl = special_word)
rm(profanity)
# Pick the text located at 60% of the vector's length (no particular reason for choosing this value!):
corpus_raw[[round(length(corpus_raw) * 0.6)]]$content
# The same text after pre-processing:
corpus_prep[[round(length(corpus_raw) * 0.6)]]$content
# Some housekeeping to free up memory before computing the TDMs:
gc()

# Unigram Term-Document Matrix
term_doc_uni <- TermDocumentMatrix(corpus_prep)

# Digram tokenizer function:
DiGramTokenizer <- function(text) {
    unlist(lapply(ngrams(words(text), 2), paste, collapse = " "), use.names = FALSE)
}

term_doc_di <- TermDocumentMatrix(corpus_prep, 
                                  control = list(tokenize = DiGramTokenizer))

# Trigram tokenizer function:
TriGramTokenizer <- function(text) {
    unlist(lapply(ngrams(words(text), 3), paste, collapse = " "), use.names = FALSE)
}

term_doc_tri <- TermDocumentMatrix(corpus_prep, 
                                   control = list(tokenize = TriGramTokenizer))

# Let's look at the TDMs:
term_doc_uni
term_doc_di
term_doc_tri
term_doc_uni <- removeSparseTerms(term_doc_uni, sparse = 1. - 5./ncol(term_doc_uni))
term_doc_di <- removeSparseTerms(term_doc_di, sparse = 1. - 5./ncol(term_doc_di))
term_doc_tri <- removeSparseTerms(term_doc_tri, sparse = 1. - 5./ncol(term_doc_tri))
# Let's look at the TDMs:
term_doc_uni
term_doc_di
term_doc_tri
# Visualise word distributions:
unigram_count <- sort(row_sums(term_doc_uni, na.rm = TRUE), decreasing = TRUE)  
    # Note the use of slam::row_sums, as base::rowSums throws an error due to
    # the matrix size.
plot(unigram_count, log = "xy", main = "Unigrams ranked by count")
head(unigram_count, 20)

digram_count <- sort(row_sums(term_doc_di, na.rm = TRUE), decreasing = TRUE)
plot(digram_count, log = "xy", main = "Digrams ranked by count")
head(digram_count, 20)

trigram_count <- sort(row_sums(term_doc_tri, na.rm = TRUE), decreasing = TRUE)
plot(trigram_count, log = "xy", main = "Trigrams ranked by count")
head(trigram_count, 20)