With the rapid spread of mobile devices that increasingly serve as communication hubs, it becomes imperative to develop tools that allow users to write text quickly while on the move. Predictive text has been around since the early days of mobile phones, but it was initially limited to spellcheck and word completion applications.
Modern computing power and techniques now allow a much cleverer approach that can predict not only letters, but also the next word given the first couple of words in a phrase.
The Capstone Project of the JHU Data Science Specialization on Coursera consists of writing such an application, starting from three large corpora of text taken from news, blogs and Twitter.
In this milestone document, I will report my initial findings about the nature and distribution of these data. Most of the work is done using the tm package by Ingo Feinerer & Kurt Hornik.
We are provided with corpora in English, German, Finnish and Russian. However, in the context of this project, I am only going to use the English corpora. Within these, we have text collected from blogs, online news and Twitter. After loading the data, I look at some basic statistics for each corpus:
##                           Blogs     News  Twitter
## Num.lines                899288  1010242  2360148
## Max.num.char.per.line     40835    11384      213
## Num.words.per.file     44729614 41076715 37108157
In further steps, all three are combined into one large corpus. Given the size of this corpus, and at least for the exploratory and prototyping phases, I am going to use only a small random sample of the data. I use 1%, which will make computation times and memory requirements much more reasonable.
We now have a complete corpus of texts. In order to make it more usable, some data manipulation is required. I apply the following transformations: conversion to lower case, removal of punctuation (replaced with spaces), stripping of extra whitespace and removal of numbers.
Note: I decided against stemming words and removing stop words. These two steps are useful when trying to classify texts by subject for instance, but in the present case the task is to predict the next word in a phrase so that the phrase makes sense. In many cases, the next appropriate word will be a stop word. Even if it is not, we want the application to suggest a full word, not just a stem.
The last pre-processing step is to remove profanity words. It is assumed that most users would not expect these to be suggested by the application, especially as children could be among the users. Any word from the corpus found within a list of 1,300 profanity words is replaced with <PROF>.
Note: The list used includes words that can be used as profanity in certain contexts, but would be perfectly acceptable in others. This will result in a certain loss of accuracy in the model. In later phases of the project, we may have to review this list manually or find a less stringent one.
Here is a sample of text from our corpus before pre-processing:
## [1] "Yo Bro... That Original was sweet!!!!! Your Falsetto is Money!!!"
And after pre-processing:
## [1] "yo bro that original was sweet your falsetto is money"
Before building the term-document matrices, memory is freed up with gc():
##             used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   6599102 352.5    9968622  532.4   9968622  532.4
## Vcells  93763442 715.4  267520299 2041.1 266058899 2029.9
In order to study the distribution of words in the corpus, we take advantage of the TermDocumentMatrix() function of the tm package, which parses the texts into words. It also offers the possibility of parsing N-grams, so we also build digram and trigram matrices.
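To illustrate what the N-gram tokenizers in the appendix below produce, here is a minimal, self-contained sketch on a toy sentence (NLP::ngrams() is loaded automatically with tm):
library(NLP)                                        # provides ngrams()
toy_words <- strsplit("to be or not to be", " ")[[1]]
unlist(lapply(ngrams(toy_words, 2), paste, collapse = " "))
# -> "to be"  "be or"  "or not" "not to" "to be"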
The data summary for each shows that they are very sparse:
# Let's look at the TDMs:
term_doc_uni
## <<TermDocumentMatrix (terms: 55379, documents: 42821)>>
## Non-/sparse entries: 692251/2370691908
## Sparsity : 100%
## Maximal term length: 57
## Weighting : term frequency (tf)
term_doc_di
## <<TermDocumentMatrix (terms: 435962, documents: 42821)>>
## Non-/sparse entries: 947340/18667381462
## Sparsity : 100%
## Maximal term length: 68
## Weighting : term frequency (tf)
term_doc_tri
## <<TermDocumentMatrix (terms: 771694, documents: 42821)>>
## Non-/sparse entries: 919833/33043788941
## Sparsity : 100%
## Maximal term length: 84
## Weighting : term frequency (tf)
We need to reduce their sparsity in order to be able to process them on a personal computer, but the matrix content is very sensitive to the value of the threshold parameter: the number of distinct symbols (words / N-grams) drops very rapidly as the threshold increases. I therefore decided to remove terms that appear in fewer than five documents.
Let us look at the summaries again:
# Let's look at the TDMs:
term_doc_uni
## <<TermDocumentMatrix (terms: 10246, documents: 42821)>>
## Non-/sparse entries: 621339/438122627
## Sparsity : 100%
## Maximal term length: 16
## Weighting : term frequency (tf)
term_doc_di
## <<TermDocumentMatrix (terms: 18652, documents: 42821)>>
## Non-/sparse entries: 416973/798280319
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency (tf)
term_doc_tri
## <<TermDocumentMatrix (terms: 6317, documents: 42821)>>
## Non-/sparse entries: 78344/270421913
## Sparsity : 100%
## Maximal term length: 28
## Weighting : term frequency (tf)
This pruning had the greatest effect on the trigram matrix, whose vocabulary shrank from 771,694 to 6,317 terms (over 99% removed), compared with 435,962 to 18,652 for digrams and 55,379 to 10,246 for unigrams.
I then extract the counts of each symbol in each TDM, rank them in descending order and plot them on a logarithmic scale for both x and y. I also print out the 20 most frequent symbols:
## the and for that <prof> you with was this have
## 48091 24216 11132 10707 9911 9461 7161 6448 5469 5332
## are but not from they all will its said his
## 4998 4826 4260 3802 3400 3322 3125 3112 3100 3089
## of the in the to the for the on the to be
## 4461 4180 2123 2039 1895 1658
## at the and the in a with the is a the <prof>
## 1482 1236 1216 1083 1064 1054
## for a it was from the i was i have it is
## 961 954 903 890 866 839
## with a and i
## 836 824
## one of the a lot of thanks for the going to be to be a
## 351 294 254 182 178
## it was a the end of some of the i want to as well as
## 155 153 151 147 145
## out of the be able to i don t part of the i have a
## 141 137 137 135 133
## of the <prof> i have to a couple of is going to this is a
## 123 114 110 110 110
As expected, “stopwords” represent the vast majority of the 20 most frequent words/N-grams. They do not carry much information, but in the context of this application they are still required to form grammatically correct sentences.
The three models demonstrate a typical “Zipfian” distribution, where the log of each symbol’s count decreases roughly linearly with the log of its rank. This means that there are a few very frequent words or N-grams and a long tail of rare ones. We can also see the effect of the profanity filter: the string <PROF> appears in an artificially high number of documents because it stands in for numerous other words.
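As a rough check of the Zipfian shape (a minimal sketch only, reusing the unigram_count vector computed in the appendix below), one can regress log(count) on log(rank); a slope in the neighbourhood of -1 would correspond to a classic Zipf law:
zipf_df <- data.frame(rank = seq_along(unigram_count),
                      count = as.numeric(unigram_count))
coef(lm(log(count) ~ log(rank), data = zipf_df))   # slope near -1 => Zipf-like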
Even with a relatively limited sample from the corpus, some operations are computationally expensive and nearly saturate the memory of an 8GB PC running Windows 10 (in particular building the term-document matrices).
In the final product, we will want to use as much data as possible to train the model, and this will require some technique for storing the data on disk rather than in memory. Speed of training is not critical except for the developer. The end user, however, will expect immediate responses from the application, so speed of execution is vital.
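One option for the on-disk storage (a sketch only, with a placeholder database name, reusing the sourceSample vector from the appendix) is tm's PCorpus, which keeps the documents in a filehash database on disk instead of holding them all in RAM; the filehash package is already loaded in the appendix:
corpus_disk <- PCorpus(VectorSource(sourceSample),
                       readerControl = list(language = "en"),
                       dbControl = list(dbName = "corpus_db", dbType = "DB1"))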
The next steps will be to develop a Markov chain model and train it to make accurate predictions. To achieve this, as explained above, we will need to develop ways of using the very large dataset that is provided, or as much of it as possible – keeping in mind that part of it will be required as validation and test sets. To ensure quick reaction times from the application, we are considering building hash tables with the N-grams as keys and a frequency-based score as values. This score might be the raw count, a smoothed frequency (to account for unseen words) or some other statistic.
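As a deliberately naive sketch of that idea (hypothetical object names, reusing the digram_count vector and the hash package from the appendix), the key would be the first word of a digram and the value its most frequent observed continuation:
next_word <- hash::hash()
for (ng in names(sort(digram_count))) {   # least frequent first...
  parts <- strsplit(ng, " ")[[1]]
  next_word[[parts[1]]] <- parts[2]       # ...so the most frequent pair wins
}
next_word[["of"]]                          # most likely word following "of"
In the final model, the raw counts would be replaced by smoothed or back-off scores.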
# Load required libraries
require(readr); require(slam)
require(rJava); require(openNLP)
require(tm); require(stringi)
require(filehash); require(SnowballC); require(hash)
if("Rgraphviz" %in% rownames(installed.packages()) == FALSE) {
source("http://bioconductor.org/biocLite.R")
biocLite("Rgraphviz")
}
require(Rgraphviz)
set.seed(1234)
blogs.path <- "final/en_US/en_US.blogs.txt"
twitter.path <- "final/en_US/en_US.twitter.txt"
news.path <- "final/en_US/en_US.news.txt"
blogs.con <- file(blogs.path, "rb")
news.con <- file(news.path, "rb")
twitter.con <- file(twitter.path, "rb")
blogs.rawdata <- iconv(readLines(blogs.con, encoding = "UTF-8",
skipNul = TRUE), "UTF-8", "ascii", sub = " ")
news.rawdata <- iconv(readLines(news.con, encoding = "UTF-8",
skipNul = TRUE), "UTF-8", "ascii", sub = " ")
twitter.rawdata <- iconv(readLines(twitter.con, encoding = "UTF-8",
skipNul = TRUE), "UTF-8", "ascii", sub = " ")
sourceVect <- c(blogs.rawdata, news.rawdata, twitter.rawdata)
close(blogs.con); close(news.con); close(twitter.con)
# Basic statistics about these data:
rawfiles <- list(blogs.rawdata, news.rawdata, twitter.rawdata)
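# For each file, compute the number of lines, the length of the longest line
# (in characters) and a rough word count obtained by splitting on punctuation
# and whitespace: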
files_summary <- sapply(rawfiles, function(x) {
x <- unname(x)
c1 <- length(x)
c2 <- max(nchar(x))
c3 <- length(unlist(strsplit(x, "([[:punct:]]|\\s)")))
cbind(c1, c2, c3)
})
files_summary <- as.data.frame(files_summary, row.names = c("Num.lines", "Max.num.char.per.line", "Num.words.per.file"))
colnames(files_summary) <- c("Blogs", "News", "Twitter")
files_summary
# Take a random sample of the data and write it to file:
N <- length(sourceVect)
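# rbinom(N, 1, 0.01) keeps each line independently with probability 1%: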
sourceSample <- sourceVect[rbinom(N, 1, 0.01) == 1]
writeLines(sourceSample, con = "sample/text_samples.txt", sep = "\n")
rm(blogs.rawdata)
rm(news.rawdata)
rm(twitter.rawdata)
rm(sourceVect)
# Build corpus:
corpus_raw <- VCorpus(VectorSource(sourceSample),
readerControl = list(language = "en"))
# Lower-case, remove punctuation, strip extra-whitespace, remove numbers:
corpus_prep <- tm_map(corpus_raw, content_transformer(tolower))
corpus_prep <- tm_map(corpus_prep, removePunctuation)
corpus_prep <- tm_map(corpus_prep,
                      content_transformer(function(x) gsub("[[:punct:]]", " ", x)))
corpus_prep <- tm_map(corpus_prep, stripWhitespace)
corpus_prep <- tm_map(corpus_prep, removeNumbers)
# Profanity filter:
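# replaceProfanity() substitutes every word of bad_vect, matched on word
# boundaries, with the replacement token repl: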
replaceProfanity <- content_transformer(function(x, bad_vect, repl) {
for (i in 1 : length(bad_vect)) {
x <- gsub(paste0("\\b", bad_vect[i], "\\b"), repl, x)
}
return(x)
})
prof.con <- file("bad-words.txt")
profanity <- readLines(prof.con)
close(prof.con)
special_word <- "<PROF>"
corpus_prep <- tm_map(corpus_prep, FUN = replaceProfanity,
bad_vect = profanity, repl = special_word)
rm(profanity)
# Pick the text located at 60% of the vector's length (no particular reason for choosing this value!):
corpus_raw[[round(length(corpus_raw) * 0.6)]]$content
# The same text after pre-processing:
corpus_prep[[round(length(corpus_raw) * 0.6)]]$content
# Some housekeeping to free up memory before computing the TDMs:
gc()
# Unigram Term-Document Matrix
term_doc_uni <- TermDocumentMatrix(corpus_prep)
# Digram tokenizer function:
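# (ngrams() and words() are provided by the NLP package, loaded together with tm.)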
DiGramTokenizer <- function(text) {
unlist(lapply(ngrams(words(text), 2), paste, collapse = " "), use.names = FALSE)
}
term_doc_di <- TermDocumentMatrix(corpus_prep,
control = list(tokenize = DiGramTokenizer))
# Trigram tokenizer function:
TriGramTokenizer <- function(text) {
unlist(lapply(ngrams(words(text), 3), paste, collapse = " "), use.names = FALSE)
}
term_doc_tri <- TermDocumentMatrix(corpus_prep,
control = list(tokenize = TriGramTokenizer))
# Let's look at the TDMs:
term_doc_uni
term_doc_di
term_doc_tri
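# Keep only terms that appear in at least five documents (sparse threshold = 1 - 5/ncol):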
term_doc_uni <- removeSparseTerms(term_doc_uni, sparse = 1. - 5./ncol(term_doc_uni))
term_doc_di <- removeSparseTerms(term_doc_di, sparse = 1. - 5./ncol(term_doc_di))
term_doc_tri <- removeSparseTerms(term_doc_tri, sparse = 1. - 5./ncol(term_doc_tri))
# Let's look at the TDMs:
term_doc_uni
term_doc_di
term_doc_tri
# Visualise word distributions:
unigram_count <- sort(row_sums(term_doc_uni, na.rm = TRUE), decreasing = TRUE)
# Note the use of slam::row_sums, as base::rowSums throws an error due to the
# matrix size.
plot(unigram_count, log = "xy", main = "Unigrams ranked by count")
head(unigram_count, 20)
digram_count <- sort(row_sums(term_doc_di, na.rm = TRUE), decreasing = TRUE)
plot(digram_count, log = "xy", main = "Digrams ranked by count")
head(digram_count, 20)
trigram_count <- sort(row_sums(term_doc_tri, na.rm = TRUE), decreasing = TRUE)
plot(trigram_count, log = "xy", main = "Trigrams ranked by count")
head(trigram_count, 20)