Natural language processing has been a major area of research over the last 20 years. Though the promise is great, there have been few meaningful successes in the field. One that has affected everybody’s life is the completion of search queries: notably, Google seems to know what we want when we have typed only a fraction of the query we planned to. How is this done? What modelling needs to take place to make this efficient and quick? This report serves as an introduction to my efforts to tackle this problem.
We have been provided a corpus of language from three sources: blogs, news, and Twitter. Each of these files was downloaded in compressed form from CloudFront, then decompressed and combined for processing; that code is shown in the appendix. The data set is a large one. Importantly, after the initial processing was complete, RDS files were saved to ensure the processing time was not wasted.
Based on other work, the average length of an English-language word appears to be about 5.1 letters. Given that our corpus includes Twitter, famous for its 140-character limit, we would expect something a little lower than that, and we also need to allow for the spaces between words. Dividing the total character count accordingly suggests somewhere in the range of 88 million words; in fact, we get roughly 103 million. Overall, there are 4 million rows in the dataset that was processed. Code for the processing is also included in the appendix.
After initial processing, we have a data table in which each word is stored along with the word immediately before it and the word two positions back (if any). This was further processed into three data.tables, which the following graphs and discussion will cover. That processing is included in the appendix and is mostly concerned with computing word counts.
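To make that structure concrete, here is an illustrative fragment (not part of the processing pipeline itself) showing what the table looks like for a single short line, built the same way as in the appendix code:
library(data.table)
## a made-up five-word line, purely for illustration
w <- c("one", "of", "the", "best", "things")
data.table(word      = w,
           wordPrev  = c(NA, w[1:(length(w)-1)]),
           wordPrev2 = c(NA, NA, w[1:(length(w)-2)]))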
With these three processed data tables, the top ten entries account for only ~20%, ~2%, and ~0.2% of the totals for single words, two-word groups, and three-word groups respectively, roughly an order of magnitude apart.
All three are quite concentrated, though the three-word grouping is less so. To represent these, let’s look at the cumulative distribution functions of each, and at histograms of the twenty most common occurrences. The data are so large that, to graph them in a reasonable amount of time, they were distilled into something smaller (see the appendix for more information).
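For reference, plots of this kind can be produced from the distilled data along the following lines. This is only a sketch, not the exact plotting code behind the figures; it assumes the ggplot2 package and the graphVars.RDS list saved in the appendix.
library(data.table)
library(ggplot2)
gv <- readRDS('D:\\graphVars.RDS')
## cumulative share of occurrences versus rank, for single, double and triple word groups
ggplot(gv$graphFull, aes(x = rowNum, y = perCum, colour = variable)) +
  geom_line() +
  scale_x_log10() +
  labs(x = "rank of entry (most frequent first)", y = "cumulative share of occurrences")
## the twenty most common single words
ggplot(gv$single, aes(x = reorder(word, -N), y = N)) +
  geom_col() +
  labs(x = "word", y = "count")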
In each case there appears to be an inflection point beyond which every remaining entry occurs just once in the corpus (and the cumulative graph turns into a straight line).
There are 1,159,297 distinct single words (lowercase, with some punctuation that we haven’t stripped out) in our dataset. They are very concentrated: the top word, ‘the’, accounts for almost 4.5% of all word counts. The top single words are not surprising and are shown in the following histogram:
Note that our results here differ somewhat from what you would get by reducing the words to lexemes; you can see this by visiting Wikipedia’s entry on the most common words in English. That said, for our model we may choose to keep the word forms separate (ignoring lexemes) for simplicity.
There are 16,452,557 distinct two-word combinations in our dataset. They are less concentrated than the single words, which is not surprising given the large number of possible combinations. Indeed, the observed combinations are far fewer than the roughly 672 billion two-word combinations that could be generated from the single-word set, only about 0.00245% of them, meaning the vast majority of possible two-word combinations are very unlikely ever to occur. The top two-word groups are also not surprising (lots of combinations involving the most common words) and are shown in the following histogram:
There are 48,631,839 distinct three-word combinations in our dataset. They are even less concentrated than the two-word combinations, with about 80% of them showing up just once. Again, this is far less than the roughly 260 quadrillion three-word combinations that could be generated from the single-word set. The top three-word groups are less intuitive, but still not surprising, and are shown in the following histogram:
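For what it’s worth, those “possible combination” figures are consistent with treating the word groups as unordered; the quick check below reflects my assumption about how they were derived, not a statement from the original analysis.
## number of unordered combinations that can be drawn from 1,159,297 distinct words
choose(1159297, 2)                     # ~6.72e11, i.e. roughly 672 billion pairs
choose(1159297, 3)                     # ~2.60e17, i.e. roughly 260 quadrillion triples
## share of the possible two-word combinations actually observed in the corpus
16452557 / choose(1159297, 2) * 100    # ~0.00245 percent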
After having looked at the data, one key feature stands out: there is a lot of it. Figuring out how to make our model compact is therefore going to be an issue. A Shiny server does not have unlimited resources, and it is foolish to think that we can get good performance from an overly large model. That said, our initial data processing, though a bit brute-force, did process everything and puts us in a position to pare down deliberately where necessary.
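One possible way to pare down, offered here only as an illustration of the kind of step being considered, is to drop the n-grams that occur just once or twice, since they dominate the row counts while contributing little predictive value:
## illustrative pruning step (assumes the 'double' and 'triple' tables built in the appendix)
minCount <- 3                                  # hypothetical threshold, to be tuned
doubleSmall <- double[N >= minCount]
tripleSmall <- triple[N >= minCount]
nrow(tripleSmall) / nrow(triple)               # fraction of three-word rows retained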
The approach we will take is to build a model based on maximum-likelihood estimates over Markov chains of words. We will have to use smoothing (i.e., padding the frequency of low-count or unobserved word combinations) to account for combinations, or even single words, that we simply have not seen yet.
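As a rough illustration of the direction, the sketch below computes add-one-smoothed maximum-likelihood probabilities for the next word given the previous two, using the 'single', 'double' and 'triple' tables from the appendix. The function name predictNext and the choice of add-one smoothing are my own placeholders; the final model may well use a different smoothing scheme and back-off strategy.
library(data.table)
## split the stored "w1 w2 w3" key into the two-word prefix and the word to predict
triple[, c("prefix", "nextWord") := .(sub(" [^ ]+$", "", threeWords),
                                      sub("^.* ", "", threeWords))]
## vocabulary size, needed for add-one (Laplace) smoothing
V <- nrow(single)
## P(w3 | w1 w2) ~ (count(w1 w2 w3) + 1) / (count(w1 w2) + V)
triple <- merge(triple, double[, .(twoWords, prefixN = N)],
                by.x = "prefix", by.y = "twoWords", all.x = TRUE)
triple[, prob := (N + 1) / (prefixN + V)]
## given the last two words typed, return the most likely next word,
## falling back to the most common single word if the prefix is unseen
predictNext <- function(w1, w2) {
  cand <- triple[prefix == paste(w1, w2)][order(-prob)]
  if (nrow(cand) > 0) cand$nextWord[1] else single$word[1]
}
predictNext("one", "of")   # for example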
Finally, we should assess whether we should do more cleaning of the text. So far we have made a conscious choice not to, hoping that we can build a useful model that includes some of the real-world quirks that show up in Twitter feeds, smiley faces and all.
We have reviewed the initial work done to tackle the Coursera Data Science Capstone project. This work has demonstrated that we have loaded the data and manipulated it to get an idea of what is there and of where we need to go from here to complete the project successfully. More study, however, is necessary before the project can come to fruition.
The data was loaded from the three files and then each line was processed individually:
# Note that the data is already downloaded from an externally hosted website and unzipped
library(data.table)   # for data.table() and rbindlist()
library(parallel)     # for makeCluster(), detectCores(), parLapply(), used by mclapply.hack()
d1 <- readLines('D:\\Downloads\\CourseraData\\en_US\\en_US.blogs.txt')
d2 <- readLines('D:\\Downloads\\CourseraData\\en_US\\en_US.twitter.txt')
d3 <- readLines('D:\\Downloads\\CourseraData\\en_US\\en_US.news.txt')
# Combine all three files
d <- c(d1, d2, d3)
# eliminate variables to save memory
rm(d1, d2, d3)
# To summarize: estimate the total word count by assuming ~6.5 characters per word
# (including its trailing space), and record the total number of lines
totalWords <- round(sum(nchar(d))/6.5, 0)
totalRows <- length(d)
lines <- 1:length(d) # lines <- 1:10000 to test
words <- rbindlist(mclapply.hack(lines, function(line) {
  # separate out the line into separate words, using a simple breaking pattern
  words <- tolower(strsplit(d[line], "[ .,!-]{1,5}")[[1]])
  words <- words[nchar(words) > 0]
  words <- gsub("'", "", words)
  ## depending on the line length, return a data table
  if(length(words) > 2) {
    return(data.table(word = words,
                      wordPrev = c(NA, words[1:(length(words)-1)]),
                      wordPrev2 = c(NA, NA, words[1:(length(words)-2)])))
  } else if(length(words) > 1) {
    return(data.table(word = words,
                      wordPrev = c(NA, words[1]),
                      wordPrev2 = c(NA, NA)))
  } else {
    return(data.table(word = words,
                      wordPrev = NA,
                      wordPrev2 = NA))
  }
}))
## build the two- and three-word keys; mark them NA where there are not enough preceding words on the line
words[, twoWords := paste(wordPrev, word)]
words[is.na(wordPrev), twoWords := NA]
words[, threeWords := paste(wordPrev2, wordPrev, word)]
words[is.na(wordPrev) | is.na(wordPrev2), threeWords := NA]
words
## save the processed table so this step does not have to be repeated
saveRDS(words, 'D:\\words.RDS')
## count the occurrences of each single word, ordered from most to least frequent,
## with the cumulative share of all word occurrences
single <- words[, .(.N), by = word][order(-N)]
single[, cumCount := cumsum(N)]
single[, perCum := cumCount / sum(single$N)]
single[, rowNum := .I]
perc50 <- max(single[perCum >= 0.5]$N)   # count at the point where 50% cumulative coverage is reached
## the same for two-word groups
double <- words[!is.na(twoWords), .(.N), by = twoWords][order(-N)]
double[, cumCount := cumsum(N)]
double[, perCum := cumCount / sum(double$N)]
double[, rowNum := .I]
percd50 <- max(double[perCum >= 0.5]$N)
## and for three-word groups
triple <- words[!is.na(threeWords), .(.N), by = threeWords][order(-N)]
triple[, cumCount := cumsum(N)]
triple[, perCum := cumCount / sum(triple$N)]
triple[, rowNum := .I]
perct50 <- max(triple[perCum >= 0.5]$N)
## save the count tables so they can be reloaded later without reprocessing
saveRDS(single, 'D:\\single.RDS')
saveRDS(double, 'D:\\double.RDS')
saveRDS(triple, 'D:\\triple.RDS')
## reload the saved count tables
single <- data.table(readRDS('D:\\single.RDS'))
double <- data.table(readRDS('D:\\double.RDS'))
triple <- data.table(readRDS('D:\\triple.RDS'))
rowSingle <- nrow(single)
rowDouble <- nrow(double)
rowTriple <- nrow(triple)
## keep the top 20 rows plus one row per percentile, so the graphs stay small
vec.single <- c(1:20, (1:100)*round(rowSingle/100))
vec.double <- c(1:20, (1:100)*round(rowDouble/100))
vec.triple <- c(1:20, (1:100)*round(rowTriple/100))
dt.graphFull <- rbind(single[rowNum %in% vec.single, .(variable = 'single', perCum, rowNum)],
                      double[rowNum %in% vec.double, .(variable = 'double', perCum, rowNum)],
                      triple[rowNum %in% vec.triple, .(variable = 'triple', perCum, rowNum)])
## because knitr doesn't accept long vectors, must store intermediate output
saveRDS(list(rowSingle=rowSingle, rowDouble=rowDouble, rowTriple=rowTriple,
             graphFull=dt.graphFull, single=single[1:20],
             double=double[1:20], triple=triple[1:20]), 'D:\\graphVars.RDS')
Because there was a substantial amount of data to process and this work was being done on a Windows-based computer, a “hack” had to be put in place to take advantage of all of the available cores. The following code was used to accomplish that; attribution to the author is included in the comments.
## Define an mclapply-like function, written by Nathan VanHoudnos
## taken from https://www.r-bloggers.com/implementing-mclapply-on-windows-a-primer-on-embarrassingly-parallel-computation-on-multicore-systems-with-r/
## This will allow me to take advantage of 8 cores on my desktop computer
## and process the corpus in less time
mclapply.hack <- function(...) {
  ## Create a cluster
  ## ... How many workers do you need?
  ## ... N.B. list(...)[[1]] returns the first
  ##     argument passed to the function. In
  ##     this case it is the list to iterate over
  size.of.list <- length(list(...)[[1]])
  cl <- makeCluster( min(size.of.list, detectCores()) )

  ## Find out the names of the loaded packages
  loaded.package.names <- c(
    ## Base packages
    sessionInfo()$basePkgs,
    ## Additional packages
    names( sessionInfo()$otherPkgs ))

  ## N.B. tryCatch() allows us to properly shut down the
  ##      cluster if an error in our code halts execution
  ##      of the function. For details see: help(tryCatch)
  tryCatch( {
    ## Copy over all of the objects within scope to
    ## all clusters.
    ##
    ## The approach is as follows: Beginning with the
    ## current environment, copy over all objects within
    ## the environment to all clusters, and then repeat
    ## the process with the parent environment.
    ##
    this.env <- environment()
    while( identical( this.env, globalenv() ) == FALSE ) {
      clusterExport(cl,
                    ls(all.names=TRUE, env=this.env),
                    envir=this.env)
      this.env <- parent.env(environment())
    }
    ## repeat for the global environment
    clusterExport(cl,
                  ls(all.names=TRUE, env=globalenv()),
                  envir=globalenv())

    ## Load the libraries on all the clusters
    ## N.B. length(cl) returns the number of clusters
    parLapply( cl, 1:length(cl), function(xx){
      lapply(loaded.package.names, function(yy) {
        ## N.B. the character.only option of
        ##      require() allows you to give the
        ##      name of a package as a string.
        require(yy , character.only=TRUE)})
    })

    ## Run the lapply in parallel
    return( parLapply( cl, ...) )
  }, finally = {
    ## Stop the cluster
    stopCluster(cl)
  })
}