This document summarizes work done to construct, test, and optimize a model for text prediction.
A body of sample texts consisting of ~4M documents, including tweets, news articles, and blog posts, is loaded and exploratory analysis performed. Sets of n-grams are extracted from the body of text, predictive algorithms are built, and various approaches for improving predictive accuracy are refined.
A cursory analysis of the dataset was presented in the [milestone report](https://github.com/pchuck/coursera-ds-capstone/blob/master/milestone.md).
This is the final capstone project for the Johns Hopkins data science specialization certification series. The corpus for the analysis is available at Capstone Dataset.
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
The English-language content is used for the analysis; the three document sets listed above are ingested.
Document-term matrices are created and n-grams ranging in length from 1 to 5 words are extracted for the purpose of analyzing word frequencies and other characteristics of the dataset. The multi-word n-grams, in particular, characterize the frequency of word clusters.
# text-mining and tokenization libraries
library(tm)    # corpus handling and document-term matrices
library(RWeka) # n-gram tokenizers
library(slam)  # sparse column sums (col_sums)

# sentence delimiters; prevent clustering across sentence boundaries
delimiters <- " \\t\\r\\n.!?,;\"()"
# n-gram tokenizers
BigramTokenizer    <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
TrigramTokenizer   <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
QuadgramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
PentagramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=5, max=5))
gthreshold <- 15    # threshold for number of n-grams to display graphically
options(mc.cores=1) # limit cores to prevent RWeka processing problems
# unigrams: term frequencies from the document-term matrix
ft.1 <- 10 # minimum document frequency
dtm.1 <- DocumentTermMatrix(filtered.sub.np, control=list(minDocFreq=ft.1))
freq.1 <- sort(colSums(as.matrix(dtm.1)), decreasing=TRUE)
nf.1 <- data.frame(word=names(freq.1), freq=freq.1)
plotGram(gthreshold, freq.1, nf.1, "Word")
# 2-grams: the same pattern with a lower frequency cutoff; repeated below for 3-, 4- and 5-grams
ft.2 <- 3
dtm.2 <- DocumentTermMatrix(filtered.sub, control=list(tokenize=BigramTokenizer, bounds=list(global=c(ft.2, Inf))))
freq.2 <- sort(col_sums(dtm.2, na.rm=T), decreasing=TRUE)
nf.2 <- data.frame(word=names(freq.2), freq=freq.2)
plotGram(gthreshold, freq.2, nf.2, "2-gram")
ft.3 <- 3
dtm.3 <- DocumentTermMatrix(filtered.sub, control=list(tokenize=TrigramTokenizer, bounds=list(global=c(ft.3, Inf))))
freq.3 <- sort(col_sums(dtm.3, na.rm=T), decreasing=TRUE)
nf.3 <- data.frame(word=names(freq.3), freq=freq.3)
plotGram(gthreshold, freq.3, nf.3, "3-gram")
ft.4 <- 2
dtm.4 <- DocumentTermMatrix(filtered.sub, control=list(tokenize=QuadgramTokenizer, bounds=list(global=c(ft.4, Inf))))
freq.4 <- sort(col_sums(dtm.4, na.rm=T), decreasing=TRUE)
nf.4 <- data.frame(word=names(freq.4), freq=freq.4)
plotGram(gthreshold, freq.4, nf.4, "4-gram")
ft.5 <- 2
dtm.5 <- DocumentTermMatrix(filtered.sub, control=list(tokenize=PentagramTokenizer, bounds=list(global=c(ft.5, Inf))))
freq.5 <- sort(col_sums(dtm.5, na.rm=T), decreasing=TRUE)
nf.5 <- data.frame(word=names(freq.5), freq=freq.5)
plotGram(gthreshold, freq.5, nf.5, "5-gram")
r <- 10 # frequency span for last-resort randomization
nf <- list("f1"=nf.1, "f2"=nf.2, "f3"=nf.3, "f4"=nf.4, "f5"=nf.5, "r"=r)
save(nf, file="data/nFreq.Rda") # save the ngram frequencies to disk
Generating the most common n-grams from even a subset (200K documents) of the full corpus can take several hours. Here, the frequency tables generated and saved in a previous session are loaded from disk:
# frequencies saved from a 200,000-document subset (cutoffs 10/3/3/2/2, matching ft.1-ft.5 above)
load("data/nFreq-200000-10-3-3-2-2.Rda")
# return the number of entries with frequency exceeding count
countAboveFrequency <- function(nf, count) {
    nrow(nf[nf$freq > count, ])
}
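For example, the loaded frequency tables can be queried to see how many distinct n-grams survive a given cutoff (the cutoff values below are purely illustrative):

countAboveFrequency(nf$f1, 100) # unigrams occurring more than 100 times
countAboveFrequency(nf$f3, 10)  # 3-grams occurring more than 10 times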
A word cloud can be used to show the most frequently occurring words and {2, 3, 4, 5}-grams.
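A minimal sketch of such a plot, assuming the wordcloud and RColorBrewer packages (the word limit and palette are arbitrary choices):

library(wordcloud)
library(RColorBrewer)
# cloud of the most frequent single words; the same call works for nf$f2 through nf$f5
wordcloud(words=as.character(nf$f1$word), freq=nf$f1$freq, max.words=100,
          random.order=FALSE, colors=brewer.pal(8, "Dark2"))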
Here are some simple tests to verify sane predictions for n-gram input phrases. The last word in each phrase is provided by the prediction function.
The following algorithm is applied for next-word prediction:
The following preprocessing steps were applied to create a set of n-grams that could be traversed in the search for a match with the input phrase.
The algorithm depends on the existence of a set of n-grams large enough to contain a good sampling of word combinations, but small enough to be searched in a fraction of a second. The following optimizations were tested in pursuit of a reasonable balance between accuracy and prediction speed. For each combination, accuracy, execution time and dataset size were recorded.
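For illustration only, a minimal sketch of a backoff-style lookup over the saved frequency tables follows; the function name (predictWord), the prefix matching, and the ranking are assumptions rather than the deployed implementation, and the last-resort randomization span (r) is omitted for simplicity.

# illustrative backoff lookup over the nf frequency tables (assumed, simplified)
# match the last 4, 3, 2, then 1 words of the phrase against the 5-, 4-, 3-
# and 2-gram tables; fall back to the most frequent single words
predictWord <- function(nf, phrase, n=5) {
    words  <- unlist(strsplit(tolower(phrase), "\\s+"))
    tables <- list(nf$f5, nf$f4, nf$f3, nf$f2)
    for (i in seq_along(tables)) {
        k <- 5 - i # number of context words used with this table
        if (length(words) < k) next
        prefix <- paste(tail(words, k), collapse=" ")
        tab    <- tables[[i]]
        hits   <- tab[grepl(paste0("^", prefix, " "), tab$word), ]
        if (nrow(hits) > 0) {
            hits <- hits[order(-hits$freq), ]
            return(head(sub(".* ", "", as.character(hits$word)), n)) # predicted last words
        }
    }
    as.character(head(nf$f1$word, n)) # last resort: most frequent unigrams
}

A call such as predictWord(nf, "thanks for the") returns up to five candidate words, ranked by frequency.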
Through successive iterations of exploratory analysis, refinement and testing, predictive accuracy was improved from 8% to 15%, while maintaining a response time suitable for interactive use (<300 ms) and producing a compressed, optimized dataset under 10MB in size.
As a final accuracy test, 1000 random phrases of varying length were extracted from the testing text set and the last word of each sequence excluded. The word prediction model was then invoked on each test phrase and the predicted word compared to the actual (excluded) word from the phrase.
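A sketch of that evaluation loop, assuming a character vector test.phrases holding the held-out phrases and the predictWord sketch above (both names are assumptions):

# illustrative accuracy measurement over held-out phrases
evaluateAccuracy <- function(nf, test.phrases, n=5) {
    words   <- strsplit(tolower(test.phrases), "\\s+")
    actual  <- sapply(words, tail, 1)                          # excluded last word
    context <- sapply(words, function(w) paste(head(w, -1), collapse=" "))
    preds   <- lapply(context, function(p) predictWord(nf, p, n))
    top1    <- mean(mapply(function(p, a) identical(p[1], a), preds, actual))
    topn    <- mean(mapply(function(p, a) a %in% p, preds, actual))
    c(top1=top1, topn=topn)
}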
The measured accuracy of the model (using only the top-ranked response) is 14.19%.
The measured accuracy of the model (using the top 5 ranked responses) is 21.63%.
The average speed of the algorithm is 238.0 ms per word prediction.
A [text-predictor](http://pchuck.shinyapps.io/text-predictor) application was developed to allow users to interact with the prediction algorithm. The corpus preprocessing code and algorithms are linked below.