This report documents the first peer-reviewed checkpoint for my Data Science capstone project, part of the Coursera certification.
The project is to design and build a predictive typing web application. Large datasets representative of blogs, news, and Twitter feeds have been provided. The application will accept free-form text, a keystroke at a time, and attempt to complete the current word-in-progress, or guess the next word that the writer intends.
The most important finding of this exploratory analysis is that, while a large volume of data was provided on which to base any predictive model or algorithm, the N-gram profiles of the different datasets are very similar, even for relatively small subsets of the data. For example, the profiles computed from 0.1% of the data were almost identical to the profiles computed from 10 times that volume (1%). Only one run - using 1.0% of the data - is presented below, but this finding should save substantial compute time during future development.
For example, profiling 0.1% of the data took about 1 minute of compute time, and 1% took 8 minutes. Other timing data is shown below.
plot(c(0.1,0.5,1.0,1.5,2.0),c(1,3,8,14.5,27),type="l",
     xlab="Percent of Data Analyzed",
     ylab="Compute Time (minutes)",
     main="Computational Complexity of the Exploratory Analysis")
Since the time complexity of the algorithm appears to be worse than linear, it may not be suitable for much larger portions of the data. Fortunately, analyzing more data may not be necessary.
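As a rough check of that growth rate, the same five timing points used in the plot above can be fit with a log-log linear model; the slope estimates the exponent of an assumed power-law relationship, with a value above 1 indicating super-linear growth. This is only a sketch over a handful of hand-recorded timings.

pct  <- c(0.1, 0.5, 1.0, 1.5, 2.0)   # percent of data analyzed
mins <- c(1, 3, 8, 14.5, 27)         # observed compute time (minutes)
fit  <- lm(log(mins) ~ log(pct))     # power-law fit: time ~ pct^b
coef(fit)[["log(pct)"]]              # slope b, the estimated exponent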
In each run, regardless of how much of the data was analyzed, only about 1,000 unique words accounted for almost 75% of the words actually used in the writings. The most common bi- and trigrams shuffled up and down in rank, but were also remarkably consistent.
Qualitatively, in all cases, only about 1,000 of the more than 30,000 unique words account for nearly 75% of the words used by the writers. This suggests that the prediction algorithm may perform adequately even if it only tries to guess the word following these 1,000 most frequently used words.
It is interesting that the blogs and news datasets share very similar lists of most commonly used words, while tweets most frequently use a somewhat different set of words.
This report is composed of the following sections:
During this exploratory phase of the project, I am computing and analyzing the relative frequencies of words (unigrams), bigrams, and trigrams of each dataset. I am also building up a library of processing functions. They are included in the body of this document for now, but will likely be moved to a library in future development phases.
My plan is to build a Shiny app, which may have a mode/style selection box indicating whether the user is typing a blog/news entry or a tweet. A long text field will serve as the input buffer. As the user types, between 3 and 5 suggestion fields/buttons will display the most likely next-word choices. Clicking a choice will copy the suggestion into the input buffer and allow the user to continue typing.
I will probably include a hidable diagnostic panel in the app to keep track of performance metrics, like how many words have been typed, how many times a word was selected (before the next space or punctuation), how many times selected words were the first/second/third/etc. choice, and how many keystrokes were saved by picking a suggestion.
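A minimal Shiny sketch of the layout described above follows; all widget IDs, labels, and the fixed number of suggestion buttons are placeholders, and the prediction and diagnostic logic is deferred to a later phase.

library(shiny)

ui <- fluidPage(
  selectInput("mode", "Writing style", choices = c("blog/news", "tweet")),
  textInput("buffer", "Input buffer", value = ""),
  # The server would relabel these buttons on each keystroke and append the
  # chosen word to the input buffer when one is clicked.
  actionButton("sug1", "(suggestion 1)"),
  actionButton("sug2", "(suggestion 2)"),
  actionButton("sug3", "(suggestion 3)"),
  checkboxInput("showDiag", "Show diagnostics", value = FALSE),
  conditionalPanel("input.showDiag == true", verbatimTextOutput("diagnostics"))
)

server <- function(input, output, session) {
  output$diagnostics <- renderText("words typed: 0; keystrokes saved: 0")
}

# shinyApp(ui, server)   # not run in this report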
First, we are trying to predict, based on limited prior-word context and partial words, which (correctly spelled) word the user is attempting to type next. Therefore, it makes sense to build a dictionary with a significant subset of correct words, chosen from the training data. The dictionary should also include the counts and/or relative frequencies of individual words, and word 2-grams, 3-grams, etc.
The training data comes in 3 types: blogs, news, and tweets. It seems intuitively likely that users prefer different syntax, terms, and grammar in each case. For example, a news story should be relatively formal, a blog may take the form of an essay, and a tweet, being length-limited, may be more colloquial and abbreviated. Because the syntax and semantics may vary with each style, it may make sense to support several modes (classifiers) to get better performance for different styles of writing. At a minimum, the three datasets should be explored separately to compare their characteristics.
There are a lot of options for pre-processing and normalizing the text, but it seems that some operations are more appropriate than others. Possible training data pre-processing operations include:
Remove punctuation - Since punctuation is very dependent on grammar, which is out of scope for this project, I will start by ignoring all punctuation during model building. I may add processing of periods back in later, as they may have meaning with respect to predicting what comes next.
Remove numbers - Good idea.
Convert to lower case - Good idea.
Remove low-value words - In this case, we are trying to guess the next word desired. We are not trying to glean the meaning of what is being typed or improve the writer’s grammar/style. If a user might wish to type “the” or “and”, such words should not be removed from the training set.
Stem words (remove common prefixes and suffixes) - Again, since we are not analyzing the text for meaning, but are trying to match as many words as possible, it seems that the training sets should not be stemmed.
Strip extraneous white space - Good idea, if only to prune memory use and improve efficiency.
Correct misspelled words - Words could be "misspelled" for a number of reasons: a mistake by the original writer, or non-English characters. If there were a dictionary of legitimate words, it would be good to filter out training words that do not appear in it. However, I am not aware of such a dictionary, so I will probably just treat misspellings as legitimate words.
Remove foreign words and phrases - Same discussion as for misspelled words applies here too.
Remove profanity - Since users may wish to use naughty words, I am not inclined to try to judge naughtiness. The same discussion as for low-value, misspelled, and foreign words applies.
The modeling process is very compute-intensive, but it does run on my computer. For now, I am using basic R functions that, I am sure, do not make for an efficient analysis algorithm. Building the model does not have critical time requirements, but after the model is built, the app will have to be responsive to typing. With each keystroke, the algorithm will have to compute the 3-5 best suggestions quickly enough that typing is not impeded, and it must also allow for the network latency of each keystroke.
I have a graph-based, deterministic, algorithm in mind that should be very fast to look up suggestions. The trick will be in the data structures. At this time, I don’t know R well enough to build the structures I need for that algorithm. So, I’m going to have to investigate. Hopefully, I’ll either come across another workable algorithm or some good data structure examples written in R.
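As a placeholder until that investigation is done, here is a rough sketch (not the graph algorithm referred to above) of one possible R structure: an environment used as a hash table that maps a one-word prefix to its most frequent next words, populated from a bigram count table like the one built by the count.ngrams routine defined later in this report.

suggestions <- new.env(hash = TRUE)   # hash table: prefix -> next-word choices

build.lookup <- function(bigram.counts, max.suggestions = 5) {
  # 'bigram.counts' is assumed to be a named count table sorted by decreasing
  # frequency, as returned by count.ngrams(2,...). Because it is sorted, the
  # first few next-words stored for each prefix are also the most frequent.
  parts <- strsplit(names(bigram.counts), " ", fixed = TRUE)
  for (p in parts) {
    prefix <- p[1]
    current <- if (exists(prefix, envir = suggestions, inherits = FALSE))
      get(prefix, envir = suggestions) else character()
    if (length(current) < max.suggestions) {
      assign(prefix, c(current, p[2]), envir = suggestions)
    }
  }
}

suggest <- function(prefix) {   # near-constant-time lookup at each word boundary
  if (exists(prefix, envir = suggestions, inherits = FALSE))
    get(prefix, envir = suggestions) else character()
}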
Still, when I first tried to process the full datasets, I found that the algorithms below ran for 24 hours without producing results. Wondering how important it is to use the full datasets, I started experimenting with a fraction of each file. I found that the N-gram analyses produced nearly the same results, even with as little as 0.1 to 1.0 percent of the full data files, and the building process completed in a reasonable amount of time. More analysis will be required to determine the minimum amount of data that should be analyzed in order to produce a very effective prediction algorithm.
These are still the early phases of planning, but I am reading many articles and may revisit the current work as warranted.
Initialize the computing environment. Load libraries and establish folder/file structure.
library(R.utils)
library(NLP)
library(tm)
setwd("~/Academic/DataScience/Capstone/expl") # Where is R running?
datafolder <- "../data/en_US" # Where is the data?
timings <- FALSE
set.seed(8675309) # Set randomization so result is repeatable.
data.fraction <- 0.01 # What fraction of the test data should be processed?
Below are several library routines (that will be moved in a later phase of the project) for processing the data described throughout this document.
Note: The following algorithms were first tested on a tiny, 7-line data file. Those results also appear at the end of this report in Appendix B.
Scan a data file and count the number of lines, words, and characters.
basic.data.stats <- function(filename,dir="../data/en_US") {
  pathname <- paste(dir,filename,sep="/")
  data <- readLines(pathname,warn=FALSE,skipNul=TRUE)
  stats <- c(length(data),sum(sapply(strsplit(data,"\\s+"),length)),sum(sapply(data,nchar)))
  data.frame(lines=stats[1],words=stats[2],characters=stats[3],row.names=filename)
}
Same input/output as readLines, but returns only a fraction (p) of the lines in the file.
readSamples <- function(path,p=0.5) {
  sample <- vector()
  line <- character()
  f <- file(path,"rb")
  while(TRUE) {
    line <- readLines(f,n=1,warn=FALSE,skipNul=TRUE)
    if (length(line) > 0) {
      if (runif(1) <= p) {          # Randomly include/discard this line?
        sample <- c(sample,line)    # Concatenate. (Is this efficient?)
      }
    }
    else {
      break
    }
  }
  close(f)
  sample
}
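As the inline comment hints, growing the sample vector one line at a time can become expensive for very large files. A simpler alternative, assuming the whole file fits in memory (which basic.data.stats above already assumes), would be to read all lines once and keep a random subset:

readSamples2 <- function(path,p=0.5) {   # hypothetical alternative; not used below
  data <- readLines(path,warn=FALSE,skipNul=TRUE)
  data[runif(length(data)) <= p]         # keep each line with probability p
}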
Basic clean-up as discussed above.
scrub.data.file <- function(path) {                 # perform basic scrubbing per discussion above
  data <- readSamples(path,data.fraction)           # read raw text; won't scale for large files
  data <- removePunctuation(data)                   # remove punctuation
  data <- removeNumbers(data)                       # remove numbers
  data <- tolower(data)                             # lower case
  print(paste0(path," - ",toString(length(data))," lines, ",
               toString(length(unlist(strsplit(data,"\\s+"))))," words"))
  # data <- removeWords(data,stopwords("english"))  # remove English words that have no analytical value
  # data <- removeWords(data,blacklist.words)       # remove profanity or other words that don't belong
  # data <- stemDocument(data)                      # stem words (remove common suffixes and prefixes)
  # data <- stripWhitespace(data)                   # remove whitespace characters (done later, not necessary)
  unlist(data)                                      # return simple vector
}
Convert a vector of words to a vector of N-grams (represented as non-unique strings).
gramify <- function(n,words) {              # convert a vector of words into a vector of N-grams
  if ((1 < n) && (n <= length(words))) {    # N must be between 2 and the length of the input vector
    grams <- vector("character",length(words) - (n-1))
    for (i in 1:(length(grams))) {
      gram <- words[i]
      for (j in 1:(n-1)) {
        gram <- paste(gram,words[i+j],sep=" ")
      }
      grams[i] <- gram
    }
    grams
  }
}
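For example, a quick sanity check of gramify on a four-word vector:

gramify(2, c("the","quick","brown","fox"))
# expected: "the quick" "quick brown" "brown fox"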
Convert a list of documents into a vector of unique N-grams with occurrence counts.
count.ngrams <- function(n,docs) {
  begin.time <- proc.time()
  grams <- vector("character")
  for (i in 1:length(docs)) {   # Build entire list of n-grams, including duplicates
    g <- gramify(n,docs[[i]])
    grams <- c(grams,g)         # Efficient?
  }
  result <- sort(table(sort(grams)),decreasing=TRUE)
  elapsed.time <- proc.time() - begin.time
  if (timings) {
    print(paste0("count.ngrams(",toString(n),"): "))
    print(elapsed.time)
  }
  result
}
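Similarly, a small check of count.ngrams on a single toy document:

count.ngrams(2, list(c("to","be","or","not","to","be")))
# expected: "to be" counted twice; "be or", "or not", and "not to" once each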
Plot counts of the uni-/bi-/tri-grams.
hist.buckets <- 25 # Width of histograms
plot.ngrams <- function (f,g1,g2,g3) {
  par(mfrow=c(2,2),mai=c(0.8466,0.6806,0.6806,0.3486))
  plot(g1,main="Unigram (Word) Counts",las=2,cex.axis=0.75,xlim=range(1:hist.buckets),ylab="Count")
  num.words <- length(g1)             # Number of unique words encountered
  plot.data <- cumsum(g1)             # Cumulative counts of words
  n <- plot.data[length(plot.data)]   # Total number of (non-unique) words
  plot(plot.data,main="Cumulative Word Count",las=2,ylab="Count",ylim=c(1,n),xlab="",xaxt='n')
  abline(h=n * .25,lty=2)
  abline(h=n * .50)
  abline(h=n * .75,lty=2)
  if (num.words > 1000) {
    abline(v=1000,lty=3)
  }
  if (num.words <= 50) {
    axis(1,las=2,labels=names(g1)[1:num.words],at=(1:num.words))
  }
  else {
    axis(1,las=2)
  }
  plot(g2,main="Bigram Counts",las=2,cex.axis=0.75,xlim=range(1:hist.buckets),ylab="Count")
  plot(g3,main="Trigram Counts",las=2,cex.axis=0.75,xlim=range(1:hist.buckets),ylab="Count")
}
Read a data file, compute and plot the N-Grams.
analyze.ngrams <- function(filename) {
  begin.time <- proc.time()
  f <- paste(datafolder,filename,sep="/")   # Read file into memory. Won't scale.
  data <- scrub.data.file(f)                # Basic cleanup and normalization
  docs <- strsplit(data, "\\s+")
  unigrams <- unlist(docs)                  # Calculate N-grams
  unigrams.count <- sort(table(sort(unigrams)),decreasing=TRUE)
  bigrams.count <- count.ngrams(2,docs)
  trigrams.count <- count.ngrams(3,docs)
  plot.ngrams(f,unigrams.count,bigrams.count,trigrams.count)
  elapsed.time <- proc.time() - begin.time
  if (timings) {
    print(paste0("analyze.ngrams(",filename,"): "))
    print(elapsed.time)
  }
  # gc() # Garbage collection
}
First, to get a general idea of dataset sizes, let's collect some basic size information for each dataset.
flist <- list.files(path=datafolder, recursive=TRUE, pattern=".*en_.*.txt")
stats <- data.frame()
for(f in flist) {
  stats <- rbind(stats,basic.data.stats(f))
}
stats
## lines words characters
## en_US.blogs.txt 899288 37334441 208361438
## en_US.news.txt 77259 2643972 15683765
## en_US.twitter.txt 2360148 30373832 162385035
The fraction of test data and the line/word counts are listed below. Histograms of the word (unigram), bigram, and trigram frequencies are plotted. The cumulative word count chart illustrates that relatively few unique words (about 1,000 in each case) account for about 75% of the words appearing in the original writings, which contain more than 30,000 unique words.
This finding suggests that the prediction algorithm could perform adequately even if it only tries to guess the word following these 1,000 words. If so, that would likely improve the responsiveness of the predictor significantly, but more analysis will be required.
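One simple way to quantify this in a later phase, sketched here against the sorted unigrams.count table built inside analyze.ngrams, is to find the smallest vocabulary that covers a target fraction of all word occurrences:

coverage.cutoff <- function(unigrams.count,target=0.75) {
  # 'unigrams.count' is the unigram count table sorted by decreasing frequency.
  coverage <- cumsum(unigrams.count) / sum(unigrams.count)
  which(coverage >= target)[1]   # rank of the word at which coverage is reached
}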
Again, the blogs and news datasets share very similar lists of most commonly used words, while tweets most frequently use a different set of words.
date()
## [1] "Sun Sep 04 16:43:55 2016"
data.fraction
## [1] 0.01
analyze.ngrams("en_US.blogs.txt")
## [1] "../data/en_US/en_US.blogs.txt - 9042 lines, 372449 words"
date()
## [1] "Sun Sep 04 16:45:57 2016"
data.fraction
## [1] 0.01
analyze.ngrams("en_US.news.txt")
## [1] "../data/en_US/en_US.news.txt - 10211 lines, 343325 words"
date()
## [1] "Sun Sep 04 16:47:50 2016"
data.fraction
## [1] 0.01
analyze.ngrams("en_US.twitter.txt")
## [1] "../data/en_US/en_US.twitter.txt - 23394 lines, 290732 words"
date()
## [1] "Sun Sep 04 16:51:10 2016"
The following software environment was used to produce this analysis and report. It is included here only to assist in reproduction of these findings. The computer used had 4 cores and 16 GB of RAM.
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tm_0.6-2 NLP_0.1-9 R.utils_2.3.0 R.oo_1.20.0
## [5] R.methodsS3_1.7.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 digest_0.6.9 slam_0.1-38 formatR_1.4
## [5] magrittr_1.5 evaluate_0.9 stringi_1.1.1 rmarkdown_1.0
## [9] tools_3.3.1 stringr_1.0.0 yaml_2.1.13 parallel_3.3.1
## [13] htmltools_0.3.5 knitr_1.13
date()
## [1] "Sun Sep 04 16:51:11 2016"
data.fraction <- 1.0
analyze.ngrams("tiny_data.txt")
## [1] "../data/en_US/tiny_data.txt - 7 lines, 33 words"