This is a milestone report for the Coursera Data Science Specialization capstone project. The capstone project involves developing an algorithm for predicting the next word a user will type based on the previous words she has typed. This sort of feature is commonly seen in smartphone instant messaging applications.
Solving this problem will involve many of the skills taught in the Data Science Specialization such as data gathering, data cleansing, data exploration, R coding, statistical inference, machine learning and Web application development. Some basic guidance was provided, such as suggesting the use of the “tm” (text mining) package in R, but for the most part we will need to fire up our hacking skills and solve this problem on our own!
This report will demonstrate my understanding of the problem and the general approach of how I plan to solve it. The final completed project will be due 6 weeks from now.
A set of text files containing a total of over 700 million bytes of blog entries, Twitter messages and news articles was provided. I will use this data to identify patterns of word use, frequencies of words and word combinations, in order to produce an algorithm with sufficient accuracy that a user will feel it improves the efficiency of their typing.
The project will involve following these steps:
Download the training data
Do a basic examination of the training data
Prepare a sample from each training file large enough to be meaningful but small enough that exploratory analysis runs quickly
Build a “corpus” from the sample files
Preprocess the corpus
Explore the corpus
Develop a fast, efficient prediction algorithm
Develop a data storage schema that the algorithm can work with
Test & Refine
Publish
Present
Training data for this project was provided on the Web in a zip file from this organization: http://www.corpora.heliohost.org. Several languages are available including English, German, Finnish and Russian. I plan to use only the English data. The training data consists of three files.
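Downloading and unzipping the training data only needs base R. The sketch below is illustrative only: the URL is a placeholder, since the actual download link is not reproduced here, and I assume the archive unpacks into the final/en_US directory referenced below.
zipUrl <- "http://www.corpora.heliohost.org/path/to/training-data.zip" ## placeholder, not the actual link
zipFile <- path.expand("~/Documents/RProjects/DataScienceCapstone/training-data.zip")
if (!file.exists(zipFile)) {
    download.file(zipUrl, destfile = zipFile)
    unzip(zipFile, exdir = path.expand("~/Documents/RProjects/DataScienceCapstone")) ## assumed to produce final/en_US
}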
Line counts for the files were determined using countLines from the R.utils package.
countLines("~/Documents/RProjects/DataScienceCapstone/final/en_US/en_US.blogs.txt")
The maximum line length in each file was determined using an awk command in my Mac’s Terminal console:
awk ' { if ( length > x ) { x = length } }END{ print x }' ~/Documents/RProjects/DataScienceCapstone/final/en_US/en_US.blogs.txt
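The file sizes shown in the table below can be obtained with file.info from base R; the exact call used is not shown here, but it amounts to something like this (size converted to megabytes):
file.info("~/Documents/RProjects/DataScienceCapstone/final/en_US/en_US.blogs.txt")$size / 10^6 ## file size in MB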
## FileName FileSize NumberOfLines MaximumLineLength
## 1 en_US.blogs.txt 210.2 MB 899288 40836
## 2 en_US.twitter.txt 167.1 MB 2360148 214
## 3 en_US.news.txt 205.8 MB 1010242 11385
The first line in the Blogs file looks like this:
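It can be read with a single readLines call (a minimal example, not necessarily the exact call used):
readLines("~/Documents/RProjects/DataScienceCapstone/final/en_US/en_US.blogs.txt", n = 1) ## read only the first line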
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
Additional examination of the files involving word frequency counts, word associations, n-grams, etc., is more easily done using tools specifically designed for text mining. Scripting languages like Python, Perl and PHP could be used, but for purposes here I’ll focus on using R, the tm (text mining) package and the NLP (natural language processing) package.
My initial attempt to load and manipulate the three files in their original size failed because of extremely long processing times. After some experimentation with processing times for various sample sizes I decided to use 20,000 lines from each of the three files to do my initial exploration and modeling. At some point I’ll need to figure out how to use the entire training data sets.
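The samples were written out as three new files in a CorpusData directory. The code below is a sketch of that step, assuming the first 20,000 lines of each file are taken rather than a random sample; the output file names match those that appear in the corpus later. (The reader package’s n.readLines, loaded below, is an alternative way to read a block of lines.)
## Sketch of the assumed sampling step: take the first 20,000 lines of each file
## and write them to the CorpusData directory used to build the corpus below.
inDir <- "~/Documents/RProjects/DataScienceCapstone/final/en_US"
outDir <- "~/Documents/RProjects/DataScienceCapstone/CorpusData"
if (!dir.exists(outDir)) dir.create(outDir)
samples <- c(en_US.blogs.txt   = "BlogsSample.txt",
             en_US.news.txt    = "NewsSample.txt",
             en_US.twitter.txt = "TwitterSample.txt")
for (f in names(samples)) {
    lines <- readLines(file.path(inDir, f), n = 20000, skipNul = TRUE) ## first 20,000 lines only
    writeLines(lines, file.path(outDir, samples[[f]]))
}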
The first step in being able to effectively manipulate and analyze text data using R is to load the documents into an entity called a “corpus”. A corpus acts like a database in that it is a container for document data and accompanying metadata about the documents. Once the documents are loaded into a corpus, other tm commands can be used to work with it. The code below combines the three smaller sample files located in the CorpusData directory into a corpus.
## Load r libraries needed for this work
library(tm) ## for text mining
library(R.utils) ## for countLines
library(reader) ## for n.readLines
library(RWeka) ## for n-grams
library(data.table) ## for setDT
library(ggplot2) ## for histogram
dirpath <- '~/Documents/RProjects/DataScienceCapstone/CorpusData'
theCorpus <- VCorpus(DirSource(dirpath, encoding = "UTF-8"), readerControl = list(reader = readPlain))
## Using VCorpus (volatile corpus) causes the corpus to be stored in memory for faster processing.
Once the documents are stored in a corpus the next step is to perform some basic clean up to prepare the data for further analysis. This is where we will remove numbers, punctuation, extra whitespace, etc. This demonstrates the power of the tm package, since each of these preprocessing steps requires only one line of code.
## Strip out whitespace
theCorpus <- tm_map(theCorpus, stripWhitespace)
## Convert to lower case
theCorpus <- tm_map(theCorpus, content_transformer(tolower))
## Do not remove stop words (most common words in the language) for now.
## We theorize having them will enhance the predictive model. We will test this when we actually develop our algorithm.
## theCorpus <- tm_map(theCorpus, removeWords, stopwords("english"))
## Prepare a list of profane words to remove from the corpus.
## Citation for list of profane words to remove: http://www.frontgatemedia.com/new/wp-content/uploads/2014/03/Terms-to-Block.csv
bw <- read.csv("~/Documents/RProjects/DataScienceCapstone/Terms-to-Block.csv")
bw2 <- as.vector(bw[4:726, 2]) ## keep only the rows and column that contain the terms
bw3 <- gsub(',', '', bw2) ## strip stray commas from the terms
## Remove profanity
theCorpus <- tm_map(theCorpus, removeWords, bw3)
## Remove punctuation
theCorpus <- tm_map(theCorpus, removePunctuation)
## Remove numbers
theCorpus <- tm_map(theCorpus, removeNumbers)
## Do not perform stemming (reducing inflected, or sometimes derived, words to their word stem; eg running -> run) yet.
## We theorize having all forms of words will enhance the predictive model. We will test this later.
## theCorpus <- tm_map(theCorpus, stemDocument)
After preprocessing, the next step is to build a term document matrix (TDM). A TDM is a matrix that contains the count of every word in each document in the corpus. There will be a row for every word in the corpus, and a column for every document. Again, the tm package lets you do this in one simple step. Looking at the first 10 terms in the TDM you can see there are a lot of low-frequency terms in the corpus that are non-English, contain non-printable characters, or are not real words.
tdm <- TermDocumentMatrix(theCorpus)
inspect(tdm[1:10,]) ## 1st 10 words. Matrix is ordered alphabetically
## <<TermDocumentMatrix (terms: 10, documents: 3)>>
## Non-/sparse entries: 11/19
## Sparsity : 63%
## Maximal term length: 13
## Weighting : term frequency (tf)
##
## Docs
## Terms BlogsSample.txt NewsSample.txt TwitterSample.txt
## 1 0 0
## 1 0 0
## ½in 1 0 0
## ½yearold 0 1 0
## aaa 1 8 0
## aaaa 0 1 0
## aaaaaand 1 0 0
## aaaaand 1 0 0
## aaaagggghhhh 1 0 0
## aaaandimpasse 1 0 0
Further processing with the removeSparseTerms command eliminates terms that appear in only some of the documents. Sparsity refers to the proportion of documents in the corpus in which a particular term does not appear. This is an area where experimentation is needed to find a good “sparsity” value to use. I used 0.01, which keeps only terms that are absent from at most 1% of the documents; with just three documents, that means a term must appear in all three to be retained. After removing sparse terms we can see that the TDM I will work with further is much cleaner.
tdmrs <- removeSparseTerms(tdm, .01)
inspect(tdmrs[1:10,])
## <<TermDocumentMatrix (terms: 10, documents: 3)>>
## Non-/sparse entries: 30/0
## Sparsity : 0%
## Maximal term length: 9
## Weighting : term frequency (tf)
##
## Docs
## Terms BlogsSample.txt NewsSample.txt TwitterSample.txt
## aaron 14 25 6
## aarons 1 1 1
## abandoned 14 20 3
## abbey 6 4 6
## abby 9 1 1
## abc 9 17 3
## abcs 3 6 2
## ability 76 80 6
## able 305 194 50
## aboard 4 17 3
The following bar chart shows the counts of the 25 words that appear most frequently in the cleaned-up sample corpus.
freq <- sort(rowSums(as.matrix(tdmrs)), decreasing=TRUE) ## calculate frequency of every word in the corpus
wf <- data.frame(word=names(freq), freq=freq) ## convert results to a data frame for plotting
wf2 <- wf[1:25,] ## Take just the 25 most frequent words
## Build the bar chart of the 25 most frequent words
p <- ggplot(wf2, aes(x = reorder(word, -freq), y = freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p + labs(x = "Word", y = "Frequency")
The plot shows the most frequent words are so-called “stop words”. I will later investigate whether removing these from the corpus leads to a more effective prediction algorithm.
My solution for this project will be based on the analysis of n-gram patterns in the data. Specifically, I will be looking for three-word sequences in the training data and applying some statistics to their occurrence. The idea is that, after the user has entered two words, there will be a probabilistic rationale for predicting the third word based on what was found in the training data. It is a fairly simple method, but hopefully it will give good results.
The following code generates the 3-gram term document matrix for the three sample training documents and eliminates “sparse” 3-grams. As before, sparsity refers to the proportion of documents in the corpus in which a particular n-gram does not appear, and experimentation is needed to find a good value to use. I again used 0.01, which keeps only the word triplets that appear in all three sample documents.
## Citation: http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka
## Weka tokenizer fails on OSX if parallel processing is turned on.
## The above citation identified the solution which is the next line.
options(mc.cores=1)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
ng <- TermDocumentMatrix(theCorpus, control = list(tokenize = TrigramTokenizer))
ngrs <- removeSparseTerms(ng, .01)
inspect(ngrs[1:10,])
## <<TermDocumentMatrix (terms: 10, documents: 3)>>
## Non-/sparse entries: 30/0
## Sparsity : 0%
## Maximal term length: 17
## Weighting : term frequency (tf)
##
## Docs
## Terms BlogsSample.txt NewsSample.txt TwitterSample.txt
## a b and 1 1 1
## a bad day 2 3 2
## a bad idea 4 3 2
## a bad thing 7 3 1
## a bag of 6 5 1
## a bar that 2 1 1
## a beautiful thing 2 1 3
## a better job 2 4 1
## a better world 1 1 1
## a big deal 5 6 5
Next, create a matrix of 3-grams (triplets) and calculate the total number of their occurrences across the training samples. The first 10 and last 10 triplets and their occurrence counts are displayed.
## Build a matrix of triplets and their occurrence counts in the training sample.
nGramMatrix = as.data.frame(as.matrix(ngrs)) ## convert the n-gram matrix to a data frame
nGramMatrix$occurances <- rowSums(nGramMatrix[,1:3]) ## total occurrences of each n-gram across the three samples
nGramMatrix <- subset(nGramMatrix, select = c("occurances")) ## keep only the totals column
nGramMatrix <- setDT(nGramMatrix, keep.rownames = TRUE)[] ## use row names to create a new column
names(nGramMatrix)[names(nGramMatrix) == 'rn'] <- 'triplet' ## name the new column
print(head(nGramMatrix, 10)) ## display first 10 n-grams
## triplet occurances
## 1: a b and 3
## 2: a bad day 7
## 3: a bad idea 9
## 4: a bad thing 11
## 5: a bag of 12
## 6: a bar that 4
## 7: a beautiful thing 6
## 8: a better job 7
## 9: a better world 3
## 10: a big deal 16
print(tail(nGramMatrix, 10)) ## display last 10 n-grams
## triplet occurances
## 1: youre not the 4
## 2: youre one of 6
## 3: youre sick of 3
## 4: youre talking about 9
## 5: youre willing to 5
## 6: yourself and your 5
## 7: youve got a 8
## 8: youve got to 22
## 9: youve made it 3
## 10: youve never seen 3
The table above will be used as the basis for the prediction algorithm: given the two previous words, I will suggest the word that most frequently follows them in the training data, as sketched below.
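As a rough illustration of that idea (a sketch only, not the final algorithm or storage schema), the triplet table can be split into a two-word prefix and the word that follows it, and the prediction is simply the highest-count match for the prefix the user typed. The helper name predictNext is hypothetical.
## Illustrative sketch only: split each triplet into its two-word prefix and the word that follows it.
nGramMatrix[, prefix := sub("^(\\S+ \\S+) \\S+$", "\\1", triplet)]
nGramMatrix[, nextWord := sub("^\\S+ \\S+ (\\S+)$", "\\1", triplet)]
## Suggest the word with the highest occurrence count following the given two words.
predictNext <- function(twoWords) {
    candidates <- nGramMatrix[prefix == tolower(twoWords)]
    if (nrow(candidates) == 0) return(NA_character_)
    candidates[which.max(occurances), nextWord]
}
predictNext("a bad") ## with the sample counts above this returns "thing" (11 occurrences)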