Input Data

Data was downloaded from the course-provided link and was available in four languages. As I am only fluent in English, I used only the files in the en_US directory. Three files of approximately 200 MB each were provided, containing text from the following sources: News, Blogs, and Twitter.

Data Preparation

My first attempt to access the files was with the tm text-mining package, which provides a framework for treating all documents in a single directory as a corpus and also contains a number of conversion and filtering functions. Initially I had difficulty reading in the entire contents of one of the files using tm. The Capstone discussion forum provided the diagnosis and the solution: embedded NUL characters were signaling end-of-file prematurely. The fix was to open the files in binary mode and filter out the unwanted characters.
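
One way to do this in base R is sketched below, assuming fname holds the path to one of the files; the skipNul argument to readLines drops the embedded NULs while reading:

con   <- file(fname, open = "rb")          # open in binary mode so NULs do not truncate the read
lines <- readLines(con, skipNul = TRUE)    # skip embedded NUL characters
close(con)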

Filtering

Once I was able to read in the complete file contents, I used gsub to filter each file with the following regular expression: [^A-Za-z'?!. ]. This preserved alphabetic characters and removed the embedded NULs along with all numbers, punctuation, and special characters except for the space, the single quote, and the three sentence-ending characters: period, question mark, and exclamation point.
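
Applied to a character vector of lines (here assumed to be named lines), the filter is a single gsub call:

lines <- gsub("[^A-Za-z'?!. ]", "", lines)   # keep letters, space, the single quote, and . ? !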

Next, I used the Google Bad Word List and gsub to replace all words from this list with asterisks, marking that a word had been removed. This would allow me to skip over word sequences in which an offending word had been removed, so as not to construct non-existent n-grams. Finally, I replaced all white-space characters with a space. Though the filtered files were not significantly smaller, they were considerably faster to read; I suspect this is because they could be opened in standard mode rather than binary.
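
A sketch of this step, assuming bad_words is a character vector read from the downloaded list (the path shown is hypothetical):

bad_words <- readLines("data/en_US/bad_words.txt")            # hypothetical path to the word list
pattern   <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
lines     <- gsub(pattern, "*", lines, ignore.case = TRUE)    # mark removed words with an asterisk
lines     <- gsub("[[:space:]]", " ", lines)                  # replace white-space characters with a space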

Tagging

Using the filtered files as input, I performed a second round of processing with gsub to standardize the content (a sketch of the corresponding calls follows the list below). This process was done incrementally and relied on visual inspection of the resulting text files to verify that the transformations were yielding the intended result.

  1. Convert all alpha characters to lowercase
  2. Normalize intra-line sentence endings by replacing exclamation points and question marks with periods
  3. Drop repeating periods
  4. Drop repeating single quotes
  5. Add ‘[’ and ‘]’ to the beginning and end of each line, respectively.
  6. Remove any periods between the last word and the ‘]’ character
  7. Remove any quotes that were not embedded between alphabetic characters.
  8. Replace all consecutive white space characters with a single space
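
Roughly, these steps correspond to the following gsub and paste calls (a sketch on a character vector lines, not the exact expressions):

lines <- tolower(lines)                                           # 1. lowercase
lines <- gsub("[!?]", ".", lines)                                 # 2. normalize sentence endings
lines <- gsub("\\.+", ".", lines)                                 # 3. drop repeating periods
lines <- gsub("'+", "'", lines)                                   # 4. drop repeating single quotes
lines <- paste("[", lines, "]")                                   # 5. add line delimiters
lines <- gsub("\\. *\\]$", " ]", lines)                           # 6. drop the period before the closing ]
lines <- gsub("(?<![a-z])'|'(?![a-z])", "", lines, perl = TRUE)   # 7. drop quotes not embedded in a word
lines <- gsub("[[:space:]]+", " ", lines)                         # 8. collapse white space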

This resulted in lines that looked like this:

[ single sentence line ]
[ sentence with * stop word removed ]
[ multi sentence line . second sentence ]

Exploratory Data Analysis

n-grams

After researching a number of text-mining packages, it appeared that a fairly common way to represent n-grams (essentially frequency tables for consecutive sequences of one, two, three, etc. words) was via sparse matrices, which seemed counter-intuitive to me. n-grams could more directly be stored as rows in a table or data.frame, along with an arbitrary amount of associated data, such as count, frequency, cumulative frequency, and other properties. Each word could also be assigned a unique integer id (much like the implementation of a factor in R), which would save significant memory if the ids were used in storing the n-gram data structures for n > 1.

My preference for data.frame-type structures is the data.table, with which I built frequency tables for 1-, 2-, 3-, and 4-grams. The uni-gram table stored the character string representation of each word and mapped a unique integer id to it; these ids were used in constructing the higher-order n-grams.
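
As a toy illustration of this layout (made-up words and counts; the actual column names may differ):

library(data.table)
# uni-gram table: one row per distinct word, with its integer id and observation count
ng1 <- data.table(id = 1:3, word = c("the", "quick", "fox"), obs = c(120L, 15L, 9L))
# higher-order tables store word ids rather than strings, e.g. a bi-gram table:
ng2 <- data.table(id1 = c(1L, 2L), id2 = c(2L, 3L), obs = c(7L, 5L))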

Single Word Frequency

In total the three filtered files contained 4.2M lines of text. The frequency distribution from a sample of just over 5% of these lines showed that the vast majority of the 25,390 distinct words occur with a frequency of less than 1%.

library(data.table)
# Load the sampled uni-gram frequency table (object f1) and convert it to a data.table
load(file.path(getwd(), 'data/en_US/test1_ng1.RData'))
f1 <- data.table(f1)
# Distribution of the (rounded) frequency column
table(round(f1$pct, 2))
## 
##     0  0.01  0.02  0.03  0.07 
## 25369    13     5     2     1

Further inspection showed that the number of observations is quite skewed. To see anything meaningful in a plot, we need to log-transform the observation counts. This reveals an overwhelming number of words that appear only 1 or 2 times. These will clearly not be useful for constructing bi-grams or higher-order n-grams.

library(ggplot2)
qplot(log(f1$obs))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

By setting aside words observed only once or twice, a meaningful histogram can be made. Here I plot the observation counts from 3 to 20.

qplot(f1[obs %in% 3:20, obs])
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Filtering low frequency words

Based on the exploratory data analysis done to this point, I added a cumulative frequency column to the uni-gram data, computed from the percentage of observations for each word ordered from most frequent to least frequent. This allowed me to specify a cumulative-frequency cut-point for filtering out low-frequency words.
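
This could be done along the following lines with data.table (a sketch; cumpct, cut_point, and f1_kept are illustrative names, and f1 is assumed to be the uni-gram table shown above):

setorder(f1, -obs)                     # order words from most to least frequent
f1[, cumpct := cumsum(pct)]            # running cumulative frequency
cut_point <- 0.89                      # e.g. an 89% cut-point, assuming pct is a proportion
f1_kept   <- f1[cumpct <= cut_point]   # retain only the high-frequency words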

I started with a 99% cut-point and examined the lowest-frequency words that remained. The majority of these were either what appeared to be uncommon surnames or gibberish that looked like the result of random key presses. I continued to lower the cut-point by 1% at a time, and it was not until I reached 91% that the remaining lowest-frequency words began to favor words one might expect to encounter somewhat regularly. In the end I settled on a cut-point of 89%, which struck a good balance between a reasonable working word-list size for prediction (6,882 words) and filtering out items that had almost no chance of being entered into the Shiny prediction app.