Capstone Project Milestone Report

Executive Summary

The Capstone Project was initiated to build a model that predicts the next word based on an input phrase of multiple words. This milestone report details the status of each task in the project task list.

[Figure: Capstone project tasks]

The project has proved to be especially challenging given the lack of familiarity with natural language processing. At the same time, these challenges have been exciting due to the amount of new knowledge being acquired.

The milestone report includes a review of the source data and an exploratory data analysis, along with forward-looking plans for the remaining project tasks.

Observations Gained to Date

The data contain characters (digits, control codes, foreign language characters, punctuation, etc.) that will not add much predictive value.
An approach should be developed to identify sentences within the data sets, to eliminate n-grams in which the last word of one sentence is tied to the first word of the next sentence (e.g. “The dog ran fast. The boy fell down” might lead to “ran fast the” if sentences are not considered).
The top 50% of the vocabulary covers most of the sampled data.
Many sparse elements exist that may not add much predictive value.
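
The sentence-boundary observation above can be illustrated with a minimal base R sketch. The helper name sentence_ngrams and the simple rule of splitting on ".", "!" and "?" are assumptions made for illustration, not part of the project code; a real sentence detector would need to handle abbreviations and similar cases.

```r
# Hypothetical helper: build n-grams within sentences only (base R, no packages).
sentence_ngrams <- function(text, n = 3) {
  # Assumption: a sentence ends at any run of ".", "!" or "?".
  sentences <- unlist(strsplit(text, "[.!?]+"))
  grams <- lapply(sentences, function(s) {
    # Lowercase, then keep only letters, apostrophes, and spaces.
    s <- gsub("[^a-z' ]", "", tolower(s))
    words <- strsplit(trimws(s), "\\s+")[[1]]
    if (length(words) < n) return(character(0))
    idx <- seq_len(length(words) - n + 1)
    vapply(idx, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  unlist(grams)
}

exampleTrigrams <- sentence_ngrams("The dog ran fast. The boy fell down", n = 3)
print(exampleTrigrams)
```

Because each sentence is tokenized separately, the cross-sentence trigram “ran fast the” cannot be produced.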

Plan for the Predictive Language Model

Load the raw data sets and perform data cleansing (including profanity filtering), similar to the steps outlined in this milestone report.
Sample the data and build n-gram tokens (1 to 4 token n-grams) and then remove sparse tokens.
Build a language model from the maximum likelihood estimates of the n-gram probabilities, stored as a lookup table that maps a key (the (n-1)-word phrase) to the predicted next word, chosen as the word with the highest probability.
Given an input phrase (n words in length), the last n words are used as the lookup key in the language model. If a match is found, the predicted word is returned. If not, a back-off strategy is used to look up the last n-1 words, then n-2 words, and so on, until a match is found.
Certainly, there are additional considerations to be included, but they will be discovered as the project progresses.
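
The lookup-table and back-off steps in the plan above can be sketched in base R as follows. This is only an illustration under stated assumptions: the training sentences are made up, and the helper names make_pairs, build_lookup and predict_next are hypothetical rather than taken from the project code. No smoothing or discounting is applied; the back-off simply retries with a shorter key.

```r
# Toy training sentences (made up for illustration only).
sentences <- c("the dog ran fast", "the dog ran home",
               "the dog ran fast", "a cat ran away")

# Build (prefix, next word) pairs for n-grams of order n (n >= 2).
make_pairs <- function(words, n) {
  if (length(words) < n) return(NULL)
  idx <- seq_len(length(words) - n + 1)
  data.frame(
    prefix = vapply(idx, function(i) paste(words[i:(i + n - 2)], collapse = " "),
                    character(1)),
    nextWord = words[idx + n - 1],
    stringsAsFactors = FALSE
  )
}

# For each prefix, keep the next word with the highest estimated probability.
build_lookup <- function(sentences, n) {
  tokens <- strsplit(sentences, " ", fixed = TRUE)
  pairs <- do.call(rbind, lapply(tokens, make_pairs, n = n))
  if (is.null(pairs)) return(character(0))
  counts <- table(pairs$prefix, pairs$nextWord)
  # Maximum likelihood estimate: P(next | prefix) = count(prefix, next) / count(prefix).
  probs <- counts / rowSums(counts)
  best <- colnames(probs)[apply(probs, 1, which.max)]
  setNames(best, rownames(probs))
}

# One lookup table per order, tried from the longest key to the shortest.
orders <- 4:2
lookups <- lapply(orders, function(n) build_lookup(sentences, n))
names(lookups) <- orders

predict_next <- function(phrase, lookups) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  for (n in names(lookups)) {
    k <- as.integer(n) - 1              # key length for this n-gram order
    if (length(words) < k) next
    key <- paste(tail(words, k), collapse = " ")
    hit <- lookups[[n]][key]
    if (!is.na(hit)) return(unname(hit))
  }
  NA_character_                         # no match at any order
}

predict_next("my big dog ran", lookups)
```

In the usage line, the three-word key “big dog ran” has no entry in the 4-gram table, so the lookup backs off to the two-word key “dog ran” in the trigram table.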

Task 0 - Understand the problem

The project began with research into Natural Language Processing, Text Mining, and the associated tools in R, based on links provided via the Coursera course page:

Natural language processing Wikipedia page

Text mining infrastructure in R

CRAN Task View: Natural Language Processing

Coursera course on NLP (not in R)

After exhausting the above materials, general research from a variety of Google searches and online book resources was collected in Evernote. Those materials will be referenced in the final project report where they are used.

Task 1 - Data Acquisition and Cleaning

Data Sources

Three data sources were utilized from the Capstone dataset and loaded into R using the readLines command. The following summary statistics describe the source data files:

File	Line Count	Word Count	Character Count	Avg. Chars per Line	Std. Dev. Chars per Line
en_US.blogs.txt	899,288	39,126,759	206,912,084	230.1	258.7
en_US.news.txt	1,010,242	36,723,514	203,177,180	201.1	133.2
en_US.twitter.txt	2,360,148	32,794,523	162,091,410	68.7	37.2
Totals	4,269,678	108,644,796	572,180,674
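
The statistics in the table above could be computed along the following lines. This is a sketch only: the helper name summarize_lines is hypothetical, and counting words as whitespace-separated tokens is an assumption, since the report does not state how words were counted.

```r
# Hypothetical helper: summary statistics for a character vector of lines.
summarize_lines <- function(lines) {
  chars <- nchar(lines)
  # Assumption: a word is any run of non-whitespace characters.
  words <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    lineCount = length(lines),
    wordCount = sum(words),
    charCount = sum(chars),
    avgChars  = round(mean(chars), 1),   # average characters per line
    sdChars   = round(sd(chars), 1)      # standard deviation of characters per line
  )
}

# Stand-in data; in the report the input would be the result of readLines().
demoLines <- c("The dog ran fast.", "The boy fell down", "Hello")
demoStats <- summarize_lines(demoLines)
print(demoStats)
```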

Sample Source Files

Given the large size of the raw data sources, a 1% sample of each data object was constructed after loading the source data into R. The samples were then summarized for comparison with the original data sources to ensure that a representative sample had been selected.
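
A minimal sketch of the 1% sampling step is shown here. The helper name sample_lines, the seed value, and the stand-in data are assumptions for illustration; the report does not show the sampling code. Rounding the sample size down is consistent with the sample line counts reported in this section (for example, 1% of 899,288 lines gives 8,992).

```r
set.seed(1234)  # arbitrary seed, used here only so the sketch is reproducible

# Hypothetical helper: simple random sample of lines, without replacement.
sample_lines <- function(lines, fraction = 0.01) {
  n <- floor(length(lines) * fraction)
  lines[sample.int(length(lines), n)]
}

# Stand-in data; in the report this would be the blogs, news, or tweets vector.
fakeLines <- paste("line", 1:10000)
fakeSample <- sample_lines(fakeLines)
length(fakeSample)
```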

R Object	Line Count	Word Count	Character Count	Avg. Chars per Line	Std. Dev. Chars per Line
blogsSample	8,992	389,686	2,065,345	229.7	254.6
newsSample	10,102	366,565	2,025,874	200.5	132.7
tweetsSample	23,601	329,084	1,626,085	68.9	37.2
Totals	42,695	1,085,335	5,717,304

Data Cleansing Prior to Corpus Text Formation

Several steps were taken to cleanse each sample data set prior to processing them into corpus objects:

Encoding the data as “UTF-8” (e.g. Encoding(blogs) <- "UTF-8")
Stripping out any characters that are not valid UTF-8 using the iconv command (e.g. blogsUTF8 <- iconv(blogs, "UTF-8", "UTF-8", sub=''))
Normalizing the text using the stringi package, per code from R.K. Lancer’s GitHub repository

library("stringi")
asciify <- function(text) { 
    text <- stri_trans_general(text, 'Any-Latin')
    text <- stri_trans_general(text, 'Any-Publishing')
    text <- stri_trans_general(text, 'Latin-Ascii')
    text
}
tweetsNormalized <- asciify(tweets)

The cleansed data samples were then combined into a single text object (rawText) and processed into a corpus called docs:

library("tm")  # provides Corpus() and VectorSource()
docs <- Corpus(VectorSource(rawText))

Data Cleansing within Corpus Text

Cleansing within the corpus docs was then executed as follows prior to tokenization:

Defined a specific function toSpace as a content_transformer for some special characters:

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Character class: each listed special character is replaced with a space
docs <- tm_map(docs, toSpace, "[/@|#~!$%^&*()_+:<>?,.;\"\u2614\u26a1]", mc.cores=1)

Transformed the corpus text to all lowercase to compress the data sets and also normalize the word groups using docs <- tm_map(docs, content_transformer(tolower))
Removed numbers from the corpus text using docs <- tm_map(docs, content_transformer(removeNumbers))
Removed punctuation from the corpus text using docs <- tm_map(docs, content_transformer(removePunctuation))
Applied a profanity filter to remove “dirty words” based on a GitHub repository using docs <- tm_map(docs, removeWords, profanityFilter), where profanityFilter is the list of English words in the GitHub repository
Finally, the corpus text was transformed by removing the additional whitespace using the following R code:

# Note: A loop was the only way to achieve execution of the stripWhitespace
# command due to errors encountered trying to use the content_transformer() function
for(x in seq_along(docs)) {
    docs[[x]] <- stripWhitespace(docs[[x]])
}

Note: No stop-word removal (removeWords with stopwords) was performed on the corpus text, since the predictive model itself needs to include these words.

Task 2 - Exploratory Analysis

After the corpus data cleansing was completed, an exploratory analysis was undertaken by forming a DocumentTermMatrix for the corpus text.

library("tm")
dtmFinal <- DocumentTermMatrix(docs)
dtmFinal

Given the high sparsity rate (100%) and the high count of terms (56293), the next step was to apply removeSparseTerms to obtain a reasonable set of terms (428 based on the sparsity factor applied) for additional exploratory analysis.

dtmFinal.ns <- removeSparseTerms(dtmFinal, 0.995)
dtmFinal.ns
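
To make the sparsity factor concrete, the following base R sketch applies the same kind of filter to a small, made-up document-term matrix. It assumes that removeSparseTerms keeps only terms whose share of empty documents is below the given threshold, which is how the tm documentation describes it; the toy matrix and the 0.6 threshold are illustrative only. Under that reading, and assuming each sampled line is one document, a factor of 0.995 keeps terms that occur in more than roughly 0.5% of the documents.

```r
# Toy document-term counts (rows = documents, columns = terms); values are made up.
m <- matrix(c(1, 0, 2, 1,
              1, 0, 0, 0,
              3, 0, 0, 1,
              1, 1, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("the", "zebra", "dog", "ran")))

# Sparsity of a term = share of documents in which the term does not occur.
termSparsity <- colMeans(m == 0)

# Keep terms whose sparsity is below the threshold (0.6 here; the report used 0.995).
keep <- termSparsity < 0.6
mDense <- m[, keep, drop = FALSE]
print(mDense)
```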

A term frequency matrix was constructed from the non-sparse document term matrix, and a simple ordered histogram was plotted of the 25 most frequent words in the frequency matrix:

A word cloud was also constructed from the frequency matrix:


The corpus text was next converted to a data frame for applying RWeka to form n-grams (1-gram, 2-gram, 3-gram, and 4-gram):

# Convert corpus "docs" to Data Frame for application of RWeka functions 
cleanedText <- data.frame(text=unlist(sapply(docs, '[', "content")), stringsAsFactors=FALSE)

# Build the n-grams for the model
library("RWeka")
unigrams <- NGramTokenizer(cleanedText, Weka_control(min=1, max=1))
bigrams <- NGramTokenizer(cleanedText, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
trigrams <- NGramTokenizer(cleanedText, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
quadgrams <- NGramTokenizer(cleanedText, Weka_control(min = 4, max = 4, delimiters = " \\r\\n\\t.,;:\"()?!"))

Frequency tables were then built and saved to disk from the n-grams for further exploratory analysis. The n-gram frequency tables were then sorted and plotted based on the Top 25 most frequent entries in each n-gram collection:
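
A frequency table of this kind can be sketched in base R as shown here. The token vector is made up, and the column names term, count and coverage are hypothetical; in the report the tokens would come from the NGramTokenizer calls. The cumulative coverage column is one way to check how much of the data the most frequent terms account for.

```r
# Toy unigram tokens (made up for illustration only).
tokens <- c("the", "dog", "ran", "the", "boy", "ran", "the", "end")

# Frequency table sorted from most to least frequent.
freq <- sort(table(tokens), decreasing = TRUE)
freqTable <- data.frame(term = names(freq), count = as.integer(freq),
                        stringsAsFactors = FALSE)

# Cumulative share of all tokens covered by the k most frequent terms.
freqTable$coverage <- cumsum(freqTable$count) / sum(freqTable$count)

# Top entries (the report plots the top 25 of each n-gram collection).
print(head(freqTable, 3))
```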