Synopsis

The goal of this report is to demonstrate that I have become familiar with the data and that I am on track to create my prediction algorithm. Specifically, this report will:

  1. Demonstrate that I’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that I’ve amassed so far.
  4. Provide my plans for creating a prediction algorithm and Shiny app to gather feedback.

R Libraries Utilized

The following R libraries are used in this analysis:

library(tm) # framework for text mining
library(SnowballC) # provides wordStem() for stemming
library(RColorBrewer) # generate palette of colours for plots
library(ggplot2) # plot word frequencies
library(scales) # format axis scales for plots
library(Rgraphviz) # provides term correlation plots
library(tidyr) # assists in cleaning & preparing data
library(dplyr) # assists in data manipulation, transformation, & summarization
library(RWeka) # tokenization

Loading Data

Since all three files are extremely large (twitter file: 167.1MB, blog file: 210.2MB, news file: 205.8MB), I am going to take a small randomized sample of each file to explore their contents.

# read in blogs data and select a random 10% of the lines
blogs <- readLines("~/Desktop/Personal/Education & Training/Coursera/Capstone/final/en_US/en_US.blogs.txt")
set.seed(123)
blogs <- blogs[sample(length(blogs), round(length(blogs) * 0.10))]
write.csv(blogs, file = "~/Desktop/Personal/Education & Training/Coursera/Capstone/Capstone/Sample/blog.sample.csv", row.names = FALSE, col.names = FALSE)

# read in news data and select a random 10% of the lines
news <- readLines("~/Desktop/Personal/Education & Training/Coursera/Capstone/final/en_US/en_US.news.txt")
set.seed(123)
news <- news[sample(length(news), round(length(news) * 0.10))]
write.csv(news, file = "~/Desktop/Personal/Education & Training/Coursera/Capstone/Capstone/Sample/news.sample.csv", row.names = FALSE, col.names = FALSE)

# read in twitter data and select a random 10% of the lines
twitter <- readLines("~/Desktop/Personal/Education & Training/Coursera/Capstone/final/en_US/en_US.twitter.txt")
set.seed(123)
twitter <- twitter[sample(length(twitter), round(length(twitter) * 0.10))]
write.csv(twitter, file = "~/Desktop/Personal/Education & Training/Coursera/Capstone/Capstone/Sample/twitter.sample.csv", row.names = FALSE, col.names = FALSE)

# clean up global environment
rm(blogs, news, twitter)

# combine the sampled datasets into a Corpus
full <- Corpus(DirSource("~/Desktop/Personal/Education & Training/Coursera/Capstone/Capstone/Sample/"), readerControl = list(language="en_US"))

Exploring the Corpus

The resulting sample should be inspected to gain a basic understanding of the data and to ensure it was loaded properly. The following functions show the number of documents in the Corpus, the metadata attached to each document, the length of each document, and the first line or two of each document's content.

# view the number of documents in the Corpus
length(full)
## [1] 3
# list the three documents in the Corpus
summary(full)
##                    Length Class             Mode
## blog.sample.csv    2      PlainTextDocument list
## news.sample.csv    2      PlainTextDocument list
## twitter.sample.csv 2      PlainTextDocument list
# view metadata for one of the documents
meta(full[[1]])
## Metadata:
##   author       : character(0)
##   datetimestamp: 2014-11-11 01:52:05
##   description  : character(0)
##   heading      : character(0)
##   id           : blog.sample.csv
##   language     : en_US
##   origin       : character(0)
# view the length of the sample blog data
length(full[[1]]$content)
## [1] 89929
# view the length of the sample news data
length(full[[2]]$content)
## [1] 101025
# view the length of the sample twitter data
length(full[[3]]$content)
## [1] 236015
# view the first line of the sample blog data (I use $content[2] since write.csv added an "x" header as the first line.  It also wrapped each line in quotes, which display as \" when printed.  This is okay since I will remove these characters when I clean & process the data)
full[[1]]$content[2]
## [1] "\"one together for it is always the one thing we push out\""

Pre-processing the Corpus

Since raw text can cause significant issues when text mining, it’s necessary to pre-process the data with common transformation and filtering functions. These pre-processing functions clean the text and structure the data in a format suitable for text mining. Note that my profanity filter leverages a pre-existing list (the “List of Dirty, Naughty, Obscene, and Otherwise Bad Words”).

# create & apply function to remove special characters
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(full, toSpace, "/|@|\\|")

# remove numbers as they are likely not going to provide added benefit to my analysis
docs <- tm_map(docs, removeNumbers)

# for my initial analysis I will remove punctuation
docs <- tm_map(docs, removePunctuation)

# transform words to lower case (wrapped in content_transformer() so the corpus structure is preserved)
docs <- tm_map(docs, content_transformer(tolower))

# remove profanity words
profanity <- read.csv("~/Desktop/Personal/Education & Training/Coursera/Capstone/Capstone/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/en", header=FALSE, stringsAsFactors=FALSE)
profanity <- profanity$V1
docs <- tm_map(docs, removeWords, profanity)

# remove white space
docs <- tm_map(docs , stripWhitespace)

# for my initial analysis I will remove stopwords; this eliminates overly common words that add little value to my understanding of the documents' context.  Later on I will keep stopwords in when performing tokenization and analyzing for n-grams.
docs_context <- tm_map(docs, removeWords, stopwords("english"))

# perform stemming to remove complexity introduced by multiple suffixes and get word radicals
docs_context <- tm_map(docs_context, stemDocument, language = "english")

# reformat documents to PlainTextDocument to allow for DocumentTermMatrix and TermDocumentMatrix formats
docs_context <- tm_map(docs_context, PlainTextDocument)

Text Mining the Corpus

Now that the text data has been pre-processed, text mining can be performed to understand word counts & line counts, term frequencies, and common n-grams in our sample data.

Summary Statistics

Figure 1 provides basic summary statistics of our sample data set. It shows that the Blog and News data have very similar counts, although their content may vary significantly. Twitter, on the other hand, has significantly more lines yet a lower word count.
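
Figure 1 itself is not reproduced here, but the following is a rough sketch of how line and word counts like those behind it can be computed from the sample corpus (the object names and the whitespace-based word count are illustrative, not my exact code):

# sketch: compute line and word counts for each document in the sample Corpus
stats <- t(sapply(seq_along(full), function(i) {
  txt <- content(full[[i]])                              # lines of document i
  c(lines = length(txt),
    words = sum(sapply(strsplit(txt, "\\s+"), length)))  # whitespace-split word count
}))
rownames(stats) <- names(full)
stats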

Term Frequency

The first analytic approach in text mining is to gain an understanding of how often terms occur. This identifies the terms found most frequently in the documents along with the terms found infrequently, which are likely not of high interest. To perform this analysis, I use DocumentTermMatrix() and TermDocumentMatrix() to create term matrices that allow term frequencies to be summed and summarized. In total, there are 15,492 unique terms in the sample blog data, 15,913 unique terms in the sample news data, and 10,063 unique terms in the sample Twitter data. Figure 2 lists the 20 most common terms identified in each of the sample data sources.
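
As a brief sketch of this step (using the stemmed, stopword-free docs_context corpus from above; the exact calls are illustrative), term frequencies can be tallied from a TermDocumentMatrix:

# sketch: build a term-document matrix and summarize term frequencies
tdm <- TermDocumentMatrix(docs_context)
m <- as.matrix(tdm)

# number of unique terms in each of the three sample documents
colSums(m > 0)

# the 20 most frequent terms in the first document (the other sources are analogous)
head(sort(m[, 1], decreasing = TRUE), 20)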

Across all three sample data sets, there are 3,899 unique words; however, keep in mind these are context words and do not include profanity or stop words such as “me”, “you”, “is”, “of”, etc. (all stop words removed can be found in stopwords("en")). To assess how word popularity compares between the three data sources, Figure 3 identifies the top 20 common words used across all three data sources. This figure shows that, although Twitter has a restrictive character length, the frequencies of many of these common words are relatively similar across data sources (e.g., “will”, “going”, and “make” all have approximately similar frequencies). However, other terms (e.g., “just”, “like”, “get”) do differ considerably in their frequencies.

n-Grams

Understanding the frequency of individual words is important; however, since our task in this Capstone class is to build an algorithm that predicts a word based on previous word inputs, it is important to understand how words are grouped together. An n-gram is simply a contiguous sequence of n words, and these n-grams are the foundation for building predictive algorithms.

There are 157,817 2-grams identified in one or more of the three sample data sets and 4,662 2-grams common to (i.e., found in) all three data sets, with 62% sparsity. Figure 4 provides the 25 most common 2-grams found in all three sample data sources.

There are 237,394 3-grams identified in one or more of the three sample data sets and 1,050 3-grams common to (i.e., found in) all three data sets, with 66% sparsity. Figure 5 provides the 25 most common 3-grams found in all three sample data sources.
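
As a minimal sketch of how these counts can be produced with RWeka’s NGramTokenizer (applied to the docs corpus that still contains stop words; the control settings are illustrative):

# sketch: count 2-grams across the three sample documents
# (3-grams are analogous with min = 3, max = 3)
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm_2 <- TermDocumentMatrix(docs, control = list(tokenize = bigram_tokenizer))

# total number of distinct 2-grams found in one or more documents
nrow(tdm_2)

# 2-grams found in all three documents, sorted by total frequency
m2 <- as.matrix(tdm_2)
common_2grams <- sort(rowSums(m2[rowSums(m2 > 0) == 3, , drop = FALSE]), decreasing = TRUE)
head(common_2grams, 25)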

Interesting Findings

The following are a few interesting items I discovered that will need to be addressed in my final predictive model. Some of these items can be addressed in initial pre-processing, some may require code that finds similar words (as in the case of misspellings), and others may not be addressed adequately at all.

  1. Several foreign words appear in the sample Blog data and, interestingly, they are mixed in with English words. This results in n-grams containing phrases such as “πνευματικον which means”.
  2. The sample Twitter data contains the string “” (which does not display as readable text) 61 times.
  3. 48 terms over 20 characters in length were used 2,147 times (1,432 times in the Twitter data: 67%, 454 times in the Blog data: 21%, and 261 times in the News data: 12%). These terms include URLs (e.g., wwwchrisandamandswonderyearswordpresscom), combined words that may represent hashtags (e.g., funniestthingiheardtoday), and non-words (e.g., plsplsplsplsplsplsplsplsplspls).
  4. Misspelled and/or abbreviated words, including aaaand, yourre, yrs, txt, spf, etc.
  5. Some non-text characters appear to have been incorporated into lines of the sample News data; several tokens of the form “\u0096” made it through my pre-processing. This resulted in a few instances of n-grams such as “\u0097 before adversity”.

Plans for Final Project

I am still developing my plans for my final project; however, I have started to scope out the basic process. The following is a very generalized overview of my planned approach.

Step 1: Develop a larger training set

I found that the tm and RWeka libraries have their limitations: the processing time and memory required to analyze more than 10% of the three datasets in a single corpus are unrealistic. As a result, I recently created a new process for developing a training set much larger than the 10% I analyzed above. First, I take 10% of the primary dataset as sub-sample #1. Second, I take another 10% sub-sample of the remaining data. Third, I take another 10% sub-sample of the data that remains after that, and so forth for 10 sub-samples. Each sub-sample will be its own corpus, which I will clean and pre-process independently. The final training set will equate to approximately 65% of each of the Blog, News, and Twitter data sets, versus just the 10% analyzed above.
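
As a rough sketch of this repeated sub-sampling for one dataset (the blog data is shown here; news and Twitter are handled the same way, and the output file names are illustrative):

# sketch: draw ten successive 10% sub-samples, removing each draw from the pool
# before the next one; after 10 draws this covers roughly 1 - 0.9^10 (about 65%) of the lines
set.seed(123)
pool <- readLines("~/Desktop/Personal/Education & Training/Coursera/Capstone/final/en_US/en_US.blogs.txt")

for (i in 1:10) {
  idx <- sample(length(pool), round(length(pool) * 0.10))   # 10% of what remains
  writeLines(pool[idx], paste0("blog.sample.", i, ".txt"))  # save sub-sample i
  pool <- pool[-idx]                                        # remove the drawn lines
}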

Step 2: Processing the multiple training sets

For each of the sub-sample corpora, I will perform data processing similar to the above: remove punctuation, remove numbers, convert to lowercase, etc. One thing to note is that I have already resolved items 1, 2, and 3 from the Interesting Findings section above. In my pre-processing above I created a function to remove the “/”, “@”, and “|” characters; however, I later found that using “[^[:alnum:]]” in my docs <- tm_map(full, toSpace, "[^[:alnum:]]") call resolved these issues. Each corpus will be processed individually.

Next, I will mine each corpus to develop 2-, 3-, 4-, and 5-gram lists. I will then combine the n-gram lists from the individual corpora, which creates an n-gram list for my entire training set. As an example, when I produced the 3-gram list above, I identified a total of 237,394 3-grams for my 10% training sample. However, when I developed 3-gram lists for my 10 training sub-samples and then combined them into one master list, I identified 2,026,622 unique 3-grams. This gives my algorithm far more predictive power.
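
A dplyr-based sketch of the combining step (assuming each sub-sample produced a data frame with hypothetical columns ngram and count, named trigram_list_1 through trigram_list_10; these names are placeholders):

# sketch: merge the 3-gram frequency tables from the ten sub-sample corpora
all_trigrams <- bind_rows(mget(paste0("trigram_list_", 1:10)))

master_trigrams <- all_trigrams %>%
  group_by(ngram) %>%
  summarise(count = sum(count)) %>%
  arrange(desc(count))

nrow(master_trigrams)   # number of unique 3-grams across all sub-samples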

Step 3: Creating a prediction algorithm

In my Shiny app I will have a text box where the user provides an input phrase, for instance “Johns Hopkins team, thanks for the great”. From this phrase I will take the final 2, 3, or 4 words (depending on the length of the user input) as my initial user phrase. My example phrase (“Johns Hopkins team, thanks for the great”) is 7 words. Since my longest n-gram will be a 5-gram, I would take the last four words (“thanks for the great”). I will then process this text to remove any punctuation, convert to lowercase, etc. I will then search for this phrase in my 5-gram list and identify the 5-gram with the highest probability (using the summation of log probabilities approach to minimize underflow) that begins with this phrase. Finally, I will select the fifth word in this 5-gram as the predicted word to display.
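
A simplified sketch of this lookup step (the ngram_5 data frame and its columns prefix, next_word, and log_prob are hypothetical placeholders for my eventual 5-gram table, not my final implementation):

# sketch: predict the next word from the last four words of the user's input
predict_word <- function(phrase, ngram_5) {
  # clean the input the same way the training data was cleaned
  phrase <- tolower(gsub("[^[:alnum:] ]", " ", phrase))
  words  <- unlist(strsplit(phrase, "\\s+"))
  words  <- words[words != ""]

  # keep only the last four words, since the longest n-gram is a 5-gram
  input_prefix <- paste(tail(words, 4), collapse = " ")

  # return the completion with the highest log probability, or NA if no match
  hits <- ngram_5[ngram_5$prefix == input_prefix, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$next_word[which.max(hits$log_prob)]
}

# example call with my sample phrase
predict_word("Johns Hopkins team, thanks for the great", ngram_5)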

Step 4: Issues to overcome

So far I have performed most of steps 1-3 with success. I used this process to successfully answer the Quiz 2 questions; however, there are several issues that I need to learn about and address in my algorithm. These issues include:

  • Misspellings: I need a process for identifying words similar to the user’s input in case they misspelled a word. This will likely leverage the agrep() function (see the sketch after this list).
  • Synonyms: I need a process for incorporating synonyms. If a user inputs a word that is not in my data set, such as “focus”, I will need to be able to substitute a synonym such as “target”, “center”, or “core” that can also be searched for in my data set.
  • No Matches: If the user inputs a phrase that has no matches (even after I incorporate a process that uses synonyms and approximate matches with agrep()), I will need a sufficient process in place to deal with this situation. This may include interacting with the user to request another, similar phrase or, if the user input a long phrase, taking n words from the beginning of the phrase and performing a search to try to identify relationships.
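
As a hedged illustration of the approximate-matching idea from the first bullet (the vocabulary and the distance setting below are hypothetical):

# sketch: use agrep() to find close matches for a possibly misspelled input word
vocabulary <- c("focus", "great", "thanks", "team", "center")  # stands in for my training vocabulary
user_word  <- "graet"                                          # a misspelling of "great"

# max.distance gives the number of edits tolerated; with 2, "great" should be returned
agrep(user_word, vocabulary, max.distance = 2, value = TRUE)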

Conclusion

As an analyst who has focused on numerical analyses for my entire career, I found text mining to be a brand-new analytic technique that I was unfamiliar with. I have spent a lot of time familiarizing myself with the basics, understanding the limitations of current text mining libraries, and learning to create new approaches to overcome some of these limitations. Although my work on this project may not be cutting edge, I am grateful to be able to add this new skill set to my analytic capabilities.