This report details the current status of my project. I have downloaded the corpora, loaded a subset of the data into a tm Corpus object, and done some basic exploratory analysis.
We are focusing on three downloaded files of written text, consisting primarily of US English. The text originates from three sources: Twitter, blogs, and news items.
Due to time constraints and repeated crashes on my system while processing this data, I have loaded only a very small subset of it into a tm Corpus object. I will provide basic statistics for the complete files, and more detailed statistics only for the subset that has been loaded into the Corpus.
Using bash commands, I was able to get line and word counts for the text files themselves. For example:
#bash code
wc -l en_US.news.txt    # line count
wc -w en_US.news.txt    # word count
Using the countLines() function from the R.utils package gives the same line counts. However, I was unable to find an R function that would produce a word count for these files without crashing my session.
#R code
library(R.utils)
countLines("en_US.news.txt")
countLines("en_US.blogs.txt")
countLines("en_US.twitter.txt")
I have loaded a small subset (1,000 Twitter documents, 250 blog items, and 250 news documents) into a tm Corpus object to process and analyze.
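A minimal sketch of this step is shown below; for simplicity it simply takes the first lines of each file, and the exact sampling details may differ from what I end up using.
#R code (sketch)
library(tm)
#read the subset sizes described above, declaring UTF-8 encoding
twitter <- readLines("en_US.twitter.txt", n = 1000, encoding = "UTF-8", warn = FALSE)
blogs   <- readLines("en_US.blogs.txt",   n = 250,  encoding = "UTF-8", warn = FALSE)
news    <- readLines("en_US.news.txt",    n = 250,  encoding = "UTF-8", warn = FALSE)
#build the tm Corpus from the combined character vector
docs <- Corpus(VectorSource(c(twitter, blogs, news)))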
I decided it was not necessary to run any stemming processes on this data (stemming would, for example, change “jumped” to “jump”). The same is true of removing stopwords, which would strip out common words like “to” and “of”. It seems to me that those transformations would be useful if we were trying to determine the meaning of the text. However, that is not what we are doing here. We are simply trying to find commonly uttered phrases so we can predict when somebody is about to use one of them; understanding their meaning is not important. I will have to think about this more, but that is my current position on the question.
Regarding the question of whether we would want our app to predict profane words, we may well not want it to. However, I decided the best course of action was to process and predict on all data as is. Stripping profanity out of the corpus would leave us with many incomplete sentences missing their main noun or verb, which would create strange n-grams. My plan is therefore to predict all words in my model, including profanity. If the model predicts a profane word, I will remove it, at the app level, from the results the user sees. Since profanity will be included in the model, this approach also makes it easy to offer the user an option to display profanity in their prediction results if they choose, and it allows us to change our definition of profanity without reprocessing the text.
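At the app level, this can be as simple as dropping blacklisted words from the prediction vector just before display. A minimal sketch, where profanityList and showProfanity are placeholders for whatever word list and UI option the app ends up using:
#R code (sketch)
filterPredictions <- function(predictions, profanityList, showProfanity = FALSE) {
  #if the user has opted in, show everything the model predicted
  if (showProfanity) return(predictions)
  #otherwise drop any prediction found in the (case-insensitive) blacklist
  predictions[!tolower(predictions) %in% tolower(profanityList)]
}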
Regarding non-UTF-8 characters: while using readLines to bring the data into a vector, I specified the encoding as UTF-8. However, some invalid characters still came through. My approach was to remove these characters using iconv.
iconv(a, "UTF-8", "UTF-8",sub='')
My first step was to break each document into sentences using a custom tm transformation function I created, which separates text on the characters .:;?!
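Conceptually, the transformation looks something like the sketch below; my actual function may differ in minor details, but the idea is the same.
#R code (sketch): split each document's content into sentences on . : ; ? !
library(tm)
splitSentences <- content_transformer(function(x) {
  s <- unlist(strsplit(x, "[.:;?!]+"))
  s[nchar(trimws(s)) > 0]    #drop empty fragments left by repeated punctuation
})
#applied to the corpus with: docs <- tm_map(docs, splitSentences)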
I then ran the RWeka NGramTokenizer on each sentence separately so that no n-grams would be created across sentence boundaries. N-grams spanning sentences would provide little if any predictive value, and this method was intended to avoid them.
From each sentence I created lists of unigrams, bigrams and trigrams. These lists were subsequently converted to all lower case.
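The tokenization step looks roughly like the following, where sentences stands for the character vector of sentences produced for a single document by the previous step; the per-document bookkeeping is omitted from this sketch.
#R code (sketch)
library(RWeka)
#tokenize each sentence on its own so no n-gram crosses a sentence boundary
unigrams <- lapply(sentences, function(s) tolower(NGramTokenizer(s, Weka_control(min = 1, max = 1))))
bigrams  <- lapply(sentences, function(s) tolower(NGramTokenizer(s, Weka_control(min = 2, max = 2))))
trigrams <- lapply(sentences, function(s) tolower(NGramTokenizer(s, Weka_control(min = 3, max = 3))))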
For Twitter data, I first used a regular expression (shown below) to replace emoticons with periods.
gsub("[:;8][-o']?[DPb)(\\]|[Dd)(\\][-o']?[:;8]", ".", a[1:1000])
This was done to catch cases where an emoticon was used in lieu of a sentence-ending character. For example:
lmfao!!!!! You outta control #Bestie!! I got a dinner/movie date for Friday nite:-) you know ill tell u all about it.
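For illustration, applying the same substitution to that tweet turns the “:-)” after “nite” into a period, so the sentence splitter can break the tweet there:
#R code (sketch)
tweet <- "lmfao!!!!! You outta control #Bestie!! I got a dinner/movie date for Friday nite:-) you know ill tell u all about it."
gsub("[:;8][-o']?[DPb)(\\]|[Dd)(\\][-o']?[:;8]", ".", tweet)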
Below is some analysis of the number of words and n-grams appearing per document. The histograms display the distribution for each n-gram type.
setwd("C:\\Users\\Ben\\SkyDrive\\Documents\\Certifications and training\\Data Science Specialization\\10 - Capstone\\Data\\Coursera-SwiftKey\\final\\en_US")
#Lists of lists: each inner list contains the n-grams for the corresponding
#document in the corpus.
load("Bigrams")
load("Trigrams")
load("Unigrams")
#count the n-grams in each document and plot one histogram per n-gram type
par(mfrow = c(3, 1))
hist(sapply(sapply(Trigrams, unlist), length), col="orange", xlab="Number of Trigrams", ylab="Number of Documents", main="Trigrams per Document")
hist(sapply(sapply(Bigrams, unlist), length), col="green", xlab="Number of Bigrams", ylab="Number of Documents", main="Bigrams per Document")
hist(sapply(sapply(Unigrams, unlist), length), col="blue", xlab="Number of Unigrams", ylab="Number of Documents", main="Words per Document")
Below is some analysis of the frequency of individual words and n-grams in the data. Word clouds are used to visualize the relative frequency of the top 25 n-grams of each type.
setwd("C:\\Users\\Ben\\SkyDrive\\Documents\\Certifications and training\\Data Science Specialization\\10 - Capstone\\Data\\Coursera-SwiftKey\\final\\en_US")
library(plyr)
#load previously calculated n-grams from disk. These were created by stacking all values
#from the lists of N-grams produced by each document into a single vector of n-grams.
load("StackedTrigrams")
load("StackedBigrams")
load("StackedUnigrams")
#convert to lower case and put into data frames for analysis
allTrigrams<-data.frame(Tri=tolower(StackedTrigrams))
allBigrams<-data.frame(Bi=tolower(StackedBigrams))
allUnigrams<-data.frame(Uni=tolower(StackedUnigrams))
#For each type of n-gram, create a count for each term, then order by frequency
GroupedWC <- ddply(allUnigrams, "Uni", summarise, thecount = length(Uni))
TopWords <- GroupedWC[order(GroupedWC[, 2], decreasing = TRUE), ]
GroupedBC <- ddply(allBigrams, "Bi", summarise, thecount = length(Bi))
TopBigrams <- GroupedBC[order(GroupedBC[, 2], decreasing = TRUE), ]
GroupedTC <- ddply(allTrigrams, "Tri", summarise, thecount = length(Tri))
TopTrigrams <- GroupedTC[order(GroupedTC[, 2], decreasing = TRUE), ]
library(wordcloud)
wordcloud(TopTrigrams$Tri[1:25], TopTrigrams$thecount[1:25], scale = c(5, .3), colors = "orange")
wordcloud(TopBigrams$Bi[1:25], TopBigrams$thecount[1:25], scale = c(7, .5), colors = "green")
wordcloud(TopWords$Uni[1:25], TopWords$thecount[1:25], scale = c(8, .6), colors = "blue")