In this project, we have analyzed large corpora of text data coming from a number of sources. This first milestone is a report on the analysis of the documents and their vocabulary.
In this exercise we have analyzed the three large corpora provided: twitter, news, and blogs. All have very different characteristics, including their most frequent words.
We have seen that the most frequent words are, not surprisingly, stop words. We removed those from the analysis, but will not remove them from the prediction model.
Further experiments showed that bi-grams seem to be the most prevalent n-grams. We will therefore build an HMM (Hidden Markov Model) prediction model based on the current word, predicting the next one.
Ideally, we could build one model for each of the datasets, as they are very different. However, for the sake of simplicity and coverage, we will build one universal model.
We will download the files from their location, unzip them, and collect basic statistics using the wc command.
# download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip','DSCapstone.zip')
# unzip('DSCapstone.zip')
# Working directory containing the unzipped English corpora
setwd('C:/Users/fragnet/Documents/final/en_US')
# Line and word counts for each file
system("wc -l -w *.txt", intern = TRUE)
## [1] " 899288 37272578 en_US.blogs.txt"
## [2] " 1010242 34309642 en_US.news.txt"
## [3] " 2360148 30341028 en_US.twitter.txt"
## [4] " 4269678 101923248 total"
The twitter dataset has the most entries (over 2.3 million) but contains the fewest words (about 30 million). The blogs and news files are somewhat larger, with longer entries and slightly more words (37 and 34 million respectively) spread over far fewer entries (around 900K blog posts and 1M news items).
Let us first load a number of required libraries, then open a file handle on the US twitter feed and read a sample of it (the file is far too large to process in full).
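The sampling chunk itself is not echoed in this report; a minimal sketch of that step, assuming a 1% random sample and an arbitrary seed (both are choices made here for illustration only), could look like the following.
con <- file("en_US.twitter.txt", "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# Keep roughly 1% of the lines; fraction and seed are arbitrary for this sketch
set.seed(1234)
sample <- twitter[rbinom(length(twitter), 1, 0.01) == 1]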
Let us now use tm and its associated libraries to create a corpus. We will apply some basic pre-processing to this corpus, such as removing numbers, punctuation, and extra whitespace.
library(tm)
library(SnowballC)
library(tau)
# Read the corpus
corpus <- Corpus(VectorSource(sample))
# Clean up using various tm primitives
corpus <- tm_map(corpus, content_transformer(tolower))
# corpus2 <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
corpus <- tm_map(corpus, stripWhitespace)
For more advanced processing, we will now create a custom content transformer to remove the most common profanity words, based on simple regular expression matching.
# Define a custom content transformer to remove common profanity words
profanity <- content_transformer(function(x) gsub("\\w*(fuck|bitch|shit)\\w*", "", x))
#Use custom content transformer to remove profanity words
corpus<- tm_map(corpus, profanity)
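As a quick illustration of what the regular expression does (the test string below is made up purely for this purpose), any token containing one of the flagged stems is dropped entirely:
# The same regular expression the transformer wraps, applied to a test string
gsub("\\w*(fuck|bitch|shit)\\w*", "", "this is some bullshit example")
Note that the removal can leave stray whitespace behind, so stripWhitespace could be re-applied after this step.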
Using tm, let us now create a Document Term Matrix from the full corpus, then find the 10 most frequent words in the corpus:
dtm <- DocumentTermMatrix(corpus)
# Drop extremely sparse terms to keep the matrix manageable
dtm <- removeSparseTerms(dtm, 0.9999)
matDtm <- as.matrix(dtm)
frequency <- colSums(matDtm)
frequency <- sort(frequency, decreasing = TRUE)
head(frequency, 10)
## the you and for that have with your are this
## 9292 5408 4361 3852 2357 1739 1730 1693 1636 1597
Let us now do the same, this time after removing English stop words:
corpus2 <- tm_map(corpus, removeWords, stopwords("en"))
dtmNoSW <- DocumentTermMatrix(corpus2)
dtmNoSW <- removeSparseTerms(dtmNoSW, 0.9999)
matDtmNoSW <- as.matrix(dtmNoSW)
frequency2 <- colSums(matDtmNoSW)
frequency2 <- sort(frequency2, decreasing = TRUE)
head(frequency2, 10)
## just like get love good will dont day can thanks
## 1557 1207 1070 1047 976 923 904 900 869 860
Clearly, the words remaining after removing English stop words are much more relevant and meaningful for analysis; otherwise stop words dominate the top of the list. Let us print a wordcloud of relevant keywords, as well as a graph of their frequencies.
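The plotting chunk is not echoed here; a sketch along the following lines, assuming an RColorBrewer palette stored in pal (the palette choice is an assumption), would produce the wordcloud that triggers the warning below, together with a simple frequency bar chart.
library(wordcloud)
library(RColorBrewer)
pal <- brewer.pal(8, "Dark2")   # assumed palette, reused for later plots
words <- names(frequency2)
wordcloud(words[1:100], frequency2[1:100], colors = pal)
# Bar chart of the 20 most frequent words (stop words removed)
barplot(frequency2[1:20], las = 2, col = pal[1],
        main = "Most frequent words (stop words removed)")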
## Warning in wordcloud(words[1:100], frequency2[1:100], colors = pal): just
## could not be fit on page. It will not be plotted.
Let us now plot the bigram and trigram distributions. We actually start with a tokenizer that accommodates both bigrams and trigrams.
library(RWeka)
library(tm)
library(wordcloud)
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# min = 2, max = 3: this tokenizer captures both bigrams and trigrams
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
tdm <- TermDocumentMatrix(corpus2, control = list(tokenize = trigramTokenizer))
tdm2 <- removeSparseTerms(tdm, 0.9995)
paste("dimensions of the resulting matrix:", dim(tdm2))
## [1] "dimensions of the resulting matrix: 273"
## [2] "dimensions of the resulting matrix: 23601"
matTdm2 <- as.matrix(tdm2)
frequency <- rowSums(matTdm2)
frequency <- sort(frequency, decreasing=TRUE)
words <- names(frequency)
wordcloud(words[1:50], frequency[1:50], colors = pal)
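The n-gram frequency distribution mentioned above can also be shown directly as a bar chart; this is a sketch rather than the original plotting code, and it reuses the pal palette assumed earlier.
# Bar chart of the 20 most frequent bi-/tri-grams
barplot(frequency[1:20], las = 2, cex.names = 0.7, col = pal[2],
        main = "Most frequent bigrams and trigrams")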
We see a few interesting patterns appear here as well. It looks like bi-grams will be a good basis for predicting the next word.
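As a rough illustration of how these counts could feed the planned model (a sketch only, not the final prediction code), the bigrams can be split into a simple "current word, most likely next word" lookup; the query word at the end is just a hypothetical example.
# Split each n-gram into tokens and keep only the true bigrams
parts <- strsplit(names(frequency), " ")
isBigram <- lengths(parts) == 2
current <- sapply(parts[isBigram], `[`, 1)
nextWord <- sapply(parts[isBigram], `[`, 2)
# frequency is sorted in decreasing order, so the first occurrence of each
# current word already carries its most frequent continuation
lookup <- nextWord[!duplicated(current)]
names(lookup) <- current[!duplicated(current)]
lookup["happy"]   # hypothetical query; NA if "happy" never starts a bigram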
We now repeat the same process for the news corpus to generate the corresponding wordcloud. It looks very different.
We repeat the same process for the blog corpus to generate the corresponding wordcloud. It looks very different again. A few strange characters appear; we will attempt to remove those in a future version of the content transformers.
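One possible approach, sketched below under the assumption that simply dropping non-ASCII characters is acceptable, is another content transformer based on iconv; this is not the final implementation.
# Possible future transformer: drop characters that do not convert to ASCII
removeOddChars <- content_transformer(function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = ""))
# corpus <- tm_map(corpus, removeOddChars)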