Introduction

This milestone report applies data science methods in the area of natural language processing. The following sections address the data extraction, cleaning and text mining of the so-called HC Corpora. The report is part of the Data Science Capstone project by Coursera and SwiftKey. The plots, code chunks and remarks walk the reader through the first steps towards building a word prediction application.

Data Processing

The data set consists of three US English text files: blogs, news articles and Twitter messages.

Loading The Dataset

fileURL <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, destfile = "Dataset.zip", method = "curl")
unzip("Dataset.zip")
unlink("Dataset.zip")  # remove the zip archive after extraction

Aggregating A Data Sample

In order to enable faster data processing, a random sample was drawn from each of the three sources and combined into a single aggregated sample.

set.seed(1234)  # fix the RNG seed so the sample is reproducible (seed value is arbitrary)
sampleTwitter <- twitter[sample(1:length(twitter), 10000)]
sampleNews <- news[sample(1:length(news), 10000)]
sampleBlogs <- blogs[sample(1:length(blogs), 10000)]
textSample <- c(sampleTwitter, sampleNews, sampleBlogs)
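
The summary table in the next section also lists the aggregated sample as a file on disk; a minimal sketch of writing it out (the file name is an assumption):

# Persist the aggregated sample so its file size can be reported below
writeLines(textSample, "./textSample.txt")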

Summary Statistics

The following table provides an overview of the imported data. In addition to the size of each data set, the line and word counts are displayed.

File Name           File Size (MB)   Line Count   Word Count
Blogs                       200.42      899,288   37,334,147
News                        196.28    1,010,242   34,372,530
Twitter                     159.36    2,360,148   30,373,603
Aggregated Sample             2.42       15,000       15,000
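
The counts in the table can be reproduced with the stringi package; a minimal sketch, assuming the file paths used above (the exact method behind the original table is an assumption):

library(stringi)

# File sizes in megabytes
file.info("final/en_US/en_US.blogs.txt")$size / 1024^2

# Line counts
length(blogs); length(news); length(twitter)

# Word counts
sum(stri_count_words(blogs)); sum(stri_count_words(news)); sum(stri_count_words(twitter))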

A word cloud provides a quick first overview of the word frequencies. The word cloud below displays the data of the aggregated sample file.

library(tm)         # TermDocumentMatrix
library(wordcloud)  # wordcloud(); also attaches RColorBrewer for brewer.pal()

# finalCorpus is the cleaned corpus built in the next section
trigramTDM <- TermDocumentMatrix(finalCorpus)
wcloud <- as.matrix(trigramTDM)
# Sum the term frequencies across all documents and sort them
v <- sort(rowSums(wcloud), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
wordcloud(d$word, d$freq,
          scale = c(5, .3), min.freq = 50,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

Building A Clean Text Corpus

The sample data is cleaned with the tm package. Cleaning means that the text is converted to lower case and that punctuation, numbers and URLs are removed. In addition, English stop words and profanity are filtered out and the remaining words are stemmed. The result is a clean text corpus that allows easy subsequent processing.

The profanity words used for the filtering can be inspected in this GitHub repository.
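
The cleaning steps below assume that profanityWords is a character vector holding this list; a minimal sketch of loading it (the local file name is an assumption):

# Load the profanity list referenced above; one word per line
profanityWords <- readLines("./profanityfilter.txt", encoding = "UTF-8", skipNul = TRUE)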

# Build a corpus from the aggregated text sample
cleanSample <- Corpus(VectorSource(textSample))

## Make it work with the new tm package
cleanSample <- tm_map(cleanSample, content_transformer(function(x) iconv(x, to = "UTF-8", sub = "byte")),
                      mc.cores = 2)
cleanSample <- tm_map(cleanSample, content_transformer(tolower), lazy = TRUE)
cleanSample <- tm_map(cleanSample, content_transformer(removePunctuation))
cleanSample <- tm_map(cleanSample, content_transformer(removeNumbers))
# Strip URLs starting with "http"
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
cleanSample <- tm_map(cleanSample, content_transformer(removeURL))
cleanSample <- tm_map(cleanSample, stripWhitespace)
cleanSample <- tm_map(cleanSample, removeWords, stopwords("english"))
cleanSample <- tm_map(cleanSample, removeWords, profanityWords)  # profanityWords: vector loaded above
cleanSample <- tm_map(cleanSample, stemDocument)
cleanSample <- tm_map(cleanSample, stripWhitespace)
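
The word cloud chunk above refers to the cleaned corpus as finalCorpus; a minimal bridging step (saving the corpus to disk is an assumption):

finalCorpus <- cleanSample
saveRDS(finalCorpus, "./finalCorpus.RDS")  # reused for the word cloud and the tokenization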

The N-Gram Tokenization

In Natural Language Processing (NLP) an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

The following function is used to extract 1-grams, 2-grams and 3-grams from the cleaned text corpus.

library(RWeka)  # provides NGramTokenizer and Weka_control

# Tokenize the text into n-grams of length ngramCount and return the
# ten most frequent ones as a data frame with columns String and Count
ngramTokenizer <- function(theCorpus, ngramCount) {
        ngramFunction <- NGramTokenizer(theCorpus,
                                        Weka_control(min = ngramCount, max = ngramCount,
                                                     delimiters = " \\r\\n\\t.,;:\"()?!"))
        ngramFunction <- data.frame(table(ngramFunction))
        ngramFunction <- ngramFunction[order(ngramFunction$Freq,
                                             decreasing = TRUE), ][1:10, ]
        colnames(ngramFunction) <- c("String", "Count")
        ngramFunction
}

Applying the tokenizer function to the corpus yields the distributions of the top 10 words and word combinations shown below. Unigrams are single words, bigrams are two-word combinations and trigrams are three-word combinations.
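
The chart chunks below read the n-gram tables from RDS files. A minimal sketch of how these files could be produced with the tokenizer above (converting the corpus to a plain character vector first is an assumed helper step):

# Convert the cleaned tm corpus to a plain character vector
corpusText <- sapply(finalCorpus, function(doc) paste(content(doc), collapse = " "))

# Extract the top-10 n-gram tables and store them for the report
unigram <- ngramTokenizer(corpusText, 1)
saveRDS(unigram, "./unigram.RDS")
bigram <- ngramTokenizer(corpusText, 2)
saveRDS(bigram, "./bigram.RDS")
trigram <- ngramTokenizer(corpusText, 3)
saveRDS(trigram, "./trigram.RDS")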

Top Unigrams

library(googleVis)  # interactive column charts rendered by print(..., "chart")

unigram <- readRDS("./unigram.RDS")
unigramPlot <- gvisColumnChart(unigram, "String", "Count",
                               options = list(legend = "none"))

print(unigramPlot, "chart")

Top Bigrams

bigram <- readRDS("./bigram.RDS")
bigramPlot <- gvisColumnChart(bigram, "String", "Count",
                              options = list(legend = "none"))

print(bigramPlot, "chart")

Top Trigrams

trigram <- readRDS("./trigram.RDS")
trigramPlot <- gvisColumnChart(trigram, "String", "Count",
                               options = list(legend = "none"))

print(trigramPlot, "chart")

Interesting Findings

Next Steps For The Prediction Application

As already noted, the next step of the capstone project is to build a prediction application. To create a smooth and fast application it is necessary to build a fast prediction algorithm, which also means finding ways to process larger data sets more efficiently. In addition, increasing the value of n for the n-gram tokenization should improve prediction accuracy. Finally, a Shiny application will be created that predicts the next word a user wants to write.
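
As an illustration of the direction, here is a minimal sketch of a next-word lookup based on the trigram table produced above (predictNextWord is a hypothetical helper, not the final algorithm):

# Hypothetical lookup: return the last word of the most frequent trigram
# that starts with the user's last two words
predictNextWord <- function(lastTwoWords, trigramTable) {
        pattern <- paste0("^", lastTwoWords, " ")
        matches <- trigramTable[grepl(pattern, trigramTable$String), ]
        if (nrow(matches) == 0) return(NA_character_)
        best <- as.character(matches$String[which.max(matches$Count)])
        tail(strsplit(best, " ")[[1]], 1)
}

predictNextWord("thanks for", trigram)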

All Code Scripts Used

All code snippets used to generate this report can be viewed in this repository.

Session Information

sessionInfo()
## R version 3.1.3 (2015-03-09)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.2 (Yosemite)
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] googleVis_0.5.8        stringi_0.4-1          DT_0.0.8              
##  [4] stringr_0.6.2          wordcloud_2.5          rJava_0.9-6           
##  [7] RWeka_0.4-24           slam_0.1-32            SnowballC_0.5.1       
## [10] tm_0.6                 NLP_0.1-6              qdap_2.2.1            
## [13] RColorBrewer_1.1-2     qdapTools_1.1.0        qdapRegex_0.2.1       
## [16] qdapDictionaries_1.0.3 RWekajars_3.7.12-1    
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1      bitops_1.0-6        chron_2.3-45       
##  [4] colorspace_1.2-6    data.table_1.9.4    DBI_0.3.1          
##  [7] devtools_1.7.0      digest_0.6.8        dplyr_0.4.1        
## [10] evaluate_0.5.5      formatR_1.0         gdata_2.13.3       
## [13] gender_0.4.3        ggplot2_1.0.1       grid_3.1.3         
## [16] gridExtra_0.9.1     gtable_0.1.2        gtools_3.4.1       
## [19] htmltools_0.2.8     htmlwidgets_0.3.2   httr_0.6.1.9000    
## [22] igraph_0.7.1        jsonlite_0.9.15     knitr_1.9          
## [25] magrittr_1.5        MASS_7.3-40         munsell_0.4.2      
## [28] openNLP_0.2-4       openNLPdata_1.5.3-1 parallel_3.1.3     
## [31] plotrix_3.5-11      plyr_1.8.1          proto_0.3-10       
## [34] Rcpp_0.11.5         RCurl_1.95-4.5      reports_0.1.4      
## [37] reshape2_1.4.1      RJSONIO_1.3-0       rmarkdown_0.5.1    
## [40] rstudioapi_0.2      scales_0.2.4        tools_3.1.3        
## [43] venneuler_1.1-0     xlsx_0.5.7          xlsxjars_0.6.1     
## [46] XML_3.98-1.1        yaml_2.1.13