This milestone report applies data science methods in the area of natural language processing. The following sections address the extraction, cleaning, and text mining of the so-called HC Corpora. The report is part of the Data Science Capstone project run by Coursera and SwiftKey. The plots, code chunks, and remarks walk the reader through the first steps towards building a prediction application.
The data set consists of three files in US English.
fileURL <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, destfile = "Dataset.zip", method = "curl")
unzip("Dataset.zip")
unlink("Dataset.zip")  # remove the zip archive once it has been extracted
In order to enable faster data processing, a data sample from all three sources was generated.
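The sample is drawn randomly, so the exact figures reported below vary from run to run; fixing a seed first (not part of the original chunk) would make the report reproducible.

set.seed(1234)  # illustrative value; any fixed seed makes the sample reproducible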
sampleTwitter <- twitter[sample(1:length(twitter),10000)]
sampleNews <- news[sample(1:length(news),10000)]
sampleBlogs <- blogs[sample(1:length(blogs),10000)]
textSample <- c(sampleTwitter,sampleNews,sampleBlogs)
The following table provides an overview of the imported data. In addition to the size of each data set, the number of lines and words are displayed.
| File Name | File Size (MB) | Line Count | Word Count |
|---|---|---|---|
| Blogs | 200.42 | 899288 | 37334147 |
| News | 196.28 | 1010242 | 34372530 |
| Twitter | 159.36 | 2360148 | 30373603 |
| Aggregated Sample | 2.42 | 15000 | 15000 |
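The figures in the table were not computed in the chunks shown above; a sketch of how they can be reproduced, assuming the file paths from the reading step and the stringi package for word counts:

library(stringi)

# File sizes in megabytes
file.info(c("final/en_US/en_US.blogs.txt",
            "final/en_US/en_US.news.txt",
            "final/en_US/en_US.twitter.txt"))$size / 1024^2

# Line and word counts per source
sapply(list(blogs = blogs, news = news, twitter = twitter),
       function(x) c(lines = length(x), words = sum(stri_count_words(x))))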
A word cloud provides a quick first overview of the word frequencies. The word cloud below is built from the aggregated sample.
trigramTDM <- TermDocumentMatrix(finalCorpus)
wcloud <- as.matrix(trigramTDM)
v <- sort(rowSums(wcloud), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
wordcloud(d$word, d$freq,
          scale = c(5, .3), min.freq = 50,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
The sample data is cleaned with the tm package. Cleaning here means that the text is converted to lower case and that punctuation, numbers, and URLs are removed. In addition, stop words and profanity are erased from the text sample. The result is a clean text corpus that allows easy subsequent processing.
The profanity word list used can be inspected in this GitHub repository.
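The cleaning chunk below operates on a cleanSample corpus and a profanityWords vector, neither of which is created in the chunks shown. A minimal sketch, assuming the corpus is built from the aggregated sample and the profanity list is a plain text file with one term per line (the file name is hypothetical):

library(tm)

# Build a corpus from the aggregated text sample
cleanSample <- VCorpus(VectorSource(textSample))

# Load the profanity list, one term per line (file name is an assumption)
profanityWords <- readLines("profanity-words.txt", encoding = "UTF-8", skipNul = TRUE)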
## Make it work with the new tm package
# Re-encode to UTF-8 so malformed characters do not break later transformations
cleanSample <- tm_map(cleanSample, content_transformer(function(x) iconv(x, to = "UTF-8", sub = "byte")),
                      mc.cores = 2)
# Convert to lower case and remove punctuation and numbers
cleanSample <- tm_map(cleanSample, content_transformer(tolower), lazy = TRUE)
cleanSample <- tm_map(cleanSample, content_transformer(removePunctuation))
cleanSample <- tm_map(cleanSample, content_transformer(removeNumbers))
# Strip URLs and superfluous whitespace
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
cleanSample <- tm_map(cleanSample, content_transformer(removeURL))
cleanSample <- tm_map(cleanSample, stripWhitespace)
# Remove stop words and profanity, stem, and collapse whitespace again
cleanSample <- tm_map(cleanSample, removeWords, stopwords("english"))
cleanSample <- tm_map(cleanSample, removeWords, profanityWords)
cleanSample <- tm_map(cleanSample, stemDocument)
cleanSample <- tm_map(cleanSample, stripWhitespace)
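The word cloud chunk above references a finalCorpus object that does not appear in the chunks shown; presumably it is simply the cleaned corpus, e.g.:

finalCorpus <- cleanSample  # assumption: the word cloud is built from the cleaned sample corpus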
In Natural Language Processing (NLP) an n-gram is a contiguous sequence of n items from a given sequence of text or speech. For example, the trigrams of "thanks for the follow" are "thanks for the" and "for the follow". The following function is used to extract 1-grams, 2-grams, and 3-grams from the cleaned text corpus.
# Tokenize the text into n-grams of the given length and keep the ten most frequent ones
ngramTokenizer <- function(theCorpus, ngramCount) {
  ngramFunction <- NGramTokenizer(theCorpus,
                                  Weka_control(min = ngramCount, max = ngramCount,
                                               delimiters = " \\r\\n\\t.,;:\"()?!"))
  ngramFunction <- data.frame(table(ngramFunction))
  ngramFunction <- ngramFunction[order(ngramFunction$Freq, decreasing = TRUE), ][1:10, ]
  colnames(ngramFunction) <- c("String", "Count")
  ngramFunction
}
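The plotting chunks below read pre-computed top-10 tables from .RDS files. A sketch of how they could have been produced with the tokenizer above (the corpusText helper object is an assumption):

# Collapse the cleaned corpus into a plain character vector for RWeka
corpusText <- sapply(seq_along(cleanSample),
                     function(i) paste(content(cleanSample[[i]]), collapse = " "))

# Extract and store the top-10 unigrams, bigrams and trigrams
unigram <- ngramTokenizer(corpusText, 1)
saveRDS(unigram, "./unigram.RDS")
bigram <- ngramTokenizer(corpusText, 2)
saveRDS(bigram, "./bigram.RDS")
trigram <- ngramTokenizer(corpusText, 3)
saveRDS(trigram, "./trigram.RDS")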
Applying the tokenizer function to the cleaned corpus yields the top 10 words and word combinations shown below. Unigrams are single words, bigrams are two-word combinations, and trigrams are three-word combinations.
unigram <- readRDS("./unigram.RDS")
unigramPlot <- gvisColumnChart(unigram, "String", "Count",
options=list(legend="none"))
print(unigramPlot, "chart")
bigram <- readRDS("./bigram.RDS")
bigramPlot <- gvisColumnChart(bigram, "String", "Count",
options=list(legend="none"))
print(bigramPlot, "chart")
trigram <- readRDS("./trigram.RDS")
trigramPlot <- gvisColumnChart(trigram, "String", "Count",
options=list(legend="none"))
print(trigramPlot, "chart")
Loading the data set takes a long time, and processing is slow because of the large file sizes. To avoid excessive runtimes, it was necessary to work with a data sample for text mining and tokenization. Needless to say, this workaround decreases the accuracy of the subsequent predictions.
Removing all stop words from the corpus is a common recommendation, but stop words are, of course, a fundamental part of the language. Consideration should therefore be given to re-including them in the prediction application.
The text mining algorithm still needs some fine-tuning. As seen in the chart of the top trigrams, some words are severely curtailed by stemming; for example, the second most common trigram is "presid barack obama" instead of "president barack obama".
As already noted, the next step of the capstone project will be to create a prediction application. To deliver a smooth and fast application, it is absolutely necessary to build a fast prediction algorithm. This also means that I need to find ways to process larger data sets more quickly. In addition, increasing the value of n for the n-gram tokenization should improve the prediction accuracy. All in all, a Shiny application will be created that predicts the next word a user wants to write.
All code snippets used to generate this report can be viewed in this repository.
sessionInfo()
## R version 3.1.3 (2015-03-09)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.2 (Yosemite)
##
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] googleVis_0.5.8 stringi_0.4-1 DT_0.0.8
## [4] stringr_0.6.2 wordcloud_2.5 rJava_0.9-6
## [7] RWeka_0.4-24 slam_0.1-32 SnowballC_0.5.1
## [10] tm_0.6 NLP_0.1-6 qdap_2.2.1
## [13] RColorBrewer_1.1-2 qdapTools_1.1.0 qdapRegex_0.2.1
## [16] qdapDictionaries_1.0.3 RWekajars_3.7.12-1
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 bitops_1.0-6 chron_2.3-45
## [4] colorspace_1.2-6 data.table_1.9.4 DBI_0.3.1
## [7] devtools_1.7.0 digest_0.6.8 dplyr_0.4.1
## [10] evaluate_0.5.5 formatR_1.0 gdata_2.13.3
## [13] gender_0.4.3 ggplot2_1.0.1 grid_3.1.3
## [16] gridExtra_0.9.1 gtable_0.1.2 gtools_3.4.1
## [19] htmltools_0.2.8 htmlwidgets_0.3.2 httr_0.6.1.9000
## [22] igraph_0.7.1 jsonlite_0.9.15 knitr_1.9
## [25] magrittr_1.5 MASS_7.3-40 munsell_0.4.2
## [28] openNLP_0.2-4 openNLPdata_1.5.3-1 parallel_3.1.3
## [31] plotrix_3.5-11 plyr_1.8.1 proto_0.3-10
## [34] Rcpp_0.11.5 RCurl_1.95-4.5 reports_0.1.4
## [37] reshape2_1.4.1 RJSONIO_1.3-0 rmarkdown_0.5.1
## [40] rstudioapi_0.2 scales_0.2.4 tools_3.1.3
## [43] venneuler_1.1-0 xlsx_0.5.7 xlsxjars_0.6.1
## [46] XML_3.98-1.1 yaml_2.1.13