Introduction

This is a milestone report of the Data Science Specialization SwiftKey Capstone. The aim of this project is to apply data science in the area of natural language processing using a data-set from a corpus called HC Corpora [1]. The goal is to create a data product: a predictive text model, usable from a smart keyboard, that can predict the next word the user will type.

The basic goals of this milestone report, done with R [2], are to load and clean the data that will later be used to create a predictive model, to do some basic exploratory analysis of the data, to report any interesting findings, and to get feedback on our plans for creating a prediction algorithm and Shiny app. This report is an R Markdown document that can be processed by knitr [3-5]. It contains a summary, 7 figures and 2 tables; so that it can be appreciated by a non-data-scientist manager, the more complex R code is only available in the appendix.

Summary

In this report we learned how to clean the data-set to select a final list of words and to create n-grams. For a Shiny web application, the key point is speed, so we will have to optimize all steps by using optimal R tools, sub-sampling the full data-set, or storing the final cleaned data-sets in local files.

Exploratory Data Analysis

Loading of the data-sets

The data-set comes from a corpus called HC Corpora [1] and can be downloaded from the Coursera link Coursera-SwiftKey (1.41 GB) using the R code of appendix I.

This is the training data-set for the Data Science Specialization SwiftKey Capstone project. Some documentation on the corpora is available in its Readme; the files have been language-filtered but may still contain some foreign text.

The data-set from [1] can be read as follows:

# choose the category of files to read : en_US
category<-"/en_US"
lines=10000 # for all lines use -1L
# read the corresponding 3 files
blogs_txt   <- readLines(paste("./final",category,category,".blogs.txt",sep=""), skipNul=TRUE, n =lines)
news_txt    <- readLines(paste("./final",category,category,".news.txt",sep=""), skipNul=TRUE, n =lines)
twitter_txt <- readLines(paste("./final",category,category,".twitter.txt",sep=""), skipNul=TRUE, n =lines)

Basic summary statistics of the 3 files

The function presented in appendix II provides a basic summary of a file: file size, number of lines, number of non-empty lines and number of words.

Table I presents a basic summary of the 3 data-sets contained in the English en_US folder:

Table I: Basic summary of the files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt
File Name File size[MB] Number of lines Number of non empty lines Number of words
en_US.blogs.txt 200.4242 10000 10000 410620
en_US.news.txt 196.2775 10000 10000 343929
en_US.twitter.txt 159.3641 10000 10000 127674

The full data-sets en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt contain respectively about 0.9 million, 1 million and 2.4 million lines of text, which is substantial and needs to be taken into account in the design of the application. For this exploration we took a sub-set of each file (the first 10,000 lines), but later we can either optimize the R code to run on the full data-set or randomly select unbiased samples from the full data-set, as sketched below.
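
A minimal sketch of such a random sub-sampling (the 5% sampling fraction and the function name sample_lines are illustrative assumptions, not final choices):

# randomly keep about 5% of the lines of each file, with a fixed seed for reproducibility
set.seed(1234)
sample_fraction <- 0.05
sample_lines <- function(x, fraction) {
        x[rbinom(length(x), size = 1, prob = fraction) == 1]
}
blogs_sample   <- sample_lines(blogs_txt, sample_fraction)
news_sample    <- sample_lines(news_txt, sample_fraction)
twitter_sample <- sample_lines(twitter_txt, sample_fraction)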

Basic summary statistics of the word count distribution

We will now look at the distribution of the number of words per line for our 3 data-sets, as illustrated in Figure 1, Figure 2 and Figure 3 below, produced with ggplot2 [6].
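
A minimal sketch of how one of these histograms can be produced, assuming the word counts are computed with stringi [7] (the bin width of 5 is chosen for illustration only):

library(ggplot2)
library(stringi)
# histogram of the number of words per line for the blogs data-set
word_counts <- data.frame(words = stri_count_words(blogs_txt))
ggplot(word_counts, aes(x = words)) +
        geom_histogram(binwidth = 5) +
        labs(title = "Number of words per line in en_US.blogs.txt",
             x = "Words per line", y = "Number of lines")
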
Figure 1: Number of words per line in en_US.blogs.txt

Figure 2: Number of words per line in en_US.news.txt

Figure 3: Number of words per line in en_US.twitter.txt

The function presented in appendix III provides basic summary statistics of the distribution of the number of words per line, computed with stringi [7]: minimum, 1st quartile, median, mean, 3rd quartile and maximum.

Table II presents these basic statistics of the word count per line for the 3 data-sets:

Table II: Basic statistics of the distribution of the number of words per line for the files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt
File Name Minimum 1st Quartile Median Mean 3rd Quartile Maximum
en_US.blogs.txt 1 9 28 41.28 59 681
en_US.news.txt 1 19 32 34.81 46 302
en_US.twitter.txt 1 7 12 12.65 18 34

Let us have a look at some examples of lines containing 0, 1, 2 and 3 words, as presented below:

blogs_txt[stri_count_words(blogs_txt)==0]
## character(0)
blogs_txt[stri_count_words(blogs_txt)==1][4:8]
## [1] "~ heehee ~" "Sickos"     "Jerry"      "Camila"     "Margaret"
blogs_txt[stri_count_words(blogs_txt)==2][2:4]
## [1] "You die?"            "Furioso dreadnought" "Stella Telleria"
blogs_txt[stri_count_words(blogs_txt)==3][2:4]
## [1] "M. Blakeman Ingle" "1 Beauty Contest"  "Just go away."

As expected, the summaries and figures show that:

  • en_US.blogs.txt contains the highest number of words per line.

  • en_US.twitter.txt has the fewest words per line but the highest number of lines.

  • en_US.news.txt falls between the other two files.

As we can see, the number of words per line ranges from 1 to 681, and as shown above some important data cleaning is needed to remove punctuation, numbers, etc.

Cleaning of the data-sets

We will now combine the 3 data-sets into a single one and perform an important clean-up using predefined transformations, regular expressions [8-9], a profanity word list [10] and the R code presented in appendix IV.

We can now look at the data-set after the clean-up using a “word cloud”, which gives greater prominence to words that appear more frequently in the data-set, produced with the wordcloud package [11]. The results seem pretty good; some further exploration shows that the only remaining issue is some aggregated words like “capitalismwith”, probably already present in the original data-set.
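
A minimal sketch of how such a word cloud can be produced with wordcloud [11], assuming the cleaned corpus clean_txt_corpus from appendix IV (the sparsity threshold and the cap of 100 words are illustrative):

library(tm)
library(wordcloud)
# build a term-document matrix from the cleaned corpus and plot the most frequent terms
tdm <- TermDocumentMatrix(clean_txt_corpus)
tdm <- removeSparseTerms(tdm, 0.999)          # drop very rare terms to keep the matrix small
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(words = names(term_freq), freq = term_freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))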

Figure 4: Word cloud after the clean-up of the data-set

The N-Gram analysis

In Natural Language Processing (NLP) [12-13] an n-gram is a contiguous sequence of n items from a given sequence of text or speech. We will extract uni-grams (1-grams), bi-grams (2-grams) and tri-grams (3-grams) from the cleaned text corpus with the functions presented in appendix V.

Uni-gram

Figure 5 below shows the top 10 most frequent words occurring in the data-set. We can see that this is consistent with the “word cloud” of Figure 4.
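
Figures 5 to 7 were produced with ggplot2 [6] from the frequency tables built in appendix V; a minimal sketch for the uni-grams (the same pattern applies to the bi-grams and tri-grams) is:

library(ggplot2)
# plot the 10 most frequent uni-grams from the frequency table built in appendix V
top_unigram <- head(unigram, 10)
ggplot(top_unigram, aes(x = reorder(Term, Freq), y = Freq)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        labs(title = "Top 10 most frequent uni-grams", x = "Uni-gram", y = "Frequency")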

Figure 5: The top 10 most frequent uni-grams

Bi-gram

Figure 6 below shows the top 10 most frequent bi-grams occurring in the data-set. The result seems reasonable and depends strongly on the data-set used.

Figure 6: The top 10 most frequent bi-grams

Tri-gram

Figure 7 below shows the top 10 most frequent tri-grams occurring in the data-set. As for the bi-grams, the result seems reasonable and depends strongly on the data-set used.

Figure 7: The top 10 most frequent tri-grams

Interesting findings

We summarize below some interesting findings that seem to be key points for the future application:

We learned a lot, and more work is still needed since there are many options and we should try to find the optimal one. For a Shiny web application, the key point is speed, so we will have to optimize all steps: use optimal R tools to run on the full data-set, randomly select unbiased samples from the full data-set, or store the cleaned data in a local file, since this step will be identical as long as the training data-set stays the same. A sketch of such caching is given below.
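
A minimal sketch of such local caching of the n-gram frequency tables from appendix V (the .rds file names are illustrative):

# save the n-gram frequency tables once after cleaning ...
saveRDS(unigram, "unigram.rds")
saveRDS(bigram,  "bigram.rds")
saveRDS(trigram, "trigram.rds")

# ... and reload them quickly in the Shiny application instead of re-cleaning the raw text
unigram <- readRDS("unigram.rds")
bigram  <- readRDS("bigram.rds")
trigram <- readRDS("trigram.rds")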

Next steps for the Shiny prediction application

The final prediction application will be implemented using Shiny. The predictive text model will build on the work presented here, but we will have to select the optimal approach. The next steps for this project will be to optimize the current code, build and evaluate a predictive model based on the n-gram tables, and wrap it in a Shiny application.
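
For the prediction model itself, one possible (not yet final) approach is a simple frequency-based back-off lookup on the n-gram tables from appendix V: look for the last two words among the tri-grams, fall back to the bi-grams, and finally to the most frequent uni-gram. A minimal sketch, assuming the input phrase has been cleaned the same way as the corpus and with all function names illustrative:

# naive back-off next-word prediction using the n-gram frequency tables from appendix V
predict_next_word <- function(phrase, unigram, bigram, trigram) {
        words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
        # try the tri-grams first: match the last two words of the phrase
        if (length(words) == 2) {
                hits <- trigram[grepl(paste0("^", words[1], " ", words[2], " "), trigram$Term), ]
                if (nrow(hits) > 0) return(strsplit(as.character(hits$Term[1]), " ")[[1]][3])
        }
        # back off to the bi-grams: match the last word only
        hits <- bigram[grepl(paste0("^", tail(words, 1), " "), bigram$Term), ]
        if (nrow(hits) > 0) return(strsplit(as.character(hits$Term[1]), " ")[[1]][2])
        # last resort: return the overall most frequent uni-gram
        as.character(unigram$Term[1])
}

# example call (the input string is purely illustrative)
# predict_next_word("thank you very", unigram, bigram, trigram)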

Conclusion

In conclusion, we learned how to clean the data-set to select a final list of words and to create n-grams. For a Shiny web application, the key point is speed, so we will have to optimize all steps by using optimal R tools, sub-sampling the full data-set, or storing the final cleaned data-sets in local files. There are many other ways to do this project. We will now optimize the current code and learn how to build a prediction model. For experts, the list of libraries and the session information can be found in appendix VI and appendix VII.

References

[1] HC Corpora. www.corpora.heliohost.org/.

[2] R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. www.R-project.org/.

[3] Yihui Xie (2015). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.11.

[4] Yihui Xie (2015). Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963.

[5] Yihui Xie (2014). knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595.

[6] Package ‘ggplot2’. http://ggplot2.org/.

[7] Package ‘stringi’. https://cran.r-project.org/web/packages/stringi/index.html.

[8] Ingo Feinerer, Kurt Hornik, David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software.

[9] CRAN Task View: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html.

[10] List of profanity words. https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.

[11] Package ‘wordcloud’. http://blog.fellstat.com/?cat=11.

[12] Natural language processing (Wikipedia). https://en.wikipedia.org/wiki/Natural_language_processing/.

[13] N-gram (Wikipedia). https://en.wikipedia.org/wiki/N-gram/.

Appendix

Appendix I: Download the file

# the directory with the data files should be in working directory
directory_present<-(file.exists("final"))

# if not already present, download and unzip the data-set
if (!directory_present) {
        # download the file from the Coursera website
        fileURL <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
        download.file(fileURL, destfile = "file.zip", method = "curl")
        # unzip the zip file
        unzip("file.zip")
}

Appendix II: Compute basic summary statistics of a file

# compute the size of the file and the number of line of a file
df <- data.frame(Name=character(), Size=numeric(), Line=integer(),
                 LineNempty=integer(), Word=integer(),stringsAsFactors = FALSE)
summaryFileInfo <- function(file_txt,variable_name_txt,df){
        path_file<-paste("./final",category,category,".",gsub("[_]",".",variable_name_txt),sep="")
        name_file<-paste(gsub("[/]","",category),".",gsub("[_]",".",variable_name_txt),sep="")
        size_file<-file.info(path_file)$size/1024^2;line_file<-length(file_txt)
        line_fileNempty<-stri_stats_general(file_txt)[["LinesNEmpty"]]
        word_file<-sum(sapply(gregexpr("\\S+", file_txt), length))
        df<-rbind(df,data.frame(name_file,size_file,line_file,line_fileNempty,word_file))
        return (df)
}
df<-summaryFileInfo(blogs_txt,"blogs_txt",df)
df<-summaryFileInfo(news_txt,"news_txt",df)
df<-summaryFileInfo(twitter_txt,"twitter_txt",df)
colnames(df)<-c("File Name","File size[MB]","Number of lines","Number of non empty lines","Number of words")

Appendix III: Compute basic statistics of the distribution of the number of words per line

df_summary <- data.frame(Name=character(),Min=double(),FQ=double(),Med=double(),
                         Mean=double(),TQ=double(),Max=double(),stringsAsFactors = FALSE)
summaryWord <- function(file_txt,variable_name_txt,df_summary){
        name_file<-paste(gsub("[/]","",category),".",gsub("[_]",".",variable_name_txt),sep="")
        word_stats<-summary(stri_count_words(file_txt))   # compute the six summary statistics once
        min_word<-word_stats[[1]]
        fq_word<-word_stats[[2]]
        med_word<-word_stats[[3]]
        mean_word<-word_stats[[4]]
        tq_word<-word_stats[[5]]
        max_word<-word_stats[[6]]
        df_summary<-rbind(df_summary,data.frame(name_file,min_word,fq_word,med_word,mean_word,tq_word,max_word))
        return (df_summary)
}
df_summary<-summaryWord(blogs_txt,"blogs_txt",df_summary)
df_summary<-summaryWord(news_txt,"news_txt",df_summary)
df_summary<-summaryWord(twitter_txt,"twitter_txt",df_summary)
colnames(df_summary)<-c("File Name","Minimum","1st Quartile","Median","Mean","3rd Quartile","Maximum")

Appendix IV: Clean-up of the data-set

# the directory with the profanity word list should be in the working directory
another_directory_present<-(file.exists("List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words-master"))

# if not already present, download and unzip the profanity word list [10]
if (!another_directory_present) {
        # download the repository archive from GitHub
        fileURL <- "https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/archive/master.zip"
        download.file(fileURL,destfile="List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words-master.zip", method = "curl")
        # unzip the zip file
        unzip("List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words-master.zip")
}

# get the list of profanity words [10]
profanity <- read.csv("./List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words-master/en", header = F)
profanity <- as.character(profanity$V1)

# combine the 3 data-sets into a single character vector
combined_txt <- c(blogs_txt, news_txt, twitter_txt)

# create a corpus and clean the data with the transformations below
clean_txt_corpus <- VCorpus(VectorSource(combined_txt))
clean_txt_corpus <- tm_map(clean_txt_corpus, PlainTextDocument)
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer(tolower))      # conversion to lower case
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer(
        function(x) {gsub("-", " ", x)}))                                       # split words separated by a "-"
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer(
        function(x) {gsub("http\\S+\\s*", " ", x)}))                            # remove web site addresses
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer(
        function(x) {gsub("www\\S+\\s*", " ", x)}))                             # remove web site addresses
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer( 
        function(x) {gsub("#\\S+"," ", x)}))                                    # remove twitter hashtags
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer( 
        function(x) {gsub("@\\S+"," ", x)}))                                    # remove twitter mentions
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer( 
        function(x) {gsub("\\S+@\\S+", " ", x)}))                               # remove email addresses
clean_txt_corpus <- tm_map(clean_txt_corpus, removePunctuation)                 # remove punctuation
clean_txt_corpus <- tm_map(clean_txt_corpus, removeNumbers)                     # remove numbers
clean_txt_corpus <- tm_map(clean_txt_corpus, removeWords, stopwords("english")) # remove English stop words
clean_txt_corpus <- tm_map(clean_txt_corpus, stripWhitespace)                   # remove extra white space
clean_txt_corpus <- tm_map(clean_txt_corpus, stemDocument, language="english")  # stem words
clean_txt_corpus <- tm_map(clean_txt_corpus, removeWords, profanity)            # remove profanity words

Appendix V: Create n-grams: uni-grams, bi-grams and tri-grams

options( java.parameters = "-Xmx4g" )   # increase the Java heap size used by RWeka
options(mc.cores=1)                     # run tokenization on a single core to avoid rJava errors

NgramBuilder <- function(df, N) {
        NgramFunction <- NGramTokenizer(df,Weka_control(min = N, max = N))
        return(NgramFunction)
}
# transform corpus data in a data frame
df_clean_txt_corpus <- data.frame(text=unlist(sapply(clean_txt_corpus,'[',"content")),stringsAsFactors=FALSE)

unigram <- data.frame(table(NgramBuilder(df_clean_txt_corpus,1)))
unigram <- unigram[order(unigram$Freq,decreasing=TRUE),]
colnames(unigram) <- c("Term", "Freq")

bigram <- data.frame(table(NgramBuilder(df_clean_txt_corpus,2)))
bigram <- bigram[order(bigram$Freq,decreasing=TRUE),]
colnames(bigram) <- c("Term", "Freq")

trigram <- data.frame(table(NgramBuilder(df_clean_txt_corpus,3)))
trigram <- trigram[order(trigram$Freq,decreasing=TRUE),]
colnames(trigram) <- c("Term", "Freq")

Appendix VI: Libraries needed

library(knitr, warn.conflicts = FALSE, quietly=TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly=TRUE)
library(stringi, warn.conflicts = FALSE, quietly=TRUE)
library(tm, warn.conflicts = FALSE, quietly=TRUE)
library(SnowballC, warn.conflicts = FALSE, quietly=TRUE)
library(wordcloud, warn.conflicts = FALSE, quietly=TRUE)
library(RWeka, warn.conflicts = FALSE, quietly=TRUE)
library(rJava, warn.conflicts = FALSE, quietly=TRUE)

Appendix VII: Session information

## for some experts this could be useful
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] dplyr_0.4.3        Rmisc_1.5          plyr_1.8.3        
##  [4] lattice_0.20-33    rJava_0.9-8        RWeka_0.4-25      
##  [7] wordcloud_2.5      RColorBrewer_1.1-2 SnowballC_0.5.1   
## [10] tm_0.6-2           NLP_0.1-9          stringi_1.0-1     
## [13] ggplot2_2.1.0      knitr_1.12.3      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3        magrittr_1.5       RWekajars_3.7.13-1
##  [4] munsell_0.4.3      colorspace_1.2-6   R6_2.1.2          
##  [7] highr_0.5.1        stringr_1.0.0      tools_3.2.2       
## [10] parallel_3.2.2     grid_3.2.2         gtable_0.2.0      
## [13] DBI_0.3.1          htmltools_0.3      assertthat_0.1    
## [16] yaml_2.1.13        digest_0.6.9       formatR_1.3       
## [19] evaluate_0.8.3     slam_0.1-32        rmarkdown_0.9.5   
## [22] labeling_0.3       scales_0.4.0