This is a milestone report for the Data Science Specialization SwiftKey Capstone. The aim of this project is to apply data science in the area of natural language processing using a data-set from a corpus called HC Corpora [1]. The goal is to create a data product: a predictive text model that can predict the next word the user will type on a smart keyboard.
The basic goals of this milestone report, produced with R [2], are to load and clean the data that will later be used to create a predictive model, to do some basic exploratory analysis of the data, to report any interesting findings, and to get feedback on our plans for creating a prediction algorithm and a Shiny app. This report is an R Markdown document that can be processed by knitr [3-5]. It contains a summary, 7 figures and 2 tables; so that it can be appreciated by a non-data-scientist manager, the more complex R code is only available in the appendix.
In conclusion, we learned how to clean the data-set to select a final list of words and to create n-grams. For a Shiny web application the key point is to be fast, so we will have to optimize all steps by using optimal R tools, a sub-sample of the full data-set, or a local file with the final cleaned data-sets.
The data-set comes from a corpus called HC Corpora [1] and can be downloaded from the Coursera link Coursera-SwiftKey [1.41 GB] using the R code of appendix I.
This is the training data-set for the Data Science Specialization SwiftKey Capstone project. Some documentation on the corpora is also available in the Readme; the files have been language-filtered but may still contain some foreign text.
The data-set from [1] can be read as follows:
# choose the category of files to read: en_US
category <- "/en_US"
lines <- 10000  # for all lines use -1L
# read the corresponding 3 files
blogs_txt   <- readLines(paste("./final", category, category, ".blogs.txt",   sep = ""), skipNul = TRUE, n = lines)
news_txt    <- readLines(paste("./final", category, category, ".news.txt",    sep = ""), skipNul = TRUE, n = lines)
twitter_txt <- readLines(paste("./final", category, category, ".twitter.txt", sep = ""), skipNul = TRUE, n = lines)
The function presented in appendix II provides a basic summary of a file: file size, number of lines, number of non-empty lines and number of words.
Table I presents a basic summary of the 3 data-sets contained in the English en_US folder:
| File Name | File size [MB] | Number of lines | Number of non-empty lines | Number of words |
|---|---|---|---|---|
| en_US.blogs.txt | 200.4242 | 10000 | 10000 | 410620 |
| en_US.news.txt | 196.2775 | 10000 | 10000 | 343929 |
| en_US.twitter.txt | 159.3641 | 10000 | 10000 | 127674 |
These 3 data-sets (en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt) contain respectively about 0.9 million, 1 million and 2.4 million lines of text in full, which is quite large, and we need to take this into account in the design of the application. For the data-set exploration we therefore took a sub-set of each file (the first 10000 lines, as reflected in Table I); later we can either optimize the R code to run on the full data-set or randomly select unbiased samples from the full data-set, as sketched below.
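A minimal sketch of one way to draw such a random sample is shown below; the 5% sampling rate and the helper name sampleLines are illustrative choices, not necessarily the values that will be used in the final application.
# illustrative sketch: keep each line with probability p so the sample stays unbiased
set.seed(123)  # for reproducibility
sampleLines <- function(lines_txt, p = 0.05) {
  keep <- rbinom(length(lines_txt), size = 1, prob = p) == 1
  lines_txt[keep]
}
blogs_sample <- sampleLines(blogs_txt, p = 0.05)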
Figure 1: Number of words per line in en_US.blogs.txt
Figure 2: Number of words per line in en_US.news.txt
Figure 3: Number of words per line in en_US.twitter.txt
The function presented in appendix III provides a basic statistical summary of the distribution of the number of words per line using [7]: minimum, 1st quartile, median, mean, 3rd quartile and maximum.
Table II presents this basic summary of the word count per line for the 3 data-sets:
| File Name | Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 1 | 9 | 28 | 41.28 | 59 | 681 |
| en_US.news.txt | 1 | 19 | 32 | 34.81 | 46 | 302 |
| en_US.twitter.txt | 1 | 7 | 12 | 12.65 | 18 | 34 |
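As a side note, the statistics of Table II come down to a single call per file with the stringi package [7]; a quick sketch, assuming the files are loaded as in the code above:
library(stringi)
# distribution of the number of words per line for the blogs file
summary(stri_count_words(blogs_txt))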
Let us have a look at random examples of lines with 0, 1, 2 and 3 words, as presented below:
blogs_txt[stri_count_words(blogs_txt)==0]
## character(0)
blogs_txt[stri_count_words(blogs_txt)==1][4:8]
## [1] "~ heehee ~" "Sickos" "Jerry" "Camila" "Margaret"
blogs_txt[stri_count_words(blogs_txt)==2][2:4]
## [1] "You die?" "Furioso dreadnought" "Stella Telleria"
blogs_txt[stri_count_words(blogs_txt)==3][2:4]
## [1] "M. Blakeman Ingle" "1 Beauty Contest" "Just go away."
As expected, the summaries and figures show that:
- en_US.blogs.txt contains the highest number of words per line;
- en_US.twitter.txt has the fewest words per line but the highest number of lines;
- en_US.news.txt falls between the other two files.
As we can see, the number of words per line ranges from 1 to 681 and, as shown above, some substantial data cleaning is needed to remove punctuation, numbers, etc.
We will now combine the 3 data-sets into a single data-set and perform a substantial clean-up using predefined transformations and regular expressions [8-9] with the R function presented in appendix IV.
We can now look at the result of the clean-up using a “word cloud”, which gives greater prominence to words that appear more frequently in the data-set, using [11]. The result looks pretty good; some further exploration shows that the only remaining issue is a few merged words like “capitalismwith”, which probably come from the original data-set.
Figure 4: Word cloud after the clean up of the data-set
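For reference, a word cloud like Figure 4 can be produced from the cleaned corpus with the wordcloud package [11]; this is only a sketch, and parameters such as max.words and the colour palette are illustrative, not necessarily those used for the figure.
library(tm)
library(wordcloud)
library(RColorBrewer)
# term frequencies computed from the cleaned corpus built in appendix IV
tdm  <- TermDocumentMatrix(clean_txt_corpus)
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)  # slam is a dependency of tm
# plot the most frequent terms, the biggest words being the most frequent ones
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))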
In Natural Language Processing (NLP) [12-13] an n-gram is a contiguous sequence of n items from a given sequence of text or speech. We will extract uni-grams (1-grams), bi-grams (2-grams) and tri-grams (3-grams) from the cleaned text corpus with the functions presented in appendix V.
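To make the n-gram idea concrete, here is a tiny example using the same RWeka tokenizer as appendix V on a made-up sentence:
library(RWeka)
# all contiguous pairs of words (bi-grams) of a toy sentence
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
## [1] "thanks for" "for the"    "the follow"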
Figure 5 below shows the top 10 most frequent words occurring in the data-set. We can see that this is consistent with the “word cloud” of Figure 4.
Figure 5: The top 10 most frequent uni-grams
Figure 6 below shows the top 10 most frequent bi-grams occurring in the data-set. The result seems reasonable and will highly depend on the data-set used.
Figure 6: The top 10 most frequent bi-grams
Figure 7 below shows the top 10 most frequent tri-grams occurring in the data-set. As for the bi-grams, the result seems reasonable and will highly depend on the data-set used.
Figure 7: The top 10 most frequent tri-grams
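For completeness, the bar charts of Figures 5-7 can be drawn from the frequency tables built in appendix V; a minimal ggplot2 [6] sketch for the uni-grams is shown below, and the same pattern applies to the bi-grams and tri-grams.
library(ggplot2)
# top 10 most frequent uni-grams from the 'unigram' data frame of appendix V
top10 <- head(unigram, 10)
ggplot(top10, aes(x = reorder(Term, Freq), y = Freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Uni-gram", y = "Frequency")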
We summarize below some interesting findings that seem to be key points for the future application.
We learned a lot and more work is still needed, since there are many options and we should try to find the optimal one. For a Shiny web application the key point is to be fast, so we will have to optimize all steps: use optimal R tools to run on the full data-set, or randomly select unbiased samples from the full data-set, or store the cleaned data in a local file (see the sketch below), since this step will stay identical as long as the training data-set does not change.
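As a sketch of the local-file idea, the cleaned n-gram tables could be saved once with saveRDS and reloaded instantly by the Shiny application; the file name ngrams_clean.rds is arbitrary.
# save the cleaned n-gram tables once, after the expensive cleaning step
saveRDS(list(unigram = unigram, bigram = bigram, trigram = trigram),
        file = "ngrams_clean.rds")
# reload them instantly when the Shiny application starts
ngrams <- readRDS("ngrams_clean.rds")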
The final prediction application will be implemented using Shiny. The predictive text model will be based on the work done so far, but we will have to select the optimal approach. The next steps for this project will consist in optimizing the current code and building the prediction model.
In conclusion, we learned how to clean the data-set to select a final list of words and to create n-grams. For a Shiny web application the key point is to be fast, so we will have to optimize all steps by using optimal R tools, a sub-sample of the full data-set, or a local file with the final cleaned data-sets. There are many other ways to carry out this project. We will now optimize the current code and learn how to build a prediction model. For experts, the list of loaded libraries and the sessionInfo() output can be found in appendix VI and appendix VII.
[1] HC Corpora. www.corpora.heliohost.org/.
[2] R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. www.R-project.org/.
[3] Yihui Xie (2015). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.11.
[4] Yihui Xie (2015). Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963.
[5] Yihui Xie (2014). knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595.
[6] Package ‘ggplot2’. http://ggplot2.org/.
[7] Package ‘stringi’. https://cran.r-project.org/web/packages/stringi/index.html.
[8] Ingo Feinerer, Kurt Hornik, David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software.
[9] CRAN Task View: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html.
[10] List of profanity words. https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.
[11] Package ‘wordcloud’. http://blog.fellstat.com/?cat=11.
[12] Natural language processing (Wikipedia). https://en.wikipedia.org/wiki/Natural_language_processing/.
[13] N-gram (Wikipedia). https://en.wikipedia.org/wiki/N-gram/.
# the directory with the data files should be in the working directory
directory_present <- file.exists("final")
# if not already done, download and unzip the data-set
if (!directory_present) {
  # download the file from the Coursera website
  fileURL <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(fileURL, destfile = "file.zip", method = "curl")
  # unzip the zip file
  unzip("file.zip")
}
# compute, for one file, the file size, the number of lines, the number of non-empty lines and the number of words
df <- data.frame(Name = character(), Size = numeric(), Line = integer(),
                 LineNempty = integer(), Word = integer(), stringsAsFactors = FALSE)
summaryFileInfo <- function(file_txt, variable_name_txt, df) {
  # rebuild the path and the display name from the variable name, e.g. "blogs_txt" -> "en_US.blogs.txt"
  path_file <- paste("./final", category, category, ".", gsub("[_]", ".", variable_name_txt), sep = "")
  name_file <- paste(gsub("[/]", "", category), ".", gsub("[_]", ".", variable_name_txt), sep = "")
  size_file <- file.info(path_file)$size / 1024^2                    # file size in MB
  line_file <- length(file_txt)                                      # number of lines read
  line_fileNempty <- stri_stats_general(file_txt)[["LinesNEmpty"]]   # number of non-empty lines
  word_file <- sum(sapply(gregexpr("\\S+", file_txt), length))       # number of words
  df <- rbind(df, data.frame(name_file, size_file, line_file, line_fileNempty, word_file))
  return(df)
}
df <- summaryFileInfo(blogs_txt, "blogs_txt", df)
df <- summaryFileInfo(news_txt, "news_txt", df)
df <- summaryFileInfo(twitter_txt, "twitter_txt", df)
colnames(df) <- c("File Name", "File size [MB]", "Number of lines", "Number of non-empty lines", "Number of words")
df_summary <- data.frame(Name=character(),Min=double(),FQ=double(),Med=double(),
Mean=double(),TQ=double(),Max=double(),stringsAsFactors = FALSE)
summaryWord <- function(file_txt, variable_name_txt, df_summary) {
  name_file <- paste(gsub("[/]", "", category), ".", gsub("[_]", ".", variable_name_txt), sep = "")
  # compute the six summary statistics of the words-per-line distribution in a single call
  word_stats <- summary(stri_count_words(file_txt))
  df_summary <- rbind(df_summary,
                      data.frame(name_file,
                                 min_word = word_stats[[1]], fq_word = word_stats[[2]],
                                 med_word = word_stats[[3]], mean_word = word_stats[[4]],
                                 tq_word = word_stats[[5]], max_word = word_stats[[6]]))
  return(df_summary)
}
df_summary<-summaryWord(blogs_txt,"blogs_txt",df_summary)
df_summary<-summaryWord(news_txt,"news_txt",df_summary)
df_summary<-summaryWord(twitter_txt,"twitter_txt",df_summary)
colnames(df_summary)<-c("File Name","Minimum","1st Quartile","Median","Mean","3rd Quartile","Maximum")
# the directory with the profanity word list should be in the working directory
another_directory_present <- file.exists("List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words-master")
# if not already done, download and unzip the list
if (!another_directory_present) {
  # download the profanity word list [10] from GitHub (archive of the master branch)
  fileURL <- "https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/archive/master.zip"
  download.file(fileURL, destfile = "List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words-master.zip", method = "curl")
  # unzip the zip file
  unzip("List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words-master.zip")
}
# read the list of profanity words
profanity <- read.csv("./List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words-master/en", header = FALSE)
profanity <- profanity$V1
# combine the 3 data-sets
combined_txt <- paste(blogs_txt, news_txt, twitter_txt)
# create a corpus document and clean the data as described below
clean_txt_corpus <- VCorpus(VectorSource(combined_txt))
clean_txt_corpus <- tm_map(clean_txt_corpus, PlainTextDocument)
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer(tolower)) # conversion to lower case
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer(
  function(x) {gsub("-", " ", x)})) # split words separated by a "-"
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer(
  function(x) {gsub("http\\S+\\s*", " ", x)})) # remove web site addresses
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer(
  function(x) {gsub("www\\S+\\s*", " ", x)})) # remove web site addresses
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer(
  function(x) {gsub("#\\S+", " ", x)})) # remove twitter hashtags
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer(
  function(x) {gsub("@\\S+", " ", x)})) # remove twitter mentions
clean_txt_corpus <- tm_map(clean_txt_corpus, content_transformer(
  function(x) {gsub("\\S+@\\S+", " ", x)})) # remove email addresses
clean_txt_corpus <- tm_map(clean_txt_corpus, removePunctuation) # remove punctuation
clean_txt_corpus <- tm_map(clean_txt_corpus, removeNumbers) # remove numbers
clean_txt_corpus <- tm_map(clean_txt_corpus, removeWords, stopwords("english")) # remove English stop words
clean_txt_corpus <- tm_map(clean_txt_corpus, stripWhitespace) # remove extra white space
clean_txt_corpus <- tm_map(clean_txt_corpus, stemDocument, language = "english") # stem words
clean_txt_corpus <- tm_map(clean_txt_corpus, removeWords, profanity) # remove profanity words [10]
options(java.parameters = "-Xmx4g") # give more memory to Java, which is used by RWeka
options(mc.cores = 1) # disable parallel processing to avoid tm/RWeka issues
NgramBuilder <- function(df, N) {
  # tokenize the text into all contiguous sequences of exactly N words
  NgramFunction <- NGramTokenizer(df, Weka_control(min = N, max = N))
  return(NgramFunction)
}
# transform the corpus content into a data frame
df_clean_txt_corpus <- data.frame(text=unlist(sapply(clean_txt_corpus,'[',"content")),stringsAsFactors=FALSE)
unigram <- data.frame(table(NgramBuilder(df_clean_txt_corpus,1)))
unigram <- unigram[order(unigram$Freq,decreasing=TRUE),]
colnames(unigram) <- c("Term", "Freq")
bigram <- data.frame(table(NgramBuilder(df_clean_txt_corpus,2)))
bigram <- bigram[order(bigram$Freq,decreasing=TRUE),]
colnames(bigram) <- c("Term", "Freq")
trigram <- data.frame(table(NgramBuilder(df_clean_txt_corpus,3)))
trigram <- trigram[order(trigram$Freq,decreasing=TRUE),]
colnames(trigram) <- c("Term", "Freq")
library(knitr, warn.conflicts = FALSE, quietly=TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly=TRUE)
library(stringi, warn.conflicts = FALSE, quietly=TRUE)
library(tm, warn.conflicts = FALSE, quietly=TRUE)
library(SnowballC, warn.conflicts = FALSE, quietly=TRUE)
library(wordcloud, warn.conflicts = FALSE, quietly=TRUE)
library(RWeka, warn.conflicts = FALSE, quietly=TRUE)
library(rJava, warn.conflicts = FALSE, quietly=TRUE)
## for some experts this could be useful
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_0.4.3 Rmisc_1.5 plyr_1.8.3
## [4] lattice_0.20-33 rJava_0.9-8 RWeka_0.4-25
## [7] wordcloud_2.5 RColorBrewer_1.1-2 SnowballC_0.5.1
## [10] tm_0.6-2 NLP_0.1-9 stringi_1.0-1
## [13] ggplot2_2.1.0 knitr_1.12.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 magrittr_1.5 RWekajars_3.7.13-1
## [4] munsell_0.4.3 colorspace_1.2-6 R6_2.1.2
## [7] highr_0.5.1 stringr_1.0.0 tools_3.2.2
## [10] parallel_3.2.2 grid_3.2.2 gtable_0.2.0
## [13] DBI_0.3.1 htmltools_0.3 assertthat_0.1
## [16] yaml_2.1.13 digest_0.6.9 formatR_1.3
## [19] evaluate_0.8.3 slam_0.1-32 rmarkdown_0.9.5
## [22] labeling_0.3 scales_0.4.0