Introduction

The goal of the project is to create a data product (a Shiny app) that uses an NLP algorithm to predict the next word given an input phrase.
This report presents an exploratory analysis of the training data, provides summary statistics, and describes our plans for building the prediction algorithm and the Shiny app.

Data

For simplicity and brevity, routine steps (e.g. downloading the data from the web and unpacking it) are not covered in this report.

According to the project instructions, we use the en_US locale text data from the HC Corpora corpus for training.
The complete data set (four locales) is available at the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
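
For reference, here is a minimal sketch of the omitted download and unpacking step (the internal paths inside the zip archive and the txt/ destination folder used below are assumptions about the local setup):

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if ( !file.exists("Coursera-SwiftKey.zip") ) download.file( url, "Coursera-SwiftKey.zip" )
# extract only the en_US files into the txt/ folder used in the rest of this report
unzip( "Coursera-SwiftKey.zip",
       files = c("final/en_US/en_US.blogs.txt",
                 "final/en_US/en_US.news.txt",
                 "final/en_US/en_US.twitter.txt"),
       junkpaths = TRUE, exdir = "txt" )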

Data Source

The training data for the en_US locale consists of the following three text files (sizes in MB):

file_names <- dir( "txt" )                                            # the three en_US files
file_sizes <- file.size( file.path("txt", file_names) ) / (2^10)^2    # bytes -> MB
names( file_sizes ) <- file_names ; file_sizes
##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
##          200.4242          196.2775          159.3641

Basic Statistics

Here are the summary statistics for the loaded data sets: the number of lines, the minimum and maximum line length (in characters), the total number of words (sequences of letters matching the pattern [A-Za-z]+), and the number of unique words:

library(tm) ; library(RWeka) ; library(plotly)    # load the libraries 
conn1 <- file("txt/en_US.blogs.txt"   , "rb")
conn2 <- file("txt/en_US.news.txt"    , "rb")
conn3 <- file("txt/en_US.twitter.txt" , "rb")

txtBlogs   <- readLines(conn1 , encoding="UTF-8" , skipNul=TRUE)
txtNews    <- readLines(conn2 , encoding="UTF-8" , skipNul=TRUE)
txtTwitter <- readLines(conn3 , encoding="UTF-8" , skipNul=TRUE)

close(conn1) ; close(conn2) ; close(conn3)

lstDSet <- list(txtBlogs,txtNews,txtTwitter) ; lstDLength <- sapply(lstDSet, nchar)
dfSummary <- data.frame( lines = sapply( lstDSet, length ) )
dfSummary$min_char <- sapply(lstDLength, min) ; dfSummary$max_char <- sapply(lstDLength, max)

# split each line on everything that is not a letter to obtain the individual words
wordsAll <- sapply( lstDSet, function(x) unlist(strsplit(gsub("[^A-Za-z]"," ",x),split="[ ]+")) )
# frequency table of the unique words, sorted by decreasing frequency
wordsUnique <- sapply( wordsAll, function(x) sort( table(x), decreasing=TRUE ) )

dfSummary$words.Total <- sapply(wordsAll,length) ; dfSummary$words.Unique <- sapply(wordsUnique,length)
row.names(dfSummary) <- c("blogs","news","twitter") ; dfSummary
##           lines min_char max_char words.Total words.Unique
## blogs    899288        1    40833    37987415       326419
## news    1010242        1    11384    34811459       260062
## twitter 2360148        2      140    30721179       414007

Features of the Data

The following bar plots show the 15 most frequent words in each source (blogs, news and twitter).

par( mfcol=c(1,3) , mar=c(5,4,2,2) , las=1 , oma=c(0,0,2,0) , cex.lab=1.5 )
barplot( wordsUnique[[1]][1:15], horiz=TRUE, main="in BLOGS", cex.names=1.3 )
barplot( wordsUnique[[2]][1:15], horiz=TRUE, main="in NEWS", cex.names=1.3 , xlab="Frequency" )
barplot( wordsUnique[[3]][1:15], horiz=TRUE, main="in TWITTER", cex.names=1.3 )
title( main="15 Most Frequent Words", outer=TRUE, cex.main=1.5 )
box( which="outer", lty="solid" )

Getting and Cleaning the Data

The result above (e.g. the presence of stopwords and uppercase letters) suggests that the data needs to be cleaned and transformed to better fit our purpose.
Also, given the large volume of data, we will proceed with a representative sample, i.e. random subsets created by sampling 10% of the text lines from each original data set.

Sampling

Here are the summary statistics for the data obtained via sampling.

# sample 10% of the lines of each data set; the fixed seed makes the samples reproducible
set.seed(123) ; smplBlogs   <- sample( txtBlogs  , size=length(txtBlogs)  *0.1, replace=FALSE )
set.seed(123) ; smplNews    <- sample( txtNews   , size=length(txtNews)   *0.1, replace=FALSE )
set.seed(123) ; smplTwitter <- sample( txtTwitter, size=length(txtTwitter)*0.1, replace=FALSE )

lstSample <- list(smplBlogs,smplNews,smplTwitter) ; lstSLength <- sapply(lstSample, nchar)
smplSummary <- data.frame( lines = sapply( lstSample, length ) )
smplSummary$min_char <- sapply(lstSLength, min) ; smplSummary$max_char <- sapply(lstSLength, max)

wordsAllSMPL <- sapply( lstSample, function(x) unlist(strsplit(gsub("[^A-Za-z]"," ",x),split="[ ]+")) )
wordsUniqueSMPL <- sapply( wordsAllSMPL, function(x) sort( table(x), decreasing=TRUE ) )

smplSummary$words.Total <- sapply( wordsAllSMPL , length )
smplSummary$words.Unique <- sapply( wordsUniqueSMPL ,length )
row.names(smplSummary) <- c("blogs","news","twitter") ; smplSummary
##          lines min_char max_char words.Total words.Unique
## blogs    89928        1    12409     3788773       106261
## news    101024        2     2386     3477094        97132
## twitter 236014        2      140     3070967       114763

Features of the Representative Sample

The following bar plots show the word frequencies after sampling (i.e. in the representative samples).
Comparing these bar plots with those for the original data, we can conclude that sampling provides a reasonably accurate approximation of the original distribution of words (at least for the most frequently occurring ones).

par( mfcol=c(1,3) , mar=c(5,4,2,2) , las=1 , oma=c(0,0,2,0) , cex.lab=1.5 )
barplot( wordsUniqueSMPL[[1]][1:15], horiz=TRUE, main="in BLOGS", cex.names=1.3 )
barplot( wordsUniqueSMPL[[2]][1:15], horiz=TRUE, main="in NEWS", cex.names=1.3 , xlab="Frequency" )
barplot( wordsUniqueSMPL[[3]][1:15], horiz=TRUE, main="in TWITTER", cex.names=1.3 )
title( main="15 Most Frequent Words in Sampled Data", outer=TRUE, cex.main=1.5 )
box( which="outer", lty="solid" )


Findings

The following interactive plot is based on the frequencies of unique words in the BLOGS sampled data (3788773 words in total, 106261 unique). It is essentially a cumulative distribution function showing how many unique words from a frequency-sorted dictionary are needed to cover a given fraction of the corpus.
As shown in the bar plots above, the most frequent words are predominantly stopwords.

dfCDF <- data.frame( coverage=cumsum(prop.table( wordsUniqueSMPL[[1]] )), words=seq_along( wordsUniqueSMPL[[1]] ) )

plot_ly( dfCDF ) %>%
  add_trace( x=~words, y=~coverage , type = 'scatter', mode="lines" , name="" ) %>%
  add_markers( x=c( which( dfCDF$coverage >= 0.5 )[1] ,
                    which( dfCDF$coverage >= 0.9 )[1] ) ,
               y=c( dfCDF$coverage[ which( dfCDF$coverage >= 0.5 )[1] ] ,
                    dfCDF$coverage[ which( dfCDF$coverage >= 0.9 )[1] ] ) ,
               text=c( "50% coverage" , "90% coverage" ) ,
               showlegend = FALSE , name="" ) %>%
  layout( hovermode = "y unified" )

Interestingly, the first 126 unique words already cover 50% of all word occurrences in the BLOGS sample, while 8095 unique words are needed to cover 90%.
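
These thresholds can be read directly from dfCDF:

which( dfCDF$coverage >= 0.5 )[1]   # first index reaching 50% coverage (126, as stated above)
which( dfCDF$coverage >= 0.9 )[1]   # first index reaching 90% coverage (8095, as stated above)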

Cleaning

At this stage of the project we apply only a bare minimum of data cleaning: converting uppercase to lowercase, removing numbers, punctuation and stopwords, and stripping extra whitespace.
Additional cleaning functionality (e.g. profanity filtering) can be introduced later as needed; a sketch of how this could be done follows the cleaning code below.

corpBlogs <- VCorpus( VectorSource(smplBlogs) )
corpBlogs <- tm_map( corpBlogs , content_transformer( tolower ) )
corpBlogs <- tm_map( corpBlogs , removeNumbers  )
corpBlogs <- tm_map( corpBlogs , removePunctuation , ucp = TRUE )
# punctuation has already been stripped from the corpus, so strip it from the stopword list too
corpBlogs <- tm_map( corpBlogs , removeWords , removePunctuation(stopwords("english")) )
corpBlogs <- tm_map( corpBlogs , stripWhitespace )

corpNews <- VCorpus( VectorSource(smplNews) )
corpNews <- tm_map( corpNews , content_transformer( tolower ) )
corpNews <- tm_map( corpNews , removeNumbers  )
corpNews <- tm_map( corpNews , removePunctuation , ucp = TRUE )
corpNews <- tm_map( corpNews , removeWords , removePunctuation(stopwords("english")) )
corpNews <- tm_map( corpNews , stripWhitespace )

corpTwitter <- VCorpus( VectorSource(smplTwitter) )
corpTwitter <- tm_map( corpTwitter , content_transformer( tolower ) )
corpTwitter <- tm_map( corpTwitter , removeNumbers  )
corpTwitter <- tm_map( corpTwitter , removePunctuation , ucp = TRUE )
corpTwitter <- tm_map( corpTwitter , removeWords , removePunctuation(stopwords("english")) )
corpTwitter <- tm_map( corpTwitter , stripWhitespace )
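
Profanity filtering, if introduced later, could reuse the same removeWords transformer. A minimal sketch, assuming a hypothetical one-word-per-line list badwords.txt (not applied in this report):

profanity <- readLines( "badwords.txt", encoding="UTF-8", skipNul=TRUE )   # hypothetical word list
corpBlogs <- tm_map( corpBlogs , removeWords , profanity )                 # likewise for the other corpora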

The cleaning procedure creates three corpora (collections of documents) containing text from the blogs, news and twitter samples.
The following examples illustrate how the content changes.

Blogs

c( smplBlogs[2] , corpBlogs[[2]]$content )
## [1] "Walden Pond, Mt. Rainier, Big Sur, Everglades and so forth;"
## [2] "walden pond mt rainier big sur everglades forth"

News

c( smplNews[10] , corpNews[[10]]$content )
## [1] "Kids bobbed nearby, clutching the pool's edges, waiting for Chapman's verdict."
## [2] "kids bobbed nearby clutching pools edges waiting chapmans verdict"

Twitter

c( smplTwitter[1] , corpTwitter[[1]]$content )
## [1] "just wanted to thank you & ask what got you started on your mission?"
## [2] "just wanted thank ask got started mission"

Exploratory Data Analysis

The purpose of this section is to understand the frequencies of single words and word combinations. To achieve this, we use tokenization (NGramTokenizer from the RWeka package) to split the documents of each corpus into three kinds of N-grams: unigrams, bigrams and trigrams.
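
For illustration, here is the bigram tokenizer applied to a short toy string:

NGramTokenizer( "one of the best things", Weka_control(min=2,max=2) )

This should return the overlapping word pairs "one of", "of the", "the best" and "best things".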

Blogs

The following bar plots show the most frequent unigrams, bigrams and trigrams in BLOGS.

# term-document matrices for unigrams, bigrams and trigrams
BlogsTDM1 <- TermDocumentMatrix(corpBlogs, control=list(tokenize=function(x) NGramTokenizer(x,Weka_control(min=1,max=1))))
BlogsTDM2 <- TermDocumentMatrix(corpBlogs, control=list(tokenize=function(x) NGramTokenizer(x,Weka_control(min=2,max=2))))
BlogsTDM3 <- TermDocumentMatrix(corpBlogs, control=list(tokenize=function(x) NGramTokenizer(x,Weka_control(min=3,max=3))))

# drop extremely sparse terms to keep the matrices at a manageable size
BlogsTDM1 <- removeSparseTerms(BlogsTDM1, 0.9999)
BlogsTDM2 <- removeSparseTerms(BlogsTDM2, 0.9999)
BlogsTDM3 <- removeSparseTerms(BlogsTDM3, 0.9999)

Blogs1Freq <- sort( rowSums(as.matrix(BlogsTDM1)), decreasing = TRUE )
Blogs2Freq <- sort( rowSums(as.matrix(BlogsTDM2)), decreasing = TRUE )
Blogs3Freq <- sort( rowSums(as.matrix(BlogsTDM3)), decreasing = TRUE )
par( mfcol=c(1,3) , mar=c(5,8,2,2) , las=1 , oma=c(0,0,2,0) , cex.lab=1.5 )
barplot( Blogs1Freq[1:15], horiz=TRUE, main="Unigrams", cex.names=1.3 )
barplot( Blogs2Freq[1:15], horiz=TRUE, main="Bigrams", cex.names=1.3 , xlab="Frequency" )
barplot( Blogs3Freq[1:15], horiz=TRUE, main="Trigrams", cex.names=1.0 )
title( main="15 Most Frequent N-grams in BLOGS", outer=TRUE, cex.main=1.5 )
box( which="outer", lty="solid" )

News

The following bar plots show the most frequent unigrams, bigrams and trigrams in NEWS.

NewsTDM1 <- TermDocumentMatrix(corpNews, control=list(tokenize=function(x) NGramTokenizer(x,Weka_control(min=1,max=1))))
NewsTDM2 <- TermDocumentMatrix(corpNews, control=list(tokenize=function(x) NGramTokenizer(x,Weka_control(min=2,max=2))))
NewsTDM3 <- TermDocumentMatrix(corpNews, control=list(tokenize=function(x) NGramTokenizer(x,Weka_control(min=3,max=3))))

NewsTDM1 <- removeSparseTerms(NewsTDM1, 0.9999)
NewsTDM2 <- removeSparseTerms(NewsTDM2, 0.9999)
NewsTDM3 <- removeSparseTerms(NewsTDM3, 0.9999)

News1Freq <- sort( rowSums(as.matrix(NewsTDM1)), decreasing = TRUE )
News2Freq <- sort( rowSums(as.matrix(NewsTDM2)), decreasing = TRUE )
News3Freq <- sort( rowSums(as.matrix(NewsTDM3)), decreasing = TRUE )
par( mfcol=c(1,3) , mar=c(5,8,2,2) , las=1 , oma=c(0,0,2,0) , cex.lab=1.5 )
barplot( News1Freq[1:15], horiz=TRUE, main="Unigrams", cex.names=1.3 )
barplot( News2Freq[1:15], horiz=TRUE, main="Bigrams", cex.names=1.3 , xlab="Frequency" )
barplot( News3Freq[1:15], horiz=TRUE, main="Trigrams", cex.names=1.0 )
title( main="15 Most Frequent N-grams in NEWS", outer=TRUE, cex.main=1.5 )
box( which="outer", lty="solid" )

Twitter

The following bar plots show the most frequent unigrams, bigrams and trigrams in TWITTER.

TwitterTDM1 <- TermDocumentMatrix(corpTwitter, control=list(tokenize=function(x) NGramTokenizer(x,Weka_control(min=1,max=1))))
TwitterTDM2 <- TermDocumentMatrix(corpTwitter, control=list(tokenize=function(x) NGramTokenizer(x,Weka_control(min=2,max=2))))
TwitterTDM3 <- TermDocumentMatrix(corpTwitter, control=list(tokenize=function(x) NGramTokenizer(x,Weka_control(min=3,max=3))))

TwitterTDM1 <- removeSparseTerms(TwitterTDM1, 0.9999)
TwitterTDM2 <- removeSparseTerms(TwitterTDM2, 0.9999)
TwitterTDM3 <- removeSparseTerms(TwitterTDM3, 0.9999)

Twitter1Freq <- sort( rowSums(as.matrix(TwitterTDM1)), decreasing = TRUE )
Twitter2Freq <- sort( rowSums(as.matrix(TwitterTDM2)), decreasing = TRUE )
Twitter3Freq <- sort( rowSums(as.matrix(TwitterTDM3)), decreasing = TRUE )
par( mfcol=c(1,3) , mar=c(5,8,2,2) , las=1 , oma=c(0,0,2,0) , cex.lab=1.5 )
barplot( Twitter1Freq[1:15], horiz=TRUE, main="Unigrams", cex.names=1.3 )
barplot( Twitter2Freq[1:15], horiz=TRUE, main="Bigrams", cex.names=1.3 , xlab="Frequency" )
barplot( Twitter3Freq[1:15], horiz=TRUE, main="Trigrams", cex.names=1.0 )
title( main="15 Most Frequent N-grams in TWITTER", outer=TRUE, cex.main=1.5 )
box( which="outer", lty="solid" )

Further Plans for Project Development

Sampling has proven to be a viable way to reduce the size of the training data while providing a fairly accurate approximation of the original distribution of words.
The exploratory data analysis provides a sufficient basis for creating a prediction algorithm. Using this analysis, we are going to build a basic N-gram model (unigrams, bigrams and trigrams).
The resulting N-grams have been shown to be highly corpus-specific, so we plan to add a parameter indicating the source of the input phrase to improve prediction accuracy.
For unseen/unobserved N-grams, 'backoff' and 'smoothing' techniques are possible candidates; a rough sketch of a backoff lookup is shown below.
Building a Shiny app that incorporates our algorithm is the final stage of the Capstone Project.
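
A very rough sketch of such a backoff lookup, using the BLOGS frequency vectors built above (an illustration of the idea rather than the final algorithm; it assumes the two input words are lowercase and contain no regex metacharacters):

predictNext <- function( w1, w2 ) {
    # try the trigram table first: trigrams starting with "w1 w2 "
    tri <- Blogs3Freq[ grep( paste0("^", w1, " ", w2, " "), names(Blogs3Freq) ) ]
    if ( length(tri) > 0 ) return( sub( paste0("^", w1, " ", w2, " "), "", names(tri)[1] ) )
    # back off to the bigram table: bigrams starting with "w2 "
    bi <- Blogs2Freq[ grep( paste0("^", w2, " "), names(Blogs2Freq) ) ]
    if ( length(bi) > 0 ) return( sub( paste0("^", w2, " "), "", names(bi)[1] ) )
    # last resort: the most frequent unigram overall
    names(Blogs1Freq)[1]
}
predictNext( "one", "of" )   # returns the most frequent continuation found in the BLOGS tables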