The goal of this project is to create a data product (a Shiny app) that
uses an NLP algorithm to predict the next word given an input
phrase.
This report presents an exploratory analysis of the training data with
summary statistics, and describes our plans for creating the prediction
algorithm and the Shiny app.
For simplicity and brevity, trivial activities (e.g. downloading and unpacking the data) are not covered here.
According to the project instructions, we use en_US locale
text data from the HC Corpora corpus for
training.
The complete data set (four locales) is available at the following link:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The training data for the en_US locale consists of the following three text files (sizes in MB):
file_names <- dir( "txt" )
file_sizes <- file.size( file.path("txt", file_names) ) / (2^10)^2 # bytes -> megabytes
names( file_sizes ) <- file_names ; file_sizes
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## 200.4242 196.2775 159.3641
Here are the summary statistics for the loaded data sets: the number
of lines, the minimum and maximum line length in characters, the total
number of words (maximal runs of characters matching [A-Za-z]),
and the number of unique words:
library(tm) ; library(RWeka) ; library(plotly) # load the libraries
# open in binary mode so embedded control characters do not truncate the input
conn1 <- file("txt/en_US.blogs.txt" , "rb")
conn2 <- file("txt/en_US.news.txt" , "rb")
conn3 <- file("txt/en_US.twitter.txt" , "rb")
txtBlogs <- readLines(conn1 , encoding="UTF-8" , skipNul=TRUE)
txtNews <- readLines(conn2 , encoding="UTF-8" , skipNul=TRUE)
txtTwitter <- readLines(conn3 , encoding="UTF-8" , skipNul=TRUE)
close(conn1) ; close(conn2) ; close(conn3)
lstDSet <- list(txtBlogs,txtNews,txtTwitter) ; lstDLength <- sapply(lstDSet, nchar)
dfSummary <- data.frame( lines = sapply( lstDSet, length ) )
dfSummary$min_char <- sapply(lstDLength, min) ; dfSummary$max_char <- sapply(lstDLength, max)
# split each line into words (runs of letters) and build sorted word-frequency tables
wordsAll <- sapply( lstDSet, function(x) unlist(strsplit(gsub("[^A-Za-z]"," ",x),split="[ ]+")) )
wordsUnique <- sapply( wordsAll, function(x) sort( table(x), decreasing=TRUE ) )
dfSummary$words.Total <- sapply(wordsAll,length) ; dfSummary$words.Unique <- sapply(wordsUnique,length)
row.names(dfSummary) <- c("blogs","news","twitter") ; dfSummary
## lines min_char max_char words.Total words.Unique
## blogs 899288 1 40833 37987415 326419
## news 1010242 1 11384 34811459 260062
## twitter 2360148 2 140 30721179 414007
The following bar plots show the 15 most frequent words in each source (blogs, news, and Twitter).
par( mfcol=c(1,3) , mar=c(5,4,2,2) , las=1 , oma=c(0,0,2,0) , cex.lab=1.5 )
barplot( wordsUnique[[1]][1:15], horiz=TRUE, main="in BLOGS", cex.names=1.3 )
barplot( wordsUnique[[2]][1:15], horiz=TRUE, main="in NEWS", cex.names=1.3 , xlab="Frequency" )
barplot( wordsUnique[[3]][1:15], horiz=TRUE, main="in TWITTER", cex.names=1.3 )
title( main="15 Most Frequent Words", outer=TRUE, cex.main=1.5 )
box( which="outer", lty="solid" )
The results above (e.g. the presence of stopwords and uppercase letters)
suggest that the data needs to be cleaned and transformed to better fit
our purpose.
Also, given the large volume of data, we will proceed with
representative samples, i.e. random subsets created by sampling 10%
of the text lines from each original data set.
Here are the summary statistics for the sampled data:
# reset the seed before each call so the 10% samples are reproducible
set.seed(123) ; smplBlogs <- sample( txtBlogs, size=length(txtBlogs)*0.1, replace=FALSE )
set.seed(123) ; smplNews <- sample( txtNews, size=length(txtNews) *0.1, replace=FALSE )
set.seed(123) ; smplTwitter <- sample( txtTwitter, size=length(txtTwitter)*0.1, replace=FALSE )
lstSample <- list(smplBlogs,smplNews,smplTwitter) ; lstSLength <- sapply(lstSample, nchar)
smplSummary <- data.frame( lines = sapply( lstSample, length ) )
smplSummary$min_char <- sapply(lstSLength, min) ; smplSummary$max_char <- sapply(lstSLength, max)
wordsAllSMPL <- sapply( lstSample, function(x) unlist(strsplit(gsub("[^A-Za-z]"," ",x),split="[ ]+")) )
wordsUniqueSMPL <- sapply( wordsAllSMPL, function(x) sort( table(x), decreasing=TRUE ) )
smplSummary$words.Total <- sapply( wordsAllSMPL , length )
smplSummary$words.Unique <- sapply( wordsUniqueSMPL ,length )
row.names(smplSummary) <- c("blogs","news","twitter") ; smplSummary
## lines min_char max_char words.Total words.Unique
## blogs 89928 1 12409 3788773 106261
## news 101024 2 2386 3477094 97132
## twitter 236014 2 140 3070967 114763
The following bar plots show the word frequencies after sampling
(i.e. in the representative samples).
Comparing these bar plots with those for the original data shows that
sampling provides an accurate approximation of the original
distribution of words (at least for the most frequently occurring
ones); a quick numerical check follows the plots below.
par( mfcol=c(1,3) , mar=c(5,4,2,2) , las=1 , oma=c(0,0,2,0) , cex.lab=1.5 )
barplot( wordsUniqueSMPL[[1]][1:15], horiz=TRUE, main="in BLOGS", cex.names=1.3 )
barplot( wordsUniqueSMPL[[2]][1:15], horiz=TRUE, main="in NEWS", cex.names=1.3 , xlab="Frequency" )
barplot( wordsUniqueSMPL[[3]][1:15], horiz=TRUE, main="in TWITTER", cex.names=1.3 )
title( main="15 Most Frequent Words in Sampled Data", outer=TRUE, cex.main=1.5 )
box( which="outer", lty="solid" )
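As a quick numerical check of this claim, we can compare the relative frequencies of the top words in the original vs. sampled BLOGS data, using the frequency tables computed above (an illustrative sketch; the proportions should be nearly identical):
topWords <- names( wordsUnique[[1]] )[1:10] # 10 most frequent words in the original BLOGS data
round( rbind( original = prop.table( wordsUnique[[1]] )[ topWords ] ,
              sampled = prop.table( wordsUniqueSMPL[[1]] )[ topWords ] ) , 4 )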
The following interactive plot is based on the frequencies of unique
words in the sampled BLOGS data (3788773 total words, 106261 unique
words). It is essentially a cumulative distribution function showing
how many unique words are needed in a frequency-sorted dictionary to
cover a given fraction of the corpus.
As shown in the bar plots above, the most frequent words are mostly
stopwords.
dfCDF <- data.frame( coverage=cumsum(prop.table( wordsUniqueSMPL[[1]] )) , words=seq_along( wordsUniqueSMPL[[1]] ) )
plot_ly( dfCDF ) %>%
    add_trace( x=~words , y=~coverage , type="scatter" , mode="lines" , name="" ) %>%
    add_markers( x=c( which( dfCDF$coverage >= 0.5 )[1] , which( dfCDF$coverage >= 0.9 )[1] ) ,
                 y=c( dfCDF$coverage[ which( dfCDF$coverage >= 0.5 )[1] ] , dfCDF$coverage[ which( dfCDF$coverage >= 0.9 )[1] ] ) ,
                 text=c( "50% coverage" , "90% coverage" ) , showlegend=FALSE , name="" ) %>%
    layout( hovermode="y unified" )
Interestingly, the first 126 unique words cover 50% of the total number of words in the BLOGS corpus, and the first 8095 unique words cover 90%.
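These thresholds can be read off dfCDF directly, since the row index in the frequency-sorted table equals the number of unique words:
c( words_50 = which( dfCDF$coverage >= 0.5 )[1] ,
   words_90 = which( dfCDF$coverage >= 0.9 )[1] ) # 126 and 8095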
At this stage of the project, we apply only a bare minimum of data
cleaning: converting uppercase to lowercase, removing numbers,
punctuation, and stopwords, and stripping extra whitespace.
Additional cleaning functionality (e.g. profanity filtering) can be
introduced later as needed.
cleanCorpus <- function( txt ) { # the same cleaning pipeline for each sample
    corp <- VCorpus( VectorSource(txt) )
    corp <- tm_map( corp , content_transformer( tolower ) )
    corp <- tm_map( corp , removeNumbers )
    corp <- tm_map( corp , removePunctuation , ucp = TRUE )
    # strip punctuation from the stopword list as well (e.g. "don't" -> "dont"),
    # since punctuation has already been removed from the corpus itself
    corp <- tm_map( corp , removeWords , removePunctuation( stopwords("english") ) )
    tm_map( corp , stripWhitespace )
}
corpBlogs <- cleanCorpus( smplBlogs )
corpNews <- cleanCorpus( smplNews )
corpTwitter <- cleanCorpus( smplTwitter )
The cleaning procedure creates three corpora (collections of
documents) containing the text from blogs, news, and Twitter.
The following examples illustrate how the content changes.
c( smplBlogs[2] , corpBlogs[[2]]$content )
## [1] "Walden Pond, Mt. Rainier, Big Sur, Everglades and so forth;"
## [2] "walden pond mt rainier big sur everglades forth"
c( smplNews[10] , corpNews[[10]]$content )
## [1] "Kids bobbed nearby, clutching the pool's edges, waiting for Chapman's verdict."
## [2] "kids bobbed nearby clutching pools edges waiting chapmans verdict"
c( smplTwitter[1] , corpTwitter[[1]]$content )
## [1] "just wanted to thank you & ask what got you started on your mission?"
## [2] "just wanted thank ask got started mission"
The purpose of this section is to understand the frequencies of words
and word sequences. To achieve this, we use tokenization
(NGramTokenizer from RWeka) to split the documents of our corpora
(strings) into three sets of N-grams: unigrams, bigrams, and trigrams.
The following bar plots show the most frequent unigrams, bigrams, and trigrams in BLOGS.
# helper: build a term-document matrix of N-grams and return sorted term frequencies
ngramFreq <- function( corp , n ) {
    tdm <- TermDocumentMatrix( corp , control=list( tokenize=function(x) NGramTokenizer(x, Weka_control(min=n, max=n)) ) )
    tdm <- removeSparseTerms( tdm , 0.9999 ) # drop very rare terms to keep the matrix manageable
    sort( rowSums( as.matrix(tdm) ) , decreasing=TRUE )
}
Blogs1Freq <- ngramFreq( corpBlogs , 1 )
Blogs2Freq <- ngramFreq( corpBlogs , 2 )
Blogs3Freq <- ngramFreq( corpBlogs , 3 )
par( mfcol=c(1,3) , mar=c(5,8,2,2) , las=1 , oma=c(0,0,2,0) , cex.lab=1.5 )
barplot( Blogs1Freq[1:15], horiz=TRUE, main="Unigrams", cex.names=1.3 )
barplot( Blogs2Freq[1:15], horiz=TRUE, main="Bigrams", cex.names=1.3 , xlab="Frequency" )
barplot( Blogs3Freq[1:15], horiz=TRUE, main="Trigrams", cex.names=1.0 )
title( main="15 Most Frequent N-grams in BLOGS", outer=TRUE, cex.main=1.5 )
box( which="outer", lty="solid" )
The following bar plots show the most frequent unigrams, bigrams, and trigrams in NEWS.
News1Freq <- ngramFreq( corpNews , 1 )
News2Freq <- ngramFreq( corpNews , 2 )
News3Freq <- ngramFreq( corpNews , 3 )
par( mfcol=c(1,3) , mar=c(5,8,2,2) , las=1 , oma=c(0,0,2,0) , cex.lab=1.5 )
barplot( News1Freq[1:15], horiz=TRUE, main="Unigrams", cex.names=1.3 )
barplot( News2Freq[1:15], horiz=TRUE, main="Bigrams", cex.names=1.3 , xlab="Frequency" )
barplot( News3Freq[1:15], horiz=TRUE, main="Trigrams", cex.names=1.0 )
title( main="15 Most Frequent N-grams in NEWS", outer=TRUE, cex.main=1.5 )
box( which="outer", lty="solid" )
The following bar plots show the most frequent unigrams, bigrams, and trigrams in TWITTER.
Twitter1Freq <- ngramFreq( corpTwitter , 1 )
Twitter2Freq <- ngramFreq( corpTwitter , 2 )
Twitter3Freq <- ngramFreq( corpTwitter , 3 )
par( mfcol=c(1,3) , mar=c(5,8,2,2) , las=1 , oma=c(0,0,2,0) , cex.lab=1.5 )
barplot( Twitter1Freq[1:15], horiz=TRUE, main="Unigrams", cex.names=1.3 )
barplot( Twitter2Freq[1:15], horiz=TRUE, main="Bigrams", cex.names=1.3 , xlab="Frequency" )
barplot( Twitter3Freq[1:15], horiz=TRUE, main="Trigrams", cex.names=1.0 )
title( main="15 Most Frequent N-grams in TWITTER", outer=TRUE, cex.main=1.5 )
box( which="outer", lty="solid" )
Sampling has proven to be a viable way to reduce the size of the
training data while providing a fairly accurate approximation of the
original distribution of words.
The exploratory data analysis provides a sufficient basis for
creating a prediction algorithm. Building on this analysis, we will
construct a basic N-gram model (unigram, bigram, and trigram).
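For illustration, a maximum-likelihood trigram estimate can be computed directly from the frequency tables built above (a minimal sketch; probTrigram is a hypothetical helper shown here for the BLOGS tables):
# MLE estimate: P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
probTrigram <- function( w1 , w2 , w3 ) {
    num <- Blogs3Freq[ paste(w1, w2, w3) ]
    den <- Blogs2Freq[ paste(w1, w2) ]
    if ( is.na(num) || is.na(den) ) return( 0 ) # unseen N-gram
    unname( num / den )
}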
The resulting N-grams have been shown to be highly corpus-specific. We
therefore plan to add a parameter indicating the source of the input
phrase, to improve prediction accuracy.
To handle unseen/unobserved N-grams, backoff and smoothing are
possible candidates.
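As a rough illustration of backoff (a hypothetical sketch, not the final implementation), the following function falls back from trigrams to bigrams to the most frequent unigram when a higher-order match is not found, again using the BLOGS frequency tables:
predictNext <- function( phrase ) {
    tokens <- tail( unlist( strsplit( tolower(phrase) , "[ ]+" ) ) , 2 )
    if ( length(tokens) == 2 ) { # 1) trigrams starting with the last two words
        hits <- grep( paste0( "^" , paste(tokens, collapse=" ") , " " ) , names(Blogs3Freq) )
        if ( length(hits) > 0 ) return( sub( ".* " , "" , names(Blogs3Freq)[ hits[1] ] ) )
    }
    # 2) back off to bigrams starting with the last word
    hits <- grep( paste0( "^" , tail(tokens, 1) , " " ) , names(Blogs2Freq) )
    if ( length(hits) > 0 ) return( sub( ".* " , "" , names(Blogs2Freq)[ hits[1] ] ) )
    # 3) fall back to the most frequent unigram
    names(Blogs1Freq)[1]
}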
Building a Shiny app that incorporates our algorithm is the final
stage of the Capstone Project.