We are well into an era of big data and small devices. As data continue to grow and user devices continue to shrink, the challenge lies in leveraging the intelligence of big data to make small form-factor devices more usable. A key impediment to the usability of small devices is the lack of traditional keyboards. This shortcoming can be alleviated by providing users with as much "preemptive" assistance with typing as possible. This approach entails predicting words even before they are typed, based on knowledge of the users' historical entries.
The overarching goal of this project is to create a knowledge-based algorithm that predicts and facilitates data entry by users, leveraging corpora compiled from media such as blogs, news, and Twitter feeds. The objective of this specific part of the project is to download, process, and explore the nature of the corpora in order to prepare for the creation of the predictive application in Shiny.
This paper represents the first of two parts of this project. As such, the focus will be limited to downloading, inspecting, preprocessing, and exploring the data, and then laying a foundation for the predictive algorithm. The implementation of the algorithm will be dealt with in the second and final part of this project.
The code segments for loading the libraries required for the preparation, processing, and analysis of the data, along with corresponding descriptions, are as follows:
Several libraries are needed, including tm, qdap, knitr, wordcloud, and RWeka. Package startup messages are suppressed to improve the readability of the document.
if (!("ggplot2") %in% rownames(installed.packages())) {
install.packages("ggplot2")
} else {
library(ggplot2)
}
if (!("xtable") %in% rownames(installed.packages())) {
install.packages("xtable")
} else {
library(xtable)
}
if (!("qdap") %in% rownames(installed.packages())) {
install.packages("qdap")
} else {
library(qdap)
}
if (!("SnowballC") %in% rownames(installed.packages())) {
install.packages("SnowballC")
} else {
library(SnowballC)
}
if (!("tm") %in% rownames(installed.packages())) {
install.packages("tm")
} else {
library(tm)
}
if (!("knitr") %in% rownames(installed.packages())) {
install.packages("knitr")
} else {
library(knitr)
}
if (!("RWeka") %in% rownames(installed.packages())) {
install.packages("RWeka")
} else {
library(RWeka)
}
if (!("wordcloud") %in% rownames(installed.packages())) {
install.packages("wordcloud")
} else {
library(wordcloud)
}
if (!("Rgraphviz") %in% rownames(installed.packages())) {
source("http://bioconductor.org/biocLite.R")
biocLite("Rgraphviz")
} else {
library(Rgraphviz)
}
The corpora for this project are made available at location.
The provided corpora comprise three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. The steps involved in downloading the corpora are shown below; however, because this operation is performed only once, the code segments have been commented out to avoid repeated downloads. The Corpus command from the “tm” package assembles the three text files into a single corpus object. The terms-to-block file was downloaded from location. Here it is prepared and loaded for use downstream in processing the corpora.
##fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
##download.file(fileUrl, destfile="~/DATA_SCIENCE/JHU/CAPSTONE/Project/Task0/Coursera-SwiftKey.zip")
##dateDownloaded <- date()
##dateDownloaded
##Unzip the data and review the resulting files with directory structure
##unzip("Coursera-SwiftKey.zip")
##unzip("Coursera-SwiftKey.zip", list=TRUE)
setwd("C:/Users/Murtuza Ali/Desktop/DATA_SCIENCE/JHU/CAPSTONE/Project/Task0")
##The path to the raw corpus data for the United States is ./final/en_US/
wd = file.path(".", "final", "en_US")
##Corpus command from the "tm" package builds the corpora. We use UTF-8 encoding, the dominant format for documents on the internet
doc = Corpus(DirSource(wd))
bleepers = read.csv("./MISC/Terms-to-Block.csv", skip=4)
bleepers = bleepers[,2]
bleepers = gsub(",","",bleepers)
The first two lines of the blog corpus are as follows:
blog_txt <- doc[[1]][[1]]
blog_txt[1:2]
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan âgodsâ."
## [2] "We love you Mr. Brown."
The first two lines of the news corpus are as follows:
news_txt <- doc[[2]][[1]]
news_txt[1:2]
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
The first two lines of the twitter corpus are as follows:
twitter_txt <- doc[[3]][[1]]
twitter_txt[1:2]
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
The table below illustrates the number of words in each corpus:
blogWn = word_count(blog_txt, byrow=FALSE)
newsWn = word_count(news_txt, byrow=FALSE)
twitterWn = word_count(twitter_txt, byrow=FALSE)
df = data.frame("Number of Words in Blog" = blogWn,
"Number of Words in News" = newsWn, "Number of Words in Twitter" = twitterWn)
colnames(df) = c("Number of Words in Blog", "Number of Words in News", "Number of Words in Twitter")
print(xtable(df, display = c("s","d","d","d")),
type="html")
|   | Number of Words in Blog | Number of Words in News | Number of Words in Twitter |
|---|---|---|---|
| 1 | 36893516 | 2579113 | 29430648 |
The table below illustrates the number of lines in each corpus:
blog_lines = as.numeric(unlist(lapply(blog_txt, nchar)))
blogLn = length(blog_lines)
news_lines = as.numeric(unlist(lapply(news_txt, nchar)))
newsLn = length(news_lines)
twitter_lines = as.numeric(unlist(lapply(twitter_txt, nchar)))
twitterLn = length(twitter_lines)
df = data.frame("Number of Lines in Blog" = blogLn,
"Number of Lines in News" = newsLn, "Number of Lines in Twitter" = twitterLn)
colnames(df) = c("Number of Lines in Blog", "Number of Lines in News", "Number of Lines in Twitter")
print(xtable(df, display = c("s","d","d","d")),
type="html")
|   | Number of Lines in Blog | Number of Lines in News | Number of Lines in Twitter |
|---|---|---|---|
| 1 | 899288 | 77259 | 2360148 |
The table below illustrates the number of characters in the longest line in each corpus:
blogCn = blog_lines[head(order(blog_lines, decreasing=TRUE),1)]
newsCn = news_lines[head(order(news_lines, decreasing=TRUE),1)]
twitterCn = twitter_lines[head(order(twitter_lines, decreasing=TRUE),1)]
df = data.frame("Number of Characters in the Longest Line in Blog" = blogCn,
"Number of Characters in the Longest Line in News" = newsCn, "Number of Characters in the Longest Line in Twitter" = twitterCn)
colnames(df) = c("Number of Characters in the Longest Line in Blog", "Number of Characters in the Longest Line in News", "Number of Characters in the Longest Line in Twitter")
print(xtable(df, display = c("s","d","d","d")),
type="html")
|   | Number of Characters in the Longest Line in Blog | Number of Characters in the Longest Line in News | Number of Characters in the Longest Line in Twitter |
|---|---|---|---|
| 1 | 40835 | 5760 | 213 |
As the word counts show, the corpora are too large to allow for efficient and effective processing within the R environment on a local machine. Sampling is therefore necessary, so we draw a sample (n = 5500 lines) from each of the original corpora for further analysis. An appropriate seed is set for reproducibility. The objects are prefixed with “pre” to denote “before processing”.
set.seed(1001)
pre_blog <- sample(doc[[1]][[1]],5500)
pre_news <- sample(doc[[2]][[1]],5500)
pre_twitter <- sample(doc[[3]][[1]],5500)
The next step is preprocessing, which eliminates unwanted terms, URLs, punctuation, and other noise so that the final word prediction model is not only more usable but also consistent with standard natural language processing (NLP) practice.
This step removes the bleep words.
doc = tm_map(doc, removeWords, bleepers, mc.cores=1)
For this run, punctuation is handled by a custom transformer that lowercases the text and replaces punctuation marks with spaces; the standard removePunctuation transformer is commented out below but can be switched back on as needed.
# http://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files
# http://stackoverflow.com/questions/18153504/removing-non-english-text-from-corpus-in-r-using-tm
removeNonASCII <- content_transformer(function(x) iconv(x, "latin1", "ASCII", sub=""))
doc = tm_map(doc, removeNonASCII, mc.cores=1)
# http://stackoverflow.com/questions/14281282/
# how-to-write-custom-removepunctuation-function-to-better-deal-with-unicode-cha
# http://stackoverflow.com/questions/8697079/remove-all-punctuation-except-apostrophes-in-r
customRemovePunctuation <- content_transformer(function(x) {
x <- gsub("[[:punct:]]"," ",tolower(x))
return(x)
})
doc = tm_map(doc, customRemovePunctuation, mc.cores=1)
doc = tm_map(doc, content_transformer(tolower), mc.cores=1)
doc = tm_map(doc, removeNumbers, mc.cores=1)
#doc = tm_map(doc, removePunctuation, mc.cores=1)
doc = tm_map(doc, stripWhitespace, mc.cores=1)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
removeWWW <- function(x) gsub("www[[:alnum:]]*", "", x)
doc = tm_map(doc, content_transformer(removeURL))
doc = tm_map(doc, content_transformer(removeWWW))
specialCharFilter = content_transformer(function(x, pattern) gsub(pattern, " ", x))
# The mis-encoded quote characters originally listed in this pattern are already stripped by removeNonASCII above
doc = tm_map(doc, specialCharFilter, "/|@|\\|")
For this run, stopword removal and stemming have not been applied; however, they can be switched on or off as needed.
#doc = tm_map(doc, removeWords, stopwords("english"))
#doc = tm_map(doc, stemDocument)
With preprocessing complete, we draw a new sample (n = 5500 lines) from each of the preprocessed corpora for further exploration. An appropriate seed is set for reproducibility. The objects are prefixed with “post” to denote “after processing”.
set.seed(5005)
doc[[1]][[1]] = sample(doc[[1]][[1]],5500)
doc[[2]][[1]] = sample(doc[[2]][[1]],5500)
doc[[3]][[1]] = sample(doc[[3]][[1]],5500)
post_blog <- doc[[1]][[1]]
post_news <- doc[[2]][[1]]
post_twitter <- doc[[3]][[1]]
Extract the top 10 unigram tokens in the blog, news, and twitter corpora “before” and “after” preprocessing. The visualize() function is defined to make the code more efficient, since the same segment is repeated for each corpus.
visualize <- function(feed){
corpus <- VCorpus(VectorSource(feed))
dtm <- DocumentTermMatrix(corpus)
token <- sort(apply(dtm,2,sum),decreasing = TRUE)
freq <- findFreqTerms(dtm,10)
result <- list(token,freq)
result
}
pre_blogO <- visualize(pre_blog)
pre_blog_tokens <- pre_blogO[[1]]
pre_blog_freq <- pre_blogO[[2]]
post_blogO <- visualize(post_blog)
post_blog_tokens <- post_blogO[[1]]
post_blog_freq <- post_blogO[[2]]
pre_newsO <- visualize(pre_news)
pre_news_tokens <- pre_newsO[[1]]
pre_news_freq <- pre_newsO[[2]]
post_newsO <- visualize(post_news)
post_news_tokens <- post_newsO[[1]]
post_news_freq <- post_newsO[[2]]
pre_twitterO <- visualize(pre_twitter)
pre_twitter_tokens <- pre_twitterO[[1]]
pre_twitter_freq <- pre_twitterO[[2]]
post_twitterO <- visualize(post_twitter)
post_twitter_tokens <- post_twitterO[[1]]
post_twitter_freq <- post_twitterO[[2]]
Plot the Top 10 Unigram tokens for each corpus “before” and “after” processing.
par(mfrow=c(3,2))
barplot(pre_blog_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram Blog Tokens: Before Preprocessing', names.arg=names(pre_blog_tokens)[1:10], col="blue", las=2)
barplot(post_blog_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram Blog Tokens: After Preprocessing', names.arg=names(post_blog_tokens)[1:10], col="green", las=2)
barplot(pre_news_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram News Tokens: Before Preprocessing', names.arg=names(pre_news_tokens)[1:10], col="blue", las=2)
barplot(post_news_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram News Tokens: After Preprocessing', names.arg=names(post_news_tokens)[1:10], col="green", las=2)
barplot(pre_twitter_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram Twitter Tokens: Before Preprocessing', names.arg=names(pre_twitter_tokens)[1:10], col="blue", las=2)
barplot(post_twitter_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram Twitter Tokens: After Preprocessing', names.arg=names(post_twitter_tokens)[1:10], col="green", las=2)
Create a Document Term Matrix for further exploration.
dtm = DocumentTermMatrix(doc)
wordfreq = colSums(as.matrix(dtm))
Display word clouds of the preprocessed corpora, juxtaposing words that occur at least 100 times with those that occur at least 500 times.
par(mfrow=c(1,2))
wordcloud(names(wordfreq), wordfreq, min.freq=100, colors=brewer.pal(6, "Dark2"))
wordcloud(names(wordfreq), wordfreq, min.freq=500, colors=brewer.pal(6, "Dark2"))
Explore and plot the cumulative contribution of unique words relative to the total number of words in each corpus.
par(mfrow=c(1,1))
plot(cumsum(post_blog_tokens)/sum(post_blog_tokens),type="l",col="black",
xlab="Number of words",
ylab="Ratio of unique words to total number of words")
lines(cumsum(post_news_tokens)/sum(post_news_tokens),type="l",col="blue")
lines(cumsum(post_twitter_tokens)/sum(post_twitter_tokens),type="l",col="green")
legend("bottomright",legend = c("blog","news","twitter"),
col=c("black","blue","green"),lwd=2)
As the plot indicates, roughly the 5000 most frequent unique words account for more than 80 percent of all word occurrences in each corpus. This supports the sampling strategy used for model building.
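As a quick numerical check of this observation, the cumulative frequencies behind the plot can be queried directly. The short sketch below reuses post_blog_tokens from above; the 50% and 90% thresholds are illustrative choices rather than part of the original analysis.
# Coverage check on the blog sample: how many of the most frequent unigrams
# are needed to account for 50% and 90% of all word occurrences?
coverage <- cumsum(post_blog_tokens) / sum(post_blog_tokens)
min(which(coverage >= 0.5))   # number of top words covering 50% of occurrences
min(which(coverage >= 0.9))   # number of top words covering 90% of occurrences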
Extract the top 10 bigram and trigram tokens from the preprocessed blog, news, and twitter corpora.
tken = NGramTokenizer(post_blog, Weka_control(min = 2, max = 2))
tkCnt = table(tken)
biblog_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
tken = NGramTokenizer(post_news, Weka_control(min = 2, max = 2))
tkCnt = table(tken)
binews_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
tken = NGramTokenizer(post_twitter, Weka_control(min = 2, max = 2))
tkCnt = table(tken)
bitwitter_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
tken = NGramTokenizer(post_blog, Weka_control(min = 3, max = 3))
tkCnt = table(tken)
triblog_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
tken = NGramTokenizer(post_news, Weka_control(min = 3, max = 3))
tkCnt = table(tken)
trinews_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
tken = NGramTokenizer(post_twitter, Weka_control(min = 3, max = 3))
tkCnt = table(tken)
tritwitter_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
Plot bigram and trigram tokens in the preprocessed blog corpus.
par(mfrow=c(1,2), mar = c(9,6,4,2))
barplot(biblog_dt[1:10], ylab='Frequency', main= 'TOP 10 bigram blog tokens', names.arg=names(biblog_dt)[1:10], col="green", las=2)
barplot(triblog_dt[1:10], ylab='Frequency', main= 'TOP 10 trigram blog tokens', names.arg=names(triblog_dt)[1:10], col="red", las=2)
Plot bigram and trigram tokens in the preprocessed news corpus.
par(mfrow=c(1,2), mar = c(9,6,4,2))
barplot(binews_dt[1:10], ylab='Frequency', main= 'TOP 10 bigram news tokens', names.arg=names(binews_dt)[1:10], col="green", las=2)
barplot(trinews_dt[1:10], ylab='Frequency', main= 'TOP 10 trigram news tokens', names.arg=names(trinews_dt)[1:10], col="red", las=2)
Plot bigram and trigram tokens in the preprocessed twitter corpus.
par(mfrow=c(1,2), mar = c(9,6,4,2))
barplot(bitwitter_dt[1:10], ylab='Frequency', main= 'TOP 10 bigram twitter tokens', names.arg=names(bitwitter_dt)[1:10], col="green", las=2)
barplot(tritwitter_dt[1:10], ylab='Frequency', main= 'TOP 10 trigram twitter tokens', names.arg=names(tritwitter_dt)[1:10], col="red", las=2)
The bigrams and trigrams observed in the plots above will form the underpinning of the prediction model. Further, correlation between frequently occurring words will inform the prediction. Here, we build a correlation map (correlation threshold r = 0.7) for the first 10 terms that occur at least 600 times.
plot(dtm, terms=findFreqTerms(dtm, lowfreq=600)[1:10], corThreshold=0.7)
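To make concrete how the n-gram counts above could eventually drive prediction, the sketch below builds a full trigram count table for the twitter sample and returns the most frequent continuations of a two-word prefix. This is only a minimal illustration of the intended approach, not the final algorithm; predict_next() is a hypothetical helper introduced here for demonstration.
# A minimal sketch (illustrative only): rank the trigrams that begin with a
# given two-word prefix and return the third word of the top matches.
trigram_counts <- table(NGramTokenizer(post_twitter, Weka_control(min = 3, max = 3)))
predict_next <- function(prefix, counts, n = 3) {
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]   # trigrams matching the prefix
    hits <- head(sort(hits, decreasing = TRUE), n)                   # keep the n most frequent
    sapply(strsplit(names(hits), " "), function(w) w[3])             # predicted next words
}
predict_next("thanks for", trigram_counts)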
The objective of this part of the project was to download, process, and explore the corpora in order to prepare for the creation of the word prediction application in Shiny. This exploratory exercise provided a view into how NLP corpora are built and explored, and revealed the nature of the underlying text. This knowledge will be vital for the next stage of the project, which focuses on the creation of a knowledge-based algorithm that predicts and facilitates data entry by users.