We are well into an era of big data and small devices. As data continue to grow and user devices continue to shrink, the challenge lies in leveraging the intelligence of big data to make small form-factor devices more usable. A key impediment to the usability of small devices is the lack of traditional keyboards. This shortcoming can be alleviated by providing users with as much "preemptive" assistance with typing as possible. This approach entails predicting words even before they are typed, based on knowledge of the users' historical entries.
The overarching goal of this project is to create a knowledge-based algorithm that predicts and facilitates data entry by users, leveraging corpora compiled from media such as blogs, news, and Twitter feeds. The objective of this specific part of the project is to download, process, and explore the nature of the corpora in order to prepare for the creation of the predictive application in Shiny.
This paper represents the first of two parts of this project. As such, the focus will be limited to downloading, inspecting, preprocessing, and exploring the data, and then laying a foundation for the predictive algorithm. The implementation of the algorithm will be dealt with in the second and final part of this project.
The code segments for loading the libraries required for the preparation, processing, and analysis of the data, along with corresponding descriptions, are as follows:
Several libraries are needed, including tm, qdap, knitr, wordcloud, and RWeka. Package startup messages are suppressed to improve the readability of the document.
if (!("ggplot2") %in% rownames(installed.packages())) {
install.packages("ggplot2")
} else {
library(ggplot2)
}
if (!("xtable") %in% rownames(installed.packages())) {
install.packages("xtable")
} else {
library(xtable)
}
if (!("qdap") %in% rownames(installed.packages())) {
install.packages("qdap")
} else {
library(qdap)
}
if (!("SnowballC") %in% rownames(installed.packages())) {
install.packages("SnowballC")
} else {
library(SnowballC)
}
if (!("tm") %in% rownames(installed.packages())) {
install.packages("tm")
} else {
library(tm)
}
if (!("knitr") %in% rownames(installed.packages())) {
install.packages("knitr")
} else {
library(knitr)
}
if (!("RWeka") %in% rownames(installed.packages())) {
install.packages("RWeka")
} else {
library(RWeka)
}
if (!("wordcloud") %in% rownames(installed.packages())) {
install.packages("wordcloud")
} else {
library(wordcloud)
}
if (!("Rgraphviz") %in% rownames(installed.packages())) {
source("http://bioconductor.org/biocLite.R")
biocLite("Rgraphviz")
} else {
library(Rgraphviz)
}
The corpora for this project are made available at location.
The provided corpora comprise three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. The steps involved in downloading the corpora are shown below; however, because this operation is performed only once, the code segments have been commented out to avoid repeated downloads. The Corpus command from the “tm” package assembles the three text files into a single corpus object. The terms-to-block file was downloaded from location. Here it is prepared and loaded for use downstream in processing the corpora.
##fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
##download.file(fileUrl, destfile="~/DATA_SCIENCE/JHU/CAPSTONE/Project/Task0/Coursera-SwiftKey.zip")
##dateDownloaded <- date()
##dateDownloaded
##Unzip the data and review the resulting files with directory structure
##unzip("Coursera-SwiftKey.zip")
##unzip("Coursera-SwiftKey.zip", list=TRUE)
setwd("C:/Users/Murtuza Ali/Desktop/DATA_SCIENCE/JHU/CAPSTONE/Project/Task0")
##The path to the raw corpus data for the United States is ./final/en_US/
wd = file.path(".", "final", "en_US")
##Corpus command from the "tm" package builds the corpora. We use UTF-8 encoding, the dominant format for documents on the internet
doc = Corpus(DirSource(wd))
bleepers = read.csv("./MISC/Terms-to-Block.csv", skip=4)
bleepers = bleepers[,2]
bleepers = gsub(",","",bleepers)
The first two lines of the blog corpus are as follows:
blog_txt <- doc[[1]][[1]]
blog_txt[1:2]
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan âgodsâ."
## [2] "We love you Mr. Brown."
The first two lines of the news corpus are as follows:
news_txt <- doc[[2]][[1]]
news_txt[1:2]
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
The first two lines of the twitter corpus are as follows:
twitter_txt <- doc[[3]][[1]]
twitter_txt[1:2]
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
The table below illustrates the number of words in each corpus:
blogWn = word_count(blog_txt, byrow=FALSE)
newsWn = word_count(news_txt, byrow=FALSE)
twitterWn = word_count(twitter_txt, byrow=FALSE)
df = data.frame("Number of Words in Blog" = blogWn,
"Number of Words in News" = newsWn, "Number of Words in Twitter" = twitterWn)
colnames(df) = c("Number of Words in Blog", "Number of Words in News", "Number of Words in Twitter")
print(xtable(df, display = c("s","d","d","d")),
type="html")
|   | Number of Words in Blog | Number of Words in News | Number of Words in Twitter |
|---|---|---|---|
| 1 | 36893516 | 2579113 | 29430648 |
The table below illustrates the number of lines in each corpus:
blog_lines = as.numeric(unlist(lapply(blog_txt, nchar)))
blogLn = length(blog_lines)
news_lines = as.numeric(unlist(lapply(news_txt, nchar)))
newsLn = length(news_lines)
twitter_lines = as.numeric(unlist(lapply(twitter_txt, nchar)))
twitterLn = length(twitter_lines)
df = data.frame("Number of Lines in Blog" = blogLn,
"Number of Lines in News" = newsLn, "Number of Lines in Twitter" = twitterLn)
colnames(df) = c("Number of Lines in Blog", "Number of Lines in News", "Number of Lines in Twitter")
print(xtable(df, display = c("s","d","d","d")),
type="html")
|   | Number of Lines in Blog | Number of Lines in News | Number of Lines in Twitter |
|---|---|---|---|
| 1 | 899288 | 77259 | 2360148 |
The table below illustrates the number of characters in the longest line in each corpus:
blogCn = blog_lines[head(order(blog_lines, decreasing=TRUE),1)]
newsCn = news_lines[head(order(news_lines, decreasing=TRUE),1)]
twitterCn = twitter_lines[head(order(twitter_lines, decreasing=TRUE),1)]
df = data.frame("Number of Characters in the Longest Line in Blog" = blogCn,
"Number of Characters in the Longest Line in News" = newsCn, "Number of Characters in the Longest Line in Twitter" = twitterCn)
colnames(df) = c("Number of Characters in the Longest Line in Blog", "Number of Characters in the Longest Line in News", "Number of Characters in the Longest Line in Twitter")
print(xtable(df, display = c("s","d","d","d")),
type="html")
|   | Number of Characters in the Longest Line in Blog | Number of Characters in the Longest Line in News | Number of Characters in the Longest Line in Twitter |
|---|---|---|---|
| 1 | 40835 | 5760 | 213 |
As the word counts show, the corpora are too large to allow for efficient and effective processing within the R environment on a local machine. Sampling is therefore necessary, so we draw a sample (n = 5500 lines) from each of the original corpora for further analysis. An appropriate seed is set for reproducibility. The objects are prefixed with “pre” to denote “before processing”.
set.seed(1001)
pre_blog <- sample(doc[[1]][[1]],5500)
pre_news <- sample(doc[[2]][[1]],5500)
pre_twitter <- sample(doc[[3]][[1]],5500)
The next step is preprocessing, which eliminates unwanted terms, URLs, punctuation, and other noise so that the final word prediction model is not only more usable but also consistent with standard natural language processing (NLP) practice.
This step removes the bleep words.
doc = tm_map(doc, removeWords, bleepers, mc.cores=1)
For this run, punctuation is handled by a custom transformer that lowercases the text and replaces punctuation marks with spaces; the standard removePunctuation transformer is commented out below but can be switched back on as needed.
# http://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files
# http://stackoverflow.com/questions/18153504/removing-non-english-text-from-corpus-in-r-using-tm
removeNonASCII <- content_transformer(function(x) iconv(x, "latin1", "ASCII", sub=""))
doc = tm_map(doc, removeNonASCII, mc.cores=1)
# http://stackoverflow.com/questions/14281282/
# how-to-write-custom-removepunctuation-function-to-better-deal-with-unicode-cha
# http://stackoverflow.com/questions/8697079/remove-all-punctuation-except-apostrophes-in-r
customRemovePunctuation <- content_transformer(function(x) {
x <- gsub("[[:punct:]]"," ",tolower(x))
return(x)
})
doc = tm_map(doc, customRemovePunctuation, mc.cores=1)
doc = tm_map(doc, content_transformer(tolower), mc.cores=1)
doc = tm_map(doc, removeNumbers, mc.cores=1)
#doc = tm_map(doc, removePunctuation, mc.cores=1)
doc = tm_map(doc, stripWhitespace, mc.cores=1)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
removeWWW <- function(x) gsub("www[[:alnum:]]*", "", x)
doc = tm_map(doc, content_transformer(removeURL))
doc = tm_map(doc, content_transformer(removeWWW))
specialCharFilter = content_transformer(function(x, pattern) gsub(pattern, " ", x))
# The mis-encoded quote characters originally listed in this pattern are already stripped by removeNonASCII above
doc = tm_map(doc, specialCharFilter, "/|@|\\|")
For this run, stopword removal and stemming have not been applied; however, they can be switched on or off as needed.
#doc = tm_map(doc, removeWords, stopwords("english"))
#doc = tm_map(doc, stemDocument)
With preprocessing complete, we draw a new sample (n = 5500 lines) from each of the preprocessed corpora for further exploration. An appropriate seed is set for reproducibility. The objects are prefixed with “post” to denote “after processing”.
set.seed(5005)
doc[[1]][[1]] = sample(doc[[1]][[1]],5500)
doc[[2]][[1]] = sample(doc[[2]][[1]],5500)
doc[[3]][[1]] = sample(doc[[3]][[1]],5500)
post_blog <- doc[[1]][[1]]
post_news <- doc[[2]][[1]]
post_twitter <- doc[[3]][[1]]
Extract the top 10 unigram tokens in the blog, news, and twitter corpora “before” and “after” preprocessing. The visualize() function is defined to make the code more efficient, since the same segment is repeated for each corpus.
visualize <- function(feed){
corpus <- VCorpus(VectorSource(feed))
dtm <- DocumentTermMatrix(corpus)
token <- sort(apply(dtm,2,sum),decreasing = TRUE)
freq <- findFreqTerms(dtm,10)
result <- list(token,freq)
result
}
pre_blogO <- visualize(pre_blog)
pre_blog_tokens <- pre_blogO[[1]]
pre_blog_freq <- pre_blogO[[2]]
post_blogO <- visualize(post_blog)
post_blog_tokens <- post_blogO[[1]]
post_blog_freq <- post_blogO[[2]]
pre_newsO <- visualize(pre_news)
pre_news_tokens <- pre_newsO[[1]]
pre_news_freq <- pre_newsO[[2]]
post_newsO <- visualize(post_news)
post_news_tokens <- post_newsO[[1]]
post_news_freq <- post_newsO[[2]]
pre_twitterO <- visualize(pre_twitter)
pre_twitter_tokens <- pre_twitterO[[1]]
pre_twitter_freq <- pre_twitterO[[2]]
post_twitterO <- visualize(post_twitter)
post_twitter_tokens <- post_twitterO[[1]]
post_twitter_freq <- post_twitterO[[2]]
Plot the Top 10 Unigram tokens for each corpus “before” and “after” processing.
par(mfrow=c(3,2))
barplot(pre_blog_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram Blog Tokens: Before Preprocessing', names.arg=names(pre_blog_tokens)[1:10], col="blue", las=2)
barplot(post_blog_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram Blog Tokens: After Preprocessing', names.arg=names(post_blog_tokens)[1:10], col="green", las=2)
barplot(pre_news_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram News Tokens: Before Preprocessing', names.arg=names(pre_news_tokens)[1:10], col="blue", las=2)
barplot(post_news_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram News Tokens: After Preprocessing', names.arg=names(post_news_tokens)[1:10], col="green", las=2)
barplot(pre_twitter_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram Twitter Tokens: Before Preprocessing', names.arg=names(pre_twitter_tokens)[1:10], col="blue", las=2)
barplot(post_twitter_tokens[1:10], xlab='Tokens', ylab='Frequency', main= 'Top 10 Unigram Twitter Tokens: After Preprocessing', names.arg=names(post_twitter_tokens)[1:10], col="green", las=2)
Create a Document Term Matrix for further exploration.
dtm = DocumentTermMatrix(doc)
wordfreq = colSums(as.matrix(dtm))
Display word clouds of the preprocessed corpora, juxtaposing words that occur at least 100 times with those that occur at least 500 times.
par(mfrow=c(1,2))
wordcloud(names(wordfreq), wordfreq, min.freq=100, colors=brewer.pal(6, "Dark2"))
wordcloud(names(wordfreq), wordfreq, min.freq=500, colors=brewer.pal(6, "Dark2"))
Explore and plot the cumulative contribution of unique words relative to the total number of words in each corpus.
par(mfrow=c(1,1))
plot(cumsum(post_blog_tokens)/sum(post_blog_tokens),type="l",col="black",
xlab="Number of words",
ylab="Ratio of unique words to total number of words")
lines(cumsum(post_news_tokens)/sum(post_news_tokens),type="l",col="blue")
lines(cumsum(post_twitter_tokens)/sum(post_twitter_tokens),type="l",col="green")
legend("bottomright",legend = c("blog","news","twitter"),
col=c("black","blue","green"),lwd=2)
As the plot indicates, roughly the 5000 most frequent unique words account for more than 80 percent of all word occurrences in each corpus. This supports the sampling strategy used for model building.
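As a quick numerical check of this observation, the cumulative frequencies behind the plot can be queried directly. The short sketch below reuses post_blog_tokens from above; the 50% and 90% thresholds are illustrative choices rather than part of the original analysis.
# Coverage check on the blog sample: how many of the most frequent unigrams
# are needed to account for 50% and 90% of all word occurrences?
coverage <- cumsum(post_blog_tokens) / sum(post_blog_tokens)
min(which(coverage >= 0.5))   # number of top words covering 50% of occurrences
min(which(coverage >= 0.9))   # number of top words covering 90% of occurrences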
Extract the top 10 bigram and trigram tokens from the preprocessed blog, news, and twitter corpora.
tken = NGramTokenizer(post_blog, Weka_control(min = 2, max = 2))
tkCnt = table(tken)
biblog_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
tken = NGramTokenizer(post_news, Weka_control(min = 2, max = 2))
tkCnt = table(tken)
binews_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
tken = NGramTokenizer(post_twitter, Weka_control(min = 2, max = 2))
tkCnt = table(tken)
bitwitter_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
tken = NGramTokenizer(post_blog, Weka_control(min = 3, max = 3))
tkCnt = table(tken)
triblog_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
tken = NGramTokenizer(post_news, Weka_control(min = 3, max = 3))
tkCnt = table(tken)
trinews_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
tken = NGramTokenizer(post_twitter, Weka_control(min = 3, max = 3))
tkCnt = table(tken)
tritwitter_dt = head(tkCnt[order(tkCnt, decreasing=TRUE)],10)
Plot bigram and trigram tokens in the preprocessed blog corpus.
par(mfrow=c(1,2), mar = c(9,6,4,2))
barplot(biblog_dt[1:10], ylab='Frequency', main= 'TOP 10 bigram blog tokens', names.arg=names(biblog_dt)[1:10], col="green", las=2)
barplot(triblog_dt[1:10], ylab='Frequency', main= 'TOP 10 trigram blog tokens', names.arg=names(triblog_dt)[1:10], col="red", las=2)
Plot bigram and trigram tokens in the preprocessed news corpus.
par(mfrow=c(1,2), mar = c(9,6,4,2))
barplot(binews_dt[1:10], ylab='Frequency', main= 'TOP 10 bigram news tokens', names.arg=names(binews_dt)[1:10], col="green", las=2)
barplot(trinews_dt[1:10], ylab='Frequency', main= 'TOP 10 trigram news tokens', names.arg=names(trinews_dt)[1:10], col="red", las=2)
Plot bigram and trigram tokens in the preprocessed twitter corpus.
par(mfrow=c(1,2), mar = c(9,6,4,2))
barplot(bitwitter_dt[1:10], ylab='Frequency', main= 'TOP 10 bigram twitter tokens', names.arg=names(bitwitter_dt)[1:10], col="green", las=2)
barplot(tritwitter_dt[1:10], ylab='Frequency', main= 'TOP 10 trigram twitter tokens', names.arg=names(tritwitter_dt)[1:10], col="red", las=2)
The bigrams and trigrams observed in the plots above will form the underpinning of the prediction model. Further, correlation between frequently occurring words will inform the prediction. Here, we build a correlation map (correlation threshold r = 0.7) for the first 10 terms that occur at least 600 times.
plot(dtm, terms=findFreqTerms(dtm, lowfreq=600)[1:10], corThreshold=0.7)
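To make concrete how the n-gram counts above could eventually drive prediction, the sketch below builds a full trigram count table for the twitter sample and returns the most frequent continuations of a two-word prefix. This is only a minimal illustration of the intended approach, not the final algorithm; predict_next() is a hypothetical helper introduced here for demonstration.
# A minimal sketch (illustrative only): rank the trigrams that begin with a
# given two-word prefix and return the third word of the top matches.
trigram_counts <- table(NGramTokenizer(post_twitter, Weka_control(min = 3, max = 3)))
predict_next <- function(prefix, counts, n = 3) {
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]   # trigrams matching the prefix
    hits <- head(sort(hits, decreasing = TRUE), n)                   # keep the n most frequent
    sapply(strsplit(names(hits), " "), function(w) w[3])             # predicted next words
}
predict_next("thanks for", trigram_counts)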
The objective of this part of the project was to download, process, and explore the corpora in order to prepare for the creation of the word prediction application in Shiny. This exploratory exercise provided a view into how NLP corpora are built and explored, and revealed the nature of the underlying text. This knowledge will be vital for the next stage of the project, which focuses on the creation of a knowledge-based algorithm that predicts and facilitates data entry by users.