This milestone report is the first major task in the Coursera Capstone Project of the Data Science Specialization. One of the final deliverables of the Capstone Project is a Shiny app in which the user enters a word sequence and the app predicts, with fairly good accuracy and within a reasonable response time, the next word the user might enter. The objective of this milestone report is therefore to demonstrate that we are on track to create the prediction algorithm for the app, by presenting the findings of the exploratory data analysis performed thus far and outlining the future work required to develop the text prediction system.
SwiftKey is a predictive text app that uses smart prediction technology to make it easier to enter text on a mobile device. It accomplishes this by predicting the next word the user intends to type and offering suggestions. Through text analysis and cleaning, sampling, prediction and evaluation, the final goal is to build a Shiny app in which the user enters a word sequence and the app predicts, with fairly good accuracy and within a reasonable response time, the next word the user might enter.
While external datasets may be used to augment the model later in the project, the data used at this stage can be downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
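For completeness, below is a sketch of a one-time download and extraction step. The file layout (a "final/en_US" folder inside the zip) is assumed from the standard Coursera-SwiftKey archive, and the download is skipped if the data is already on disk.
zipURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
## Download and unzip only once; mode="wb" keeps the zip intact on Windows.
if (!file.exists(zipFile)) download.file(zipURL, zipFile, mode="wb")
if (!dir.exists("final")) unzip(zipFile)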
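The analysis below relies on a handful of text-mining and plotting packages. The library calls are assumed to have been run in a setup chunk; the package list is inferred from the functions used in this report.
library(tm)           ## Corpus, tm_map, TermDocumentMatrix, DocumentTermMatrix
library(stringi)      ## stri_extract_all_words
library(RWeka)        ## NGramTokenizer, Weka_control
library(ggplot2)      ## frequency histograms
library(wordcloud)    ## wordcloud
library(RColorBrewer) ## brewer.pal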
Supplied for our initial use is a collection of text sources in German, Finnish, English and Russian. For this task, I will only be using the three English files in the “en_US” folder. These three files represent data from blogs, news and Twitter. To get the ball rolling, we start by loading the files.
set.seed(2015)
## File locations, assuming the standard layout of the unzipped Coursera-SwiftKey archive.
fnameBlogs   <- "final/en_US/en_US.blogs.txt"
fnameNews    <- "final/en_US/en_US.news.txt"
fnameTwitter <- "final/en_US/en_US.twitter.txt"
## Given that the files are extremely large, check that the datasets have not already been loaded before reading them in.
if (!exists("textBlogs"))
    textBlogs <- readLines(fnameBlogs, encoding="UTF-8", skipNul=TRUE)
if (!exists("textNews")) {
    textNews <- readLines(fnameNews, encoding="UTF-8", skipNul=TRUE)
    ## Draw a 15,000-line random sample of the news data for the exploratory corpus.
    sampleTextNews <- textNews[sample(1:length(textNews), 15000)]
}
if (!exists("textTwitter"))
    textTwitter <- readLines(fnameTwitter, encoding="UTF-8", skipNul=TRUE)
Given the size of the data, and given my assumption that the news dataset will be the cleanest of the three, I chose it as the representative source for a closer look. By "cleanest" I mean fewer spelling and grammatical errors, fewer problematic language elements (e.g. contractions, emoticons and hashtags, as found in tweets) and, hopefully, an absence of profanity.
However, even a quick visual inspection of the News text at this point reveals that the data is still riddled with raw Unicode sequences, odd non-English characters, extraneous backslashes used for escaping, and so on.
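One possible way to handle this, sketched below but not applied in the analysis that follows, is to coerce the sample to ASCII with iconv(), silently dropping any characters that cannot be converted.
## Sketch only: strip characters that cannot be represented in ASCII.
## (Not applied above; a production cleaning step would be more selective.)
cleanTextNews <- iconv(sampleTextNews, from="UTF-8", to="ASCII", sub="")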
## Get the file sizes
fsizeBlogs <- utils:::format.object_size(file.info(fnameBlogs)$size, "auto")
fsizeNews <- utils:::format.object_size(file.info(fnameNews)$size, "auto")
fsizeTwitter <- utils:::format.object_size(file.info(fnameTwitter)$size, "auto")
## Extract all words from each file so that total and unique word counts can be computed.
blobBlogs <- unlist(stri_extract_all_words(textBlogs))
blobNews <- unlist(stri_extract_all_words(textNews))
blobTwitter <- unlist(stri_extract_all_words(textTwitter))
data.frame(source = c("Blogs", "News", "Twitter"),
File_size = c(fsizeBlogs,
fsizeNews,
fsizeTwitter),
Lines = c(length(textBlogs),
length(textNews),
length(textTwitter)),
Words = c(length(blobBlogs),
length(blobNews),
length(blobTwitter)),
Words_Unique = c(length(unique(blobBlogs)),
length(unique(blobNews)),
length(unique(blobTwitter))))
## source File_size Lines Words Words_Unique
## 1 Blogs 200.4 Mb 899288 37546249 396194
## 2 News 196.3 Mb 77259 2674536 100973
## 3 Twitter 159.4 Mb 2360148 30093410 486660
To simplify the data exploration, I opted to use only the 15,000-line sample of the News dataset drawn above as the source for the text corpus.
sourceEN <- c(sampleTextNews)
## sourceEN <- c(textBlogs, textNews, textTwitter)
corpus <- Corpus(VectorSource(sourceEN))
Now that the corpus is ready, several transformations are required before it can be used for text mining: converting all letters to lower case, removing punctuation marks and numbers, stripping extra whitespace, and removing common English stopwords.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
Next I perform stemming, which essentially removes affixes from words. For example, "run", "runs" and "running" all converge to "run".
corpus <- tm_map(corpus, stemDocument, language="english")
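As a quick illustration of the stemmer's behaviour (tm delegates English stemming to the Porter stemmer in the SnowballC package), the three example words above all reduce to the same stem:
## Expected output: "run" "run" "run"
SnowballC::wordStem(c("run", "runs", "running"), language="english")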
A Term Document Matrix (TDM) is then created to reflect the number of times each word in the corpus occurs in each document.
Note that the Document Term Matrix (DTM) is simply the transpose of the Term Document Matrix; a quick dimension check is sketched after the chunk below.
tdm <- TermDocumentMatrix(corpus)
##findFreqTerms(tdm, 5) ## Display words which occur at least 5 times.
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
## Now we can sum the columns of the matrix and sort to see the most frequently used words.
freqUnigrams <- sort(colSums(dtm2), decreasing=TRUE)
##head(freqUnigrams)
dfUnigrams <- data.frame(word=names(freqUnigrams), freq=freqUnigrams)
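As a quick sanity check of the transposition noted above, the dimensions of the two matrices should simply be reversed:
## The TDM is terms x documents, the DTM is documents x terms.
dim(tdm)
dim(dtm)
identical(dim(tdm), rev(dim(dtm))) ## expected TRUE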
Here I create a word cloud for a more intuitive visual of the words that make up the corpus. This is followed by histograms detailing the most commonly recurring words, two-word and three-word phrases.
## Create a word cloud to visualize the text data.
wordcloud(names(freqUnigrams), freqUnigrams, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
## Plot the unigram frequency histogram to show the top 20 most frequently recurring words.
ggplot(dfUnigrams[1:20, ], aes(x=reorder(word, freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="20 most common unigrams in News")
# NGramTokenizer splits strings into n-grams with given minimal and maximal numbers of grams.
tokenizerBigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
tokenizerTrigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
tokenizerFourgram <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
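As a toy illustration of what the bigram tokenizer produces (assuming RWeka and a working Java installation are available), a short sentence yields its overlapping two-word windows:
## Expected: the five overlapping bigrams
## "to be" "be or" "or not" "not to" "to be"
tokenizerBigram("to be or not to be")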
## Plot the bigram frequency histogram to show the top 20 most frequently recurring 2-word phrases.
bigrams <- TermDocumentMatrix(corpus, control=list(tokenize=tokenizerBigram))
## From our first look at the TDM we know that there are many terms which do not occur very often. It might make sense to simply remove these sparse terms from the analysis.
bigrams2 <- removeSparseTerms(bigrams, 0.999)
freqBigrams2 <- sort(rowSums(as.matrix(bigrams2)), decreasing=TRUE)
dfBigrams2 <- data.frame(word=names(freqBigrams2), freq=freqBigrams2)
ggplot(dfBigrams2[1:20, ], aes(x=reorder(word, freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="20 most common bigrams in News")
## Plot the trigram frequency histogram to show the top 20 most frequently recurring 3-word phrases.
trigrams <- TermDocumentMatrix(corpus, control=list(tokenize=tokenizerTrigram))
trigrams2 <- removeSparseTerms(trigrams, 0.999)
freqTrigrams2 <- sort(rowSums(as.matrix(trigrams2)), decreasing=TRUE)
dfTrigrams2 <- data.frame(word=names(freqTrigrams2), freq=freqTrigrams2)
## head() avoids NA rows (and a ggplot warning) when fewer than 20 trigrams survive the sparsity filter.
ggplot(head(dfTrigrams2, 20), aes(x=reorder(word, freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="20 most common Trigrams in News")
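These n-gram frequency tables are the raw material for the prediction step of the project. As a rough sketch of where this is heading, the function below (predictNext is a name introduced here purely for illustration) does a simple highest-frequency lookup with a crude backoff from trigrams to bigrams, using the dfTrigrams2 and dfBigrams2 data frames built above. The final model will need a larger corpus, proper smoothing and a more principled backoff scheme.
## Sketch: look up the most frequent n-gram starting with the last word(s)
## typed and return its final word. Assumes the prefix contains no regex
## metacharacters and that the input has been lower-cased/cleaned the same
## way as the corpus.
predictNext <- function(phrase, dfTrigrams, dfBigrams) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    n <- length(words)
    if (n >= 2) {
        ## Try trigrams first: match on the last two words typed.
        prefix <- paste(words[n - 1], words[n])
        hits <- dfTrigrams[grepl(paste0("^", prefix, " "), dfTrigrams$word), ]
        if (nrow(hits) > 0) {
            best <- as.character(hits$word[which.max(hits$freq)])
            return(tail(unlist(strsplit(best, " ")), 1))
        }
    }
    ## Back off to bigrams: match on the last word only.
    prefix <- words[n]
    hits <- dfBigrams[grepl(paste0("^", prefix, " "), dfBigrams$word), ]
    if (nrow(hits) > 0) {
        best <- as.character(hits$word[which.max(hits$freq)])
        return(tail(unlist(strsplit(best, " ")), 1))
    }
    NA_character_  ## no match found
}
## Example usage (the result depends on the sampled corpus):
## predictNext("new york", dfTrigrams2, dfBigrams2)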