This report presents a milestone review covering the initial exploratory data analysis of the supplied SwiftKey corpora for the Coursera Data Science Capstone project. The purpose of the report is to summarise the basic characteristics of the data, present the findings of the exploratory analysis, and outline the planned approach for the predictive model and the accompanying Shiny application.
The capstone dataset is a collection of corpora supplied for the Data Science Capstone in the Coursera Data Science Specialization; it was obtained directly from the course website.
The dataset is provided as a single zip archive.
Within the archive there is a corpus for each of the locales de_DE, en_US, fi_FI and ru_RU. For each locale, the corpus consists of text from blogs, news and Twitter sources. Although multiple corpora are provided, the initial analysis and exploration focuses on the American English (en_US) file set.
Once the corpus has been loaded, some basic statistics are gathered for each of the files. The RWeka package is then used to extract the most frequently occurring n-grams, which provide an overview of the common n-grams in the data, with a view to establishing the soundness of using n-gram frequencies as the basis for a predictive model.
| File | Size (MB) | Lines | Words | Characters |
|---|---|---|---|---|
| en_US.blogs.txt | 200.42 | 899288 | 38154238 | 208361438 |
| en_US.news.txt | 196.28 | 77259 | 2693898 | 15683765 |
| en_US.twitter.txt | 159.36 | 2360148 | 30218166 | 162385035 |
The supplied dataset is too large to analyse in full in a timely fashion. For the initial analysis, a training set was selected consisting of a random sample of 1% of the lines in each of the en_US files (the sampling step is sketched in the appendix code).
To understand the data and use it to explore and build a predictive model, we first need to clean it. Using the sampled training set as a starting point, we strip out content that does not directly contribute to the analysis by:

- removing punctuation
- removing numbers
- converting all text to lower case
- removing English stop words
- stripping excess whitespace
- converting the result to plain text documents
After the sample has been cleaned and loaded, we perform additional analysis by decomposing the sample text into n-grams (unigrams, bigrams and trigrams) and examining their frequencies. For each n-gram size, the top 50 terms by frequency are plotted, along with word clouds as an alternative representation.
Given the initial exploratory analysis, it appears that n-gram models will provide a sound basis for a predictive model that suggests candidate next words while a sentence is being constructed.
To build the predictive model we will analyse additional portions of the corpus. The accuracy of the frequency matrix will need to be traded off against memory utilisation and processing time. Given that coverage is concentrated in the most frequently occurring terms, with a long tail of n-grams that occur only a handful of times, we may wish to discard n-grams that are seldom used, as sketched below.
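To illustrate the kind of pruning we have in mind, the snippet below is a minimal sketch: it drops n-grams whose counts fall below a cut-off from a named frequency vector such as `tdm.bigram.freq` built in the appendix. The function name and the cut-off of 2 are placeholders to be tuned against memory use and accuracy, not part of the final model.

```r
# Sketch only: discard rarely occurring n-grams from a named frequency vector
# (e.g. tdm.bigram.freq from the appendix). min.count = 2 is a placeholder.
prune.ngrams <- function(freqs, min.count = 2) {
  freqs[freqs >= min.count]
}

# Proportion of total n-gram occurrences retained after pruning singletons:
# sum(prune.ngrams(tdm.bigram.freq)) / sum(tdm.bigram.freq)
```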
Once the predictive model has been constructed, we will make it available as an interactive Shiny application that predicts the next word as text is entered, as illustrated below.
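As a preview of how such an application might work, the following is a minimal illustrative sketch rather than the final application. It assumes a bigram frequency table `bigrams` with columns `first`, `second` and `freq`, and simply suggests the most frequent follower of the last word typed; the tiny toy table is included only to make the sketch self-contained.

```r
library(shiny)

# Toy bigram table used only to make the sketch self-contained.
bigrams <- data.frame(first  = c("thanks", "thanks", "right"),
                      second = c("for", "to", "now"),
                      freq   = c(3, 1, 2),
                      stringsAsFactors = FALSE)

# Suggest the most frequent follower of the last word in the phrase.
predict.next <- function(phrase, bigrams) {
  last <- tail(strsplit(tolower(trimws(phrase)), "\\s+")[[1]], 1)
  candidates <- bigrams[bigrams$first == last, ]
  if (nrow(candidates) == 0) return("?")
  candidates$second[which.max(candidates$freq)]
}

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    predict.next(input$phrase, bigrams)
  })
}

shinyApp(ui = ui, server = server)
```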
The R source code used to perform the preliminary analysis follows:
library(downloader)
library(tm)
library(SnowballC)
library(stringi)
library(knitr)
library(wordcloud)
library(RWeka)
library(magrittr)
library(ggplot2)
if (!file.exists('data')) {
dir.create(file.path(getwd(), 'data'))
}
# Keep paths relative to the project root; the locale corpora are extracted
# beneath ./data/final by the unzip step below.
data.root <- "./data/final"
# initial data set
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
file <- "Coursera-SwiftKey.zip"
if (!file.exists(file)) {
download(url, file)
unzip(file)
}
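# Paths to the American English (en_US) corpus files.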
data.locale <- "en_US"
data.location <- paste(data.root, data.locale, sep="/")
blogs.file <- paste(data.location, "/", data.locale, ".blogs.txt", sep="")
news.file <- paste(data.location, "/", data.locale, ".news.txt", sep="")
twitter.file <- paste(data.location, "/", data.locale, ".twitter.txt", sep="")
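# Basic statistics (file size, line, word and character counts) for each file.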
blogs.size <- round(file.info(blogs.file)$size / 1024^2, digits=2)
blogs.lines <- readLines(blogs.file, skipNul = TRUE)
blogs.length <- c(length(blogs.lines))
blogs.words <- sum(stri_count_words(blogs.lines))
blogs.characters <- sum(nchar(blogs.lines))
news.size <- round(file.info(news.file)$size / 1024^2, digits=2)
news.lines <- readLines(news.file, skipNul = TRUE)
news.length <- c(length(news.lines))
news.words <- sum(stri_count_words(news.lines))
news.characters <- sum(nchar(news.lines))
twitter.size <- round(file.info(twitter.file)$size / 1024^2, digits=2)
twitter.lines <- readLines(twitter.file, skipNul = TRUE)
twitter.length <- c(length(twitter.lines))
twitter.words <- sum(stri_count_words(twitter.lines))
twitter.characters <- sum(nchar(twitter.lines))
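# Collate the per-file statistics into a summary table.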
summary.files <- c(basename(blogs.file), basename(news.file), basename(twitter.file))
summary.sizes <- c(blogs.size, news.size, twitter.size)
summary.lines <- c(blogs.length, news.length, twitter.length)
summary.words <- c(blogs.words, news.words, twitter.words)
summary.characters <- c(blogs.characters, news.characters, twitter.characters)
data.summary <- as.data.frame(cbind(summary.files, summary.sizes, summary.lines, summary.words, summary.characters))
colnames(data.summary) <- c("File", "Size (MB)", "Lines", "Words", "Characters")
kable(data.summary,
format="markdown",
caption="American English Corpora Summary Statistics")
corpus <- VCorpus(VectorSource(training.all))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
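# Unigram analysis: term-document matrix, word cloud and top-50 frequency plot.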
unigram.tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm.unigram <- TermDocumentMatrix(corpus, control = list(tokenize = unigram.tokenizer))
tdm.unigram <- removeSparseTerms(tdm.unigram, 0.9999)
wcloud <- as.matrix(tdm.unigram)
v <- sort(rowSums(wcloud),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(12, "Set3")
# Create unigram word cloud.
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=pal,scale=c(1.25, 0.9))
tdm.unigram.freq <- sort(rowSums(as.matrix(tdm.unigram)), decreasing=TRUE)
head(data.frame(word=names(tdm.unigram.freq), freq=tdm.unigram.freq), 50) %>%
ggplot(., aes(x=reorder(word, -freq),freq)) +
geom_bar(stat="identity",colour="blue",fill="blue") +
ggtitle("Unigrams with the highest frequencies") +
xlab("Unigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
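# Bigram analysis: term-document matrix, word cloud and top-50 frequency plot.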
bigram.tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = " \\n\\t\\r"))
tdm.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = bigram.tokenizer))
tdm.bigram <- removeSparseTerms(tdm.bigram , 0.9999)
wcloud <- as.matrix(tdm.bigram)
v <- sort(rowSums(wcloud),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(12, "Set3")
# Create bigram word cloud.
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=50, random.order=FALSE, rot.per=0.35,
colors=pal,scale=c(1.25, 0.9))
tdm.bigram.freq <- sort(rowSums(as.matrix(tdm.bigram)), decreasing=TRUE)
head(data.frame(word=names(tdm.bigram.freq), freq=tdm.bigram.freq), 50) %>%
ggplot(., aes(x=reorder(word, -freq),freq)) +
geom_bar(stat="identity",colour="blue",fill="blue") +
ggtitle("Bigrams with the highest frequencies") +
xlab("Bigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
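# Trigram analysis: term-document matrix, word cloud and top-50 frequency plot.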
trigram.tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = " \\n\\t\\r"))
tdm.trigram <- TermDocumentMatrix(corpus, control = list(tokenize=trigram.tokenizer))
tdm.trigram <- removeSparseTerms(tdm.trigram, 0.9999)
wcloud <- as.matrix(tdm.trigram)
v <- sort(rowSums(wcloud),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(12, "Paired")
# Create trigram word cloud.
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=50, random.order=FALSE, rot.per=0.35,
colors=pal,scale=c(1.25, 0.9))
tdm.trigram.freq <- sort(rowSums(as.matrix(tdm.trigram)), decreasing=TRUE)
head(data.frame(word=names(tdm.trigram.freq), freq=tdm.trigram.freq), 50) %>%
ggplot(., aes(x=reorder(word, -freq),freq)) +
geom_bar(stat="identity",colour="blue",fill="blue") +
ggtitle("Trigrams with the highest frequencies") +
xlab("Trigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))