Introduction

This report presents a milestone review covering the initial exploratory data analysis of the supplied SwiftKey corpora for the Coursera Data Science Capstone project. The purpose of this report is to:

  1. Demonstrate that the data has been downloaded and successfully loaded
  2. Present basic summary statistics about the data sets
  3. Detail any interesting findings
  4. Solicit feedback on the plans for creating a prediction algorithm and Shiny app using the corpora

Data Acquisition

The capstone dataset is a collection of corpora supplied for the Data Science Capstone in the Coursera Data Science Specialization. The dataset was obtained directly from the course website.

The dataset is provided as a single zip archive, Coursera-SwiftKey.zip.

Exploratory Analysis

Within the archive there is a corpus for each of the locales de_DE, en_US, fi_FI and ru_RU. For each locale a corpus consisting of text from blogs, news and Twitter sources is provided. While multiple corpora are provided, the focus of the initial analysis and exploration was on the American English (en_US) file set.

Once the corpus has been loaded, some basic statistics are gathered for each of the files. The RWeka package is then used to extract the most frequently occurring n-grams, which provide an overview of the commonly occurring n-grams in the data with a view to establishing the soundness of using n-gram frequencies as the basis for a predictive model.

| File              | Size (MB) | Lines   | Words    | Characters |
|-------------------|-----------|---------|----------|------------|
| en_US.blogs.txt   | 200.42    | 899288  | 38154238 | 208361438  |
| en_US.news.txt    | 196.28    | 77259   | 2693898  | 15683765   |
| en_US.twitter.txt | 159.36    | 2360148 | 30218166 | 162385035  |

Modelling

Generate Training Dataset

The supplied dataset is too large to perform analysis on in a timely fashion. To perform the initial analysis, a training set consisting of a random 1% sample of each of the individual files within the corpora was selected (a sketch of the sampling code is included in the appendix).

Create and Clean Corpus

To understand the data and use it to explore and build a predictive model, we first need to clean it. Starting from the sampled training set, we strip out material that does not directly contribute to the model by:

  1. Loading the initial corpus from the consolidated training set, which represents 1% of each of the files (blogs, news, twitter) in the corpus
  2. Removing all punctuation
  3. Removing numbers
  4. Canonicalizing the text by converting it to lowercase
  5. Removing English stop words
  6. Stripping whitespace from the corpus
  7. Flagging the corpus as a plain text document

After the sample has been cleaned and loaded, we perform additional analysis by decomposing the sample text into n-grams (1-, 2- and 3-grams) and examining their frequencies. For each n-gram size the top 50 terms by frequency are plotted, along with word clouds to give an alternate representation.

Unigrams

Bigrams

Trigrams

Next Steps

Given the initial exploratory analysis it appears that n-gram models will provide a sound basis for a predictive model to determine candidate next words when constructing a sentence.

To build the predictive model we will analyse additional portions of the corpus. The accuracy of the frequency matrix will be a trade-off against memory utilisation and processing time. Given the concentration of coverage in the most frequently occurring terms and the long tail of n-grams with a low number of occurrences, we may wish to discard n-grams that occur only rarely.
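
As an illustration only, such pruning could be as simple as applying a minimum-frequency cutoff to the unigram frequency table built in the appendix; the threshold of 5 below is an arbitrary example, not a tuned value.

# Keep only unigrams observed at least 5 times (illustrative threshold)
tdm.unigram.pruned <- tdm.unigram.freq[tdm.unigram.freq >= 5]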

Once we have constructed the predictive model, we will make it available in an interactive Shiny application that predicts the next word as text is entered.
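
A minimal sketch of how such an application could be wired together follows; the predict_next_word function is a hypothetical placeholder for the eventual n-gram model rather than the final implementation.

library(shiny)

# Hypothetical placeholder; the real predictor will be backed by the n-gram frequency tables
predict_next_word <- function(phrase) {
        "the"
}

ui <- fluidPage(
        textInput("phrase", "Enter a phrase:"),
        textOutput("prediction")
)

server <- function(input, output) {
        output$prediction <- renderText({
                predict_next_word(input$phrase)
        })
}

shinyApp(ui = ui, server = server)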

Appendix

The R source code used to perform the preliminary analysis follows:

Initialisation

library(downloader)
library(tm)
library(SnowballC)
library(stringi)
library(knitr)
library(wordcloud)
library(RColorBrewer)  # for the brewer.pal palettes used in the word clouds
library(RWeka)
library(magrittr)
library(ggplot2)

Acquire Data

if (!file.exists('data')) {
        dir.create(file.path(getwd(), 'data'))
}       

setwd(file.path(getwd(), 'data'))

# initial data set
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
file <- "Coursera-SwiftKey.zip"

if (!file.exists(file)) {
        download(url, file)
        unzip(file)
}

Exploratory Analysis

data.root <- "final"    # directory the zip archive extracts into; adjust if your layout differs
data.locale <- "en_US"
data.location <- paste(data.root, data.locale, sep="/")

blogs.file <- paste(data.location, "/", data.locale, ".blogs.txt", sep="")
news.file <- paste(data.location,  "/", data.locale, ".news.txt", sep="")
twitter.file <- paste(data.location,  "/", data.locale, ".twitter.txt", sep="")

blogs.size <- round(file.info(blogs.file)$size / 1024^2, digits=2)
blogs.lines <- readLines(blogs.file, skipNul = TRUE)
blogs.length <- c(length(blogs.lines))
blogs.words <- sum(stri_count_words(blogs.lines))
blogs.characters <- sum(nchar(blogs.lines))


news.size <- round(file.info(news.file)$size / 1024^2, digits=2)
news.lines <- readLines(news.file, skipNul = TRUE)
news.length <- c(length(news.lines))
news.words <- sum(stri_count_words(news.lines))
news.characters <- sum(nchar(news.lines))

twitter.size <- round(file.info(twitter.file)$size / 1024^2, digits=2)
twitter.lines <- readLines(twitter.file, skipNul = TRUE)
twitter.length <- c(length(twitter.lines))
twitter.words <- sum(stri_count_words(twitter.lines))
twitter.characters <- sum(nchar(twitter.lines))

summary.files <- c(basename(blogs.file), basename(news.file), basename(twitter.file))
summary.sizes <- c(blogs.size, news.size, twitter.size)
summary.lines <- c(blogs.length, news.length, twitter.length)
summary.words <- c(blogs.words, news.words, twitter.words)
summary.characters <- c(blogs.characters, news.characters, twitter.characters)

data.summary <- data.frame(summary.files, summary.sizes, summary.lines, summary.words, summary.characters)
colnames(data.summary) <- c("File", "Size (MB)", "Lines", "Words", "Characters")

kable(data.summary,
      format="markdown",
      caption="American English Corpora Summary Statistics")
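
Generate Training Dataset

The sampling code itself was not included in this appendix; the sketch below is a minimal illustration (the seed value and the use of sample() are assumptions) of how the 1% training sample could be drawn from each source file and consolidated into the training.all object used in the next section.

set.seed(1234)           # assumed seed, for reproducibility
sample.fraction <- 0.01  # 1% of each source file

blogs.training <- sample(blogs.lines, round(length(blogs.lines) * sample.fraction))
news.training <- sample(news.lines, round(length(news.lines) * sample.fraction))
twitter.training <- sample(twitter.lines, round(length(twitter.lines) * sample.fraction))

# Consolidated training set drawn from all three sources
training.all <- c(blogs.training, news.training, twitter.training)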

Create and Clean Corpus

# Build a corpus from the consolidated training sample and clean it
corpus <- VCorpus(VectorSource(training.all))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))  # canonicalize to lowercase
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)

Unigrams

# Tokenize single words, build a term-document matrix and drop extremely sparse terms
unigram.tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm.unigram <- TermDocumentMatrix(corpus, control = list(tokenize = unigram.tokenizer))
tdm.unigram <- removeSparseTerms(tdm.unigram, 0.9999)

wcloud <- as.matrix(tdm.unigram)
v <- sort(rowSums(wcloud),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(12, "Set3")

# Create unigram word cloud.
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=pal,scale=c(1.25, 0.9))

tdm.unigram.freq <- sort(rowSums(as.matrix(tdm.unigram)), decreasing=TRUE)  

head(data.frame(word=names(tdm.unigram.freq), freq=tdm.unigram.freq), 50) %>%
  ggplot(., aes(x=reorder(word, -freq),freq)) +
  geom_bar(stat="identity",colour="blue",fill="blue") +
  ggtitle("Unigrams with the highest frequencies") +
  xlab("Unigrams") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Bigrams

bigram.tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = " \\n\\t\\r"))

tdm.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = bigram.tokenizer))
tdm.bigram <- removeSparseTerms(tdm.bigram, 0.9999)


wcloud <- as.matrix(tdm.bigram)
v <- sort(rowSums(wcloud),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(12, "Set3")

# Create bigram word cloud.
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=50, random.order=FALSE, rot.per=0.35, 
          colors=pal,scale=c(1.25, 0.9))

tdm.bigram.freq <- sort(rowSums(as.matrix(tdm.bigram)), decreasing=TRUE)  

head(data.frame(word=names(tdm.bigram.freq), freq=tdm.bigram.freq), 50) %>%
  ggplot(., aes(x=reorder(word, -freq),freq)) +
  geom_bar(stat="identity",colour="blue",fill="blue") +
  ggtitle("Bigrams with the highest frequencies") +
  xlab("Bigrams") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Trigrams

trigram.tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = " \\n\\t\\r"))

tdm.trigram <- TermDocumentMatrix(corpus, control = list(tokenize=trigram.tokenizer))
tdm.trigram <- removeSparseTerms(tdm.trigram, 0.9999)

wcloud <- as.matrix(tdm.trigram)
v <- sort(rowSums(wcloud),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(12, "Paired")

# Create trigram word cloud.
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=50, random.order=FALSE, rot.per=0.35, 
          colors=pal,scale=c(1.25, 0.9))

tdm.trigram.freq <- sort(rowSums(as.matrix(tdm.trigram)), decreasing=TRUE)  

head(data.frame(word=names(tdm.trigram.freq), freq=tdm.trigram.freq), 50) %>%
  ggplot(., aes(x=reorder(word, -freq),freq)) +
  geom_bar(stat="identity",colour="blue",fill="blue") +
  ggtitle("Trigrams with the highest frequencies") +
  xlab("Trigrams") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))