This report presents a milestone review covering the initial exploratory data analysis of the supplied SwiftKey corpora for the Coursera Data Science Capstone project. The purpose of the report is to summarise the basic characteristics of the data, present the findings of the exploratory analysis, and outline the planned approach for the predictive model and the accompanying Shiny application.
The capstone dataset is a collection of corpora supplied for the Data Science Capstone in the Coursera Data Science Specialization; it was obtained directly from the course website.
The dataset is provided as a single zip archive.
Within the archive there is a corpus for each of the locales de_DE, en_US, fi_FI and ru_RU. For each locale, the corpus consists of text from blogs, news and Twitter sources. Although multiple corpora are provided, the initial analysis and exploration focuses on the American English (en_US) file set.
Once the corpus has been loaded, some basic statistics are gathered for each of the files. The RWeka package is then used to extract the most frequently occurring n-grams, which provide an overview of the common n-grams in the data, with a view to establishing the soundness of using n-gram frequencies as the basis for a predictive model.
| File | Size (MB) | Lines | Words | Characters |
|---|---|---|---|---|
| en_US.blogs.txt | 200.42 | 899288 | 38154238 | 208361438 |
| en_US.news.txt | 196.28 | 77259 | 2693898 | 15683765 |
| en_US.twitter.txt | 159.36 | 2360148 | 30218166 | 162385035 |
The supplied dataset is too large to analyse in full in a timely fashion. For the initial analysis, a training set was selected consisting of a random sample of 1% of the lines in each of the en_US files (the sampling step is sketched in the appendix code).
To understand the data and use it to explore and build a predictive model, we first need to clean it. Using the sampled training set as a starting point, we strip out content that does not directly contribute to the analysis by:

- removing punctuation
- removing numbers
- converting all text to lower case
- removing English stop words
- stripping excess whitespace
- converting the result to plain text documents
After the sample has been cleaned and loaded, we perform additional analysis by decomposing the sample text into n-grams (unigrams, bigrams and trigrams) and examining their frequencies. For each n-gram size, the top 50 terms by frequency are plotted, along with word clouds as an alternative representation.
Given the initial exploratory analysis, it appears that n-gram models will provide a sound basis for a predictive model that suggests candidate next words while a sentence is being constructed.
To build the predictive model we will analyse additional portions of the corpus. The accuracy of the frequency matrix will need to be traded off against memory utilisation and processing time. Given that coverage is concentrated in the most frequently occurring terms, with a long tail of n-grams that occur only a handful of times, we may wish to discard n-grams that are seldom used, as sketched below.
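To illustrate the kind of pruning we have in mind, the snippet below is a minimal sketch: it drops n-grams whose counts fall below a cut-off from a named frequency vector such as `tdm.bigram.freq` built in the appendix. The function name and the cut-off of 2 are placeholders to be tuned against memory use and accuracy, not part of the final model.

```r
# Sketch only: discard rarely occurring n-grams from a named frequency vector
# (e.g. tdm.bigram.freq from the appendix). min.count = 2 is a placeholder.
prune.ngrams <- function(freqs, min.count = 2) {
  freqs[freqs >= min.count]
}

# Proportion of total n-gram occurrences retained after pruning singletons:
# sum(prune.ngrams(tdm.bigram.freq)) / sum(tdm.bigram.freq)
```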
Once the predictive model has been constructed, we will make it available as an interactive Shiny application that predicts the next word as text is entered, as illustrated below.
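As a preview of how such an application might work, the following is a minimal illustrative sketch rather than the final application. It assumes a bigram frequency table `bigrams` with columns `first`, `second` and `freq`, and simply suggests the most frequent follower of the last word typed; the tiny toy table is included only to make the sketch self-contained.

```r
library(shiny)

# Toy bigram table used only to make the sketch self-contained.
bigrams <- data.frame(first  = c("thanks", "thanks", "right"),
                      second = c("for", "to", "now"),
                      freq   = c(3, 1, 2),
                      stringsAsFactors = FALSE)

# Suggest the most frequent follower of the last word in the phrase.
predict.next <- function(phrase, bigrams) {
  last <- tail(strsplit(tolower(trimws(phrase)), "\\s+")[[1]], 1)
  candidates <- bigrams[bigrams$first == last, ]
  if (nrow(candidates) == 0) return("?")
  candidates$second[which.max(candidates$freq)]
}

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    predict.next(input$phrase, bigrams)
  })
}

shinyApp(ui = ui, server = server)
```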
The R source code used to perform the preliminary analysis follows:
library(downloader)
library(tm)
library(SnowballC)
library(stringi)
library(knitr)
library(wordcloud)
library(RWeka)
library(magrittr)
library(ggplot2)
if (!file.exists('data')) {
dir.create(file.path(getwd(), 'data'))
}
# Keep paths relative to the project root; the locale corpora are extracted
# beneath ./data/final by the unzip step below.
data.root <- "./data/final"
# initial data set
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
file <- "Coursera-SwiftKey.zip"
if (!file.exists(file)) {
download(url, file)
unzip(file)
}
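# Paths to the American English (en_US) corpus files.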
data.locale <- "en_US"
data.location <- paste(data.root, data.locale, sep="/")
blogs.file <- paste(data.location, "/", data.locale, ".blogs.txt", sep="")
news.file <- paste(data.location, "/", data.locale, ".news.txt", sep="")
twitter.file <- paste(data.location, "/", data.locale, ".twitter.txt", sep="")
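# Basic statistics (file size, line, word and character counts) for each file.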
blogs.size <- round(file.info(blogs.file)$size / 1024^2, digits=2)
blogs.lines <- readLines(blogs.file, skipNul = TRUE)
blogs.length <- c(length(blogs.lines))
blogs.words <- sum(stri_count_words(blogs.lines))
blogs.characters <- sum(nchar(blogs.lines))
news.size <- round(file.info(news.file)$size / 1024^2, digits=2)
news.lines <- readLines(news.file, skipNul = TRUE)
news.length <- c(length(news.lines))
news.words <- sum(stri_count_words(news.lines))
news.characters <- sum(nchar(news.lines))
twitter.size <- round(file.info(twitter.file)$size / 1024^2, digits=2)
twitter.lines <- readLines(twitter.file, skipNul = TRUE)
twitter.length <- c(length(twitter.lines))
twitter.words <- sum(stri_count_words(twitter.lines))
twitter.characters <- sum(nchar(twitter.lines))
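# Collate the per-file statistics into a summary table.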
summary.files <- c(basename(blogs.file), basename(news.file), basename(twitter.file))
summary.sizes <- c(blogs.size, news.size, twitter.size)
summary.lines <- c(blogs.length, news.length, twitter.length)
summary.words <- c(blogs.words, news.words, twitter.words)
summary.characters <- c(blogs.characters, news.characters, twitter.characters)
data.summary <- as.data.frame(cbind(summary.files, summary.sizes, summary.lines, summary.words, summary.characters))
colnames(data.summary) <- c("File", "Size (MB)", "Lines", "Words", "Characters")
kable(data.summary,
format="markdown",
caption="American English Corpora Summary Statistics")
corpus <- VCorpus(VectorSource(training.all))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
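# Unigram analysis: term-document matrix, word cloud and top-50 frequency plot.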
unigram.tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm.unigram <- TermDocumentMatrix(corpus, control = list(tokenize = unigram.tokenizer))
tdm.unigram <- removeSparseTerms(tdm.unigram, 0.9999)
wcloud <- as.matrix(tdm.unigram)
v <- sort(rowSums(wcloud),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(12, "Set3")
# Create unigram word cloud.
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=pal,scale=c(1.25, 0.9))
tdm.unigram.freq <- sort(rowSums(as.matrix(tdm.unigram)), decreasing=TRUE)
head(data.frame(word=names(tdm.unigram.freq), freq=tdm.unigram.freq), 50) %>%
ggplot(., aes(x=reorder(word, -freq),freq)) +
geom_bar(stat="identity",colour="blue",fill="blue") +
ggtitle("Unigrams with the highest frequencies") +
xlab("Unigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
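# Bigram analysis: term-document matrix, word cloud and top-50 frequency plot.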
bigram.tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = " \\n\\t\\r"))
tdm.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = bigram.tokenizer))
tdm.bigram <- removeSparseTerms(tdm.bigram , 0.9999)
wcloud <- as.matrix(tdm.bigram)
v <- sort(rowSums(wcloud),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(12, "Set3")
# Create bigram word cloud.
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=50, random.order=FALSE, rot.per=0.35,
colors=pal,scale=c(1.25, 0.9))
tdm.bigram.freq <- sort(rowSums(as.matrix(tdm.bigram)), decreasing=TRUE)
head(data.frame(word=names(tdm.bigram.freq), freq=tdm.bigram.freq), 50) %>%
ggplot(., aes(x=reorder(word, -freq),freq)) +
geom_bar(stat="identity",colour="blue",fill="blue") +
ggtitle("Bigrams with the highest frequencies") +
xlab("Bigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
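# Trigram analysis: term-document matrix, word cloud and top-50 frequency plot.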
trigram.tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = " \\n\\t\\r"))
tdm.trigram <- TermDocumentMatrix(corpus, control = list(tokenize=trigram.tokenizer))
tdm.trigram <- removeSparseTerms(tdm.trigram, 0.9999)
wcloud <- as.matrix(tdm.trigram)
v <- sort(rowSums(wcloud),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(12, "Paired")
# Create trigram word cloud.
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=50, random.order=FALSE, rot.per=0.35,
colors=pal,scale=c(1.25, 0.9))
tdm.trigram.freq <- sort(rowSums(as.matrix(tdm.trigram)), decreasing=TRUE)
head(data.frame(word=names(tdm.trigram.freq), freq=tdm.trigram.freq), 50) %>%
ggplot(., aes(x=reorder(word, -freq),freq)) +
geom_bar(stat="identity",colour="blue",fill="blue") +
ggtitle("Trigrams with the highest frequencies") +
xlab("Trigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))