Introduction

This report presents a brief exploratory data analysis of the data set that will be used to train a word prediction model using natural language processing techniques. The purpose of the word prediction model is to provide functionality in a text communication solution that predicts the next word the user is likely to write.

A basic summary of the data set statistics is shown below, along with word frequency plots and word clouds.

Load Data

The data for this project is from a corpus called HC Corpora and can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

The dataset consists of blog posts, news articles & Twitter tweets and comes in 4 different languages: English (US), Finnish, Russian & German.

In this report, we’ll only analyze the text files in English (US):

gcinfo(verbose = FALSE)

blogs <- readLines(file.path("data", "src", "en_US", "en_US.blogs.txt"))
news <- readLines(file.path("data", "src", "en_US", "en_US.news.txt"))
twitter <- readLines(file.path("data", "src", "en_US", "en_US.twitter.txt"), skipNul = TRUE)

Below is a basic summary of the statistics for the files:

File      LineCount   WordCount   AvgCharsPerLine
blogs       899288     37546246               230
news       1010242     34762395               201
twitter    2360148     30093410                69
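
A minimal sketch of how such per-file statistics can be computed (the whitespace-based word count is an approximation, not necessarily the exact method used for the table above):

file_stats <- function(lines) {
  c(LineCount = length(lines),
    WordCount = sum(lengths(strsplit(lines, "\\s+"))),
    AvgCharsPerLine = round(mean(nchar(lines))))
}

rbind(blogs = file_stats(blogs),
      news = file_stats(news),
      twitter = file_stats(twitter))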

As can be seen from the summary above, the data set is quite large. To speed up processing while still getting approximate results, we'll work with a subset obtained by randomly selecting 1% of the full data set.

Clean Sample Data

After a preliminary exploration of the data set, the following strategy was selected for cleaning and transforming the data in preparation for training the word prediction model:

library(tm)

# take a 1% random sample of the data set
set.seed(123)
sample_rate <- 0.01
blog_samples <- sample(blogs, floor(length(blogs) * sample_rate), replace=FALSE)
news_samples <- sample(news, floor(length(news) * sample_rate), replace=FALSE)
twitter_samples <- sample(twitter, floor(length(twitter) * sample_rate), replace=FALSE)

sample_data <- c(blog_samples, news_samples, twitter_samples)

rm(blogs, news, twitter)

# create corpus
corpus <- Corpus(VectorSource(sample_data))

# remove swear words
profanity <- c(t(read.csv("./swearWords.csv",header=F)))
corpus <- tm_map(corpus, removeWords, profanity)

# remove retweet markers (e.g. "RT" or "via" followed by @mentions)
corpus <- tm_map(corpus, content_transformer(function(x) gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)))

# remove @UserName mentions
corpus <- tm_map(corpus, content_transformer(function(x) gsub("@\\w+", "", x)))

# remove links
corpus <- tm_map(corpus, content_transformer(function(x) gsub("http\\w+", "", x)))

# lower case, remove punctuation, numbers and extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(trimws))

Sample Data Word Distribution

Next we tokenize the data and create n-gram frequency tables for unigrams, bigrams and trigrams to see the distribution of terms in the data set, and plot a histogram and word cloud for each n-gram.

An n-gram model is a type of probabilistic language model that predicts the next item in a sequence based on the previous items, in the form of an (n-1)-order Markov model.

library(tm)
library(RWeka)
library(parallel)

# work around the "invalid 'times' argument" error when RWeka is used with multiple cores on OS X
options(mc.cores=1)

# build a term-document matrix of n-grams for a given n
tdm <- function(corpus, n=1) {
  control <- list(tokenize=function(x) NGramTokenizer(x, Weka_control(min=n, max=n)))
  TermDocumentMatrix(corpus, control=control)
}

tdm_1 <- tdm(corpus, 1)
tdm_2 <- tdm(corpus, 2)
tdm_3 <- tdm(corpus, 3)

options(mc.cores=detectCores()-1)
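
To make the Markov-model idea concrete, here is a minimal sketch (not part of the original analysis pipeline) of the maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1), derived from the term-document matrices above:

library(slam)

# frequency of each unigram and bigram across the sample
uni_counts <- row_sums(tdm_1)
bi_counts <- row_sums(tdm_2)

# maximum-likelihood estimate of P(w2 | w1)
mle_prob <- function(w1, w2) {
  bigram <- paste(w1, w2)
  if (!(bigram %in% names(bi_counts)) || !(w1 %in% names(uni_counts))) return(0)
  bi_counts[[bigram]] / uni_counts[[w1]]
}

mle_prob("thanks", "for")  # how often "for" follows "thanks" in the sample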

Unigram

The distribution of the frequencies of each unigram (single word) can be seen below, and as expected we see a high concentration of stop words.

The word cloud also visualizes the word frequency distribution; here too, stop words dominate.
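
A sketch of how the unigram plots can be produced, analogous to the bigram and trigram code further below (the frequency cut-off of 50 and the top-25 selection are assumptions, not the exact settings used for the plots above):

library(ggplot2)
library(wordcloud)
library(slam)

# keep unigrams occurring more than 50 times, then take the 25 most frequent
termFreq <- rowSums(as.matrix(tdm_1[which(row_sums(tdm_1) > 50),]))
termFreqTop25 <- termFreq[head(order(termFreq, decreasing=T), 25)]
wf <- data.frame(word=names(termFreqTop25), freq=termFreqTop25)

ggplot(wf, aes(reorder(word, -freq), freq)) +
  geom_bar(stat="identity", fill="blue") +
  xlab("Word") +
  ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

# plot wordcloud
set.seed(123)
wordcloud(words=names(termFreq), freq=termFreq, max.words=150, random.order=F, colors=brewer.pal(6, "Dark2"))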

Bigram

A bigram is a sequence of two adjacent words; the distribution of the word pairs can be seen below in the histogram plot and word cloud.

library(ggplot2)
library(wordcloud)
library(slam)

# keep bigrams occurring more than 10 times, then take the 25 most frequent
termFreq <- rowSums(as.matrix(tdm_2[which(row_sums(tdm_2) > 10),]))
termFreqTop25 <- termFreq[head(order(termFreq, decreasing=T), 25)]
wf <- data.frame(word=names(termFreqTop25), freq=termFreqTop25)

ggplot(wf, aes(reorder(word, -freq), freq)) +
  geom_bar(stat="identity", fill="blue") +
  xlab("Word") +
  ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

# plot wordcloud
set.seed(123)
wordcloud(words=names(termFreq), freq=termFreq, max.words = 150, random.order=F, colors=brewer.pal(6, "Dark2"))

Trigram

A trigram is a sequence of three adjacent words, and the distribution of their frequencies can be seen below.

library(ggplot2)
library(wordcloud)
library(slam)

# keep trigrams occurring more than 2 times, then take the 25 most frequent
termFreq <- rowSums(as.matrix(tdm_3[which(row_sums(tdm_3) > 2),]))
termFreqTop25 <- termFreq[head(order(termFreq, decreasing=T), 25)]
wf <- data.frame(word=names(termFreqTop25), freq=termFreqTop25)

ggplot(wf, aes(reorder(word, -freq), freq)) +
  geom_bar(stat="identity", fill="blue") +
  xlab("Word") +
  ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

# plot wordcloud
set.seed(123)
wordcloud(words=names(termFreq), freq=termFreq, max.words = 50, colors=brewer.pal(6, "Dark2"))
## Warning in wordcloud(words = names(termFreq), freq = termFreq, max.words =
## 50, : thanks for the could not be fit on page. It will not be plotted.

Summary

As can be seen in the exploratory analysis, the so-called "stopwords" are among the most frequent words. We will still keep them, as they play an important role in the structure of a sentence.

The sample data set was randomly drawn from the raw data, and even though we only sampled 1%, building the n-grams still took a long time, especially for the higher-order n-grams. For training the final word prediction model, we therefore anticipate splitting the training data into multiple partitions, aggregating the n-gram counts within each partition, and then merging the aggregates into a final data set, similar to MapReduce.
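
A rough illustration of that partition/aggregate/merge idea (count_ngrams is a hypothetical helper and the chunk count of 8 is arbitrary; this is a sketch, not the final implementation):

library(RWeka)

# hypothetical helper: n-gram counts for one chunk of lines
count_ngrams <- function(lines, n=2) {
  tokens <- NGramTokenizer(paste(lines, collapse=" "), Weka_control(min=n, max=n))
  table(tokens)
}

# partition the sample into chunks and count n-grams per chunk ("map")
chunks <- split(sample_data, cut(seq_along(sample_data), breaks=8, labels=FALSE))
partial <- lapply(chunks, count_ngrams, n=2)

# merge the partial counts by summing over identical n-grams ("reduce")
flat <- unlist(unname(lapply(partial, function(x) setNames(as.numeric(x), names(x)))))
merged_counts <- tapply(flat, names(flat), sum)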

When training the final word prediction model, we will consider implementing smoothing techniques such as Good-Turing and add-one (Laplace) smoothing, which discount the probabilities of seen n-grams so that some probability mass is reserved for unseen n-grams, to increase prediction accuracy.
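
As a minimal illustration, add-one (Laplace) smoothing of the bigram probabilities could look like the sketch below, reusing the uni_counts and bi_counts vectors from the earlier sketch (Good-Turing is more involved and not shown):

# add-one (Laplace) smoothing:
# P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V), where V is the vocabulary size
V <- length(uni_counts)

laplace_prob <- function(w1, w2) {
  bigram <- paste(w1, w2)
  c_bi <- if (bigram %in% names(bi_counts)) bi_counts[[bigram]] else 0
  c_uni <- if (w1 %in% names(uni_counts)) uni_counts[[w1]] else 0
  (c_bi + 1) / (c_uni + V)  # unseen bigrams also get a small non-zero probability
}

laplace_prob("thanks", "for")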

We will also consider augmenting the word prediction model with backoff (falling back from trigram to bigram to unigram when a higher-order n-gram has not been seen) and with linear interpolation, as sketched below.
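
A minimal sketch of such a backoff scheme ("stupid backoff" with a fixed 0.4 penalty rather than a properly normalized model; predict_next is a hypothetical helper reusing the count vectors above, and linear interpolation is not shown):

tri_counts <- row_sums(tdm_3)

predict_next <- function(w1, w2, k=3) {
  # candidate trigrams starting with "w1 w2"
  cand <- tri_counts[grep(paste0("^", w1, " ", w2, " "), names(tri_counts))]
  if (length(cand) == 0) {
    # back off to bigrams starting with "w2", penalized by 0.4
    cand <- 0.4 * bi_counts[grep(paste0("^", w2, " "), names(bi_counts))]
  }
  if (length(cand) == 0) {
    # fall back to the most frequent unigrams
    cand <- 0.4 * 0.4 * uni_counts
  }
  top <- head(sort(cand, decreasing=TRUE), k)
  setNames(top, sub("^.* ", "", names(top)))  # keep only the predicted next word as the name
}

predict_next("thanks", "for")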