Introduction

This report presents a brief exploratory data analysis of the data set that will be used to train a word prediction model using natural language processing techniques. The purpose of the word prediction model is to provide functionality in a text communication solution that predicts the next word the user is likely to write.

A basic summary of the data set statistics is shown below, along with word frequency plots and word clouds.

Load Data

The data for this project is from a corpus called HC Corpora and can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

The dataset consists of blog posts, news articles & Twitter tweets and comes in 4 different languages: English (US), Finnish, Russian & German.

In this report, we’ll only analyze the text files in English (US):

gcinfo(verbose = FALSE)

blogs <- readLines(file.path("data", "src", "en_US", "en_US.blogs.txt"))
news <- readLines(file.path("data", "src", "en_US", "en_US.news.txt"))
twitter <- readLines(file.path("data", "src", "en_US", "en_US.twitter.txt"), skipNul = TRUE)

Below is a basic summary of the statistics for the files:

File      LineCount   WordCount   AvgCharsPerLine
blogs       899288     37546246               230
news       1010242     34762395               201
twitter    2360148     30093410                69
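
A minimal sketch of how such per-file statistics can be computed (the whitespace-based word count is an approximation, not necessarily the exact method used for the table above):

file_stats <- function(lines) {
  c(LineCount = length(lines),
    WordCount = sum(lengths(strsplit(lines, "\\s+"))),
    AvgCharsPerLine = round(mean(nchar(lines))))
}

rbind(blogs = file_stats(blogs),
      news = file_stats(news),
      twitter = file_stats(twitter))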

As can be seen from the summary above, the data set is quite large. To speed up processing while still getting approximate results, we'll work with a subset obtained by randomly selecting 1% of the full data set.

Clean Sample Data

After a preliminary exploration of the data set, the following strategy was selected for cleaning and transforming the data in preparation for training the word prediction model:

library(tm)

# take a 1% random sample of the data set
set.seed(123)
sample_rate <- 0.01
blog_samples <- sample(blogs, floor(length(blogs) * sample_rate), replace=FALSE)
news_samples <- sample(news, floor(length(news) * sample_rate), replace=FALSE)
twitter_samples <- sample(twitter, floor(length(twitter) * sample_rate), replace=FALSE)

sample_data <- c(blog_samples, news_samples, twitter_samples)

rm(blogs, news, twitter)

# create corpus
corpus <- Corpus(VectorSource(sample_data))

# remove swear words
profanity <- c(t(read.csv("./swearWords.csv",header=F)))
corpus <- tm_map(corpus, removeWords, profanity)

# remove retweet markers (e.g. "RT" or "via" followed by @mentions)
corpus <- tm_map(corpus, content_transformer(function(x) gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)))

# remove @UserName mentions
corpus <- tm_map(corpus, content_transformer(function(x) gsub("@\\w+", "", x)))

# remove links
corpus <- tm_map(corpus, content_transformer(function(x) gsub("http\\w+", "", x)))

# lower case, remove punctuation, numbers and extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(trimws))

Sample Data Word Distribution

Next we tokenize the data and create n-gram frequency tables for unigrams, bigrams and trigrams to see the distribution of terms in the data set, and plot a histogram and word cloud for each n-gram.

An n-gram model is a type of probabilistic language model that predicts the next item in a sequence based on the previous items, in the form of an (n-1)-order Markov model.

library(tm)
library(RWeka)
library(parallel)

# work around the "invalid 'times' argument" error when RWeka is used with multiple cores on OS X
options(mc.cores=1)

# build a term-document matrix of n-grams for a given n
tdm <- function(corpus, n=1) {
  control <- list(tokenize=function(x) NGramTokenizer(x, Weka_control(min=n, max=n)))
  TermDocumentMatrix(corpus, control=control)
}

tdm_1 <- tdm(corpus, 1)
tdm_2 <- tdm(corpus, 2)
tdm_3 <- tdm(corpus, 3)

options(mc.cores=detectCores()-1)
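
To make the Markov-model idea concrete, here is a minimal sketch (not part of the original analysis pipeline) of the maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1), derived from the term-document matrices above:

library(slam)

# frequency of each unigram and bigram across the sample
uni_counts <- row_sums(tdm_1)
bi_counts <- row_sums(tdm_2)

# maximum-likelihood estimate of P(w2 | w1)
mle_prob <- function(w1, w2) {
  bigram <- paste(w1, w2)
  if (!(bigram %in% names(bi_counts)) || !(w1 %in% names(uni_counts))) return(0)
  bi_counts[[bigram]] / uni_counts[[w1]]
}

mle_prob("thanks", "for")  # how often "for" follows "thanks" in the sample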

Unigram

The distribution of the frequencies of each unigram (single word) can be seen below, and as expected we see a high concentration of stop words.

The word cloud also visualizes the word frequency distribution; here too, stop words dominate.
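
A sketch of how the unigram plots can be produced, analogous to the bigram and trigram code further below (the frequency cut-off of 50 and the top-25 selection are assumptions, not the exact settings used for the plots above):

library(ggplot2)
library(wordcloud)
library(slam)

# keep unigrams occurring more than 50 times, then take the 25 most frequent
termFreq <- rowSums(as.matrix(tdm_1[which(row_sums(tdm_1) > 50),]))
termFreqTop25 <- termFreq[head(order(termFreq, decreasing=T), 25)]
wf <- data.frame(word=names(termFreqTop25), freq=termFreqTop25)

ggplot(wf, aes(reorder(word, -freq), freq)) +
  geom_bar(stat="identity", fill="blue") +
  xlab("Word") +
  ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

# plot wordcloud
set.seed(123)
wordcloud(words=names(termFreq), freq=termFreq, max.words=150, random.order=F, colors=brewer.pal(6, "Dark2"))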

Bigram

A bigram is a sequence of two adjacent words; the distribution of the word pairs can be seen below in the histogram plot and word cloud.

library(ggplot2)
library(wordcloud)
library(slam)

# keep bigrams occurring more than 10 times, then take the 25 most frequent
termFreq <- rowSums(as.matrix(tdm_2[which(row_sums(tdm_2) > 10),]))
termFreqTop25 <- termFreq[head(order(termFreq, decreasing=T), 25)]
wf <- data.frame(word=names(termFreqTop25), freq=termFreqTop25)

ggplot(wf, aes(reorder(word, -freq), freq)) +
  geom_bar(stat="identity", fill="blue") +
  xlab("Word") +
  ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

# plot wordcloud
set.seed(123)
wordcloud(words=names(termFreq), freq=termFreq, max.words = 150, random.order=F, colors=brewer.pal(6, "Dark2"))

Trigram

A trigram is a sequence of three adjacent words, and the distribution of their frequencies can be seen below.

library(ggplot2)
library(wordcloud)
library(slam)

# keep trigrams occurring more than 2 times, then take the 25 most frequent
termFreq <- rowSums(as.matrix(tdm_3[which(row_sums(tdm_3) > 2),]))
termFreqTop25 <- termFreq[head(order(termFreq, decreasing=T), 25)]
wf <- data.frame(word=names(termFreqTop25), freq=termFreqTop25)

ggplot(wf, aes(reorder(word, -freq), freq)) +
  geom_bar(stat="identity", fill="blue") +
  xlab("Word") +
  ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

# plot wordcloud
set.seed(123)
wordcloud(words=names(termFreq), freq=termFreq, max.words = 50, colors=brewer.pal(6, "Dark2"))
## Warning in wordcloud(words = names(termFreq), freq = termFreq, max.words =
## 50, : thanks for the could not be fit on page. It will not be plotted.

Summary

As can be seen in the exploratory analysis, the so-called "stopwords" are among the most frequent words. We will still keep them, as they play an important role in the structure of a sentence.

The sample data set was randomly drawn from the raw data, and even though we only sampled 1%, building the n-grams still took a long time, especially for the higher-order n-grams. For training the final word prediction model, we therefore anticipate splitting the training data into multiple partitions, aggregating the n-gram counts within each partition, and then merging the aggregates into a final data set, similar to MapReduce.
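
A rough illustration of that partition/aggregate/merge idea (count_ngrams is a hypothetical helper and the chunk count of 8 is arbitrary; this is a sketch, not the final implementation):

library(RWeka)

# hypothetical helper: n-gram counts for one chunk of lines
count_ngrams <- function(lines, n=2) {
  tokens <- NGramTokenizer(paste(lines, collapse=" "), Weka_control(min=n, max=n))
  table(tokens)
}

# partition the sample into chunks and count n-grams per chunk ("map")
chunks <- split(sample_data, cut(seq_along(sample_data), breaks=8, labels=FALSE))
partial <- lapply(chunks, count_ngrams, n=2)

# merge the partial counts by summing over identical n-grams ("reduce")
flat <- unlist(unname(lapply(partial, function(x) setNames(as.numeric(x), names(x)))))
merged_counts <- tapply(flat, names(flat), sum)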

When training the final word prediction model, we will consider implementing smoothing techniques such as Good-Turing and add-one (Laplace) smoothing, which discount the probabilities of seen n-grams so that some probability mass is reserved for unseen n-grams, to increase prediction accuracy.
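
As a minimal illustration, add-one (Laplace) smoothing of the bigram probabilities could look like the sketch below, reusing the uni_counts and bi_counts vectors from the earlier sketch (Good-Turing is more involved and not shown):

# add-one (Laplace) smoothing:
# P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V), where V is the vocabulary size
V <- length(uni_counts)

laplace_prob <- function(w1, w2) {
  bigram <- paste(w1, w2)
  c_bi <- if (bigram %in% names(bi_counts)) bi_counts[[bigram]] else 0
  c_uni <- if (w1 %in% names(uni_counts)) uni_counts[[w1]] else 0
  (c_bi + 1) / (c_uni + V)  # unseen bigrams also get a small non-zero probability
}

laplace_prob("thanks", "for")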

We will also consider augmenting the word prediction model with backoff (falling back from trigram to bigram to unigram when a higher-order n-gram has not been seen) and with linear interpolation, as sketched below.
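
A minimal sketch of such a backoff scheme ("stupid backoff" with a fixed 0.4 penalty rather than a properly normalized model; predict_next is a hypothetical helper reusing the count vectors above, and linear interpolation is not shown):

tri_counts <- row_sums(tdm_3)

predict_next <- function(w1, w2, k=3) {
  # candidate trigrams starting with "w1 w2"
  cand <- tri_counts[grep(paste0("^", w1, " ", w2, " "), names(tri_counts))]
  if (length(cand) == 0) {
    # back off to bigrams starting with "w2", penalized by 0.4
    cand <- 0.4 * bi_counts[grep(paste0("^", w2, " "), names(bi_counts))]
  }
  if (length(cand) == 0) {
    # fall back to the most frequent unigrams
    cand <- 0.4 * 0.4 * uni_counts
  }
  top <- head(sort(cand, decreasing=TRUE), k)
  setNames(top, sub("^.* ", "", names(top)))  # keep only the predicted next word as the name
}

predict_next("thanks", "for")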