The goal of this project is to familiarize myself with the SwiftKey company dataset (can be downloaded here). The project itself consists of developing a predictive text model (Predictive Text Analysis). Throughout this report, I will work through several steps: loading and sampling the data, preprocessing and cleaning it, building n-gram tokens, and exploring word frequencies with word clouds and frequency plots.
There are basically three core packages from R that are used in this project:
tm: the “Text Mining” package, useful for general text-processing tasks.
wordcloud2: easier and neater visualization with word clouds.
quanteda: quantitative analysis of textual data.
library(knitr)
library(ggplot2)
library(tm)
library(wordcloud2)
library(quanteda)
library(ngram)
set.seed(101)
If you extract the downloaded zip file, you will find several folders and files. We are particularly interested in the English one (en_US). Inside it, there are three text files named en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt; these are the files that we will be using in this project. Due to the size of the data and memory limits, I only sample around 1,000–2,000 lines from each file. Loading every line from all three files would use up a very large amount of memory.
blog_path <- './final/en_US/en_US.blogs.txt'
news_path <- './final/en_US/en_US.news.txt'
twitter_path <- './final/en_US/en_US.twitter.txt'
sampling <- TRUE
if (sampling) {
# Pick a random cut-off between 1,000 and 2,000 lines for this run
n_samples <- as.integer(runif(1, 1000, 2000))
blog <- readLines(blog_path, n = n_samples, encoding='UTF-8')
news <- readLines(news_path, n = n_samples, encoding='UTF-8')
twitter <- readLines(twitter_path, n = n_samples, encoding='UTF-8')
} else {
blog <- readLines(blog_path, encoding='UTF-8')
news <- readLines(news_path, encoding='UTF-8')
twitter <- readLines(twitter_path, encoding='UTF-8')
}
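Note that readLines(n = n_samples) returns the first n_samples lines of each file rather than a random sample. If a truly random sample is preferred, one option is a small helper like the sketch below (hypothetical and not used in this report; it reads each file fully once, so it temporarily uses more memory):
# Hypothetical helper: read every line, then keep a random subset of n of them
sample_lines <- function(path, n) {
  all_lines <- readLines(path, encoding = 'UTF-8', skipNul = TRUE)
  sample(all_lines, min(n, length(all_lines)))
}
# e.g. blog <- sample_lines(blog_path, n_samples)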
Let’s see how much memory each sampled file occupies after the previous step, along with its number of lines and words:
blog_corpus <- corpus(blog)
news_corpus <- corpus(news)
twitter_corpus <- corpus(twitter)
corpus_collection <- data.frame("File" = c("Blogs", "News", "Twitter"),
"File Size" = sapply(list(blog, news, twitter), function(x){format(object.size(x),"MB")}),
"Number of lines" = sapply(list(blog_corpus, news_corpus, twitter_corpus), function(x){ndoc(x)}),
"Number of Words" = sapply(list(blog, news, twitter), function(x){wordcount(x)})
)
corpus_collection
## File File.Size Number.of.lines Number.of.Words
## 1 Blogs 0.4 Mb 1372 56452
## 2 News 0.3 Mb 1372 45502
## 3 Twitter 0.2 Mb 1372 17540
Before proceeding to the data analysis part, we need to preprocess and clean the data. This step includes, but is not limited to, tokenizing the text, removing punctuation, and filtering out common English stopwords.
Next, we can create an entire “corpus” for our word database. Since the final goal of the project is to build a predictive model for the SwiftKey keyboard, the n-gram model is a great starting point. An “n-gram” is a sequence of n consecutive words. For example, if we take the sentence “The big brown fox jumps over the lazy dog” and set n = 3, we get a 3-gram (trigram) model: the corpus turns into something like “The big brown”, “big brown fox”, “brown fox jumps”, and so forth. As a quick illustration, here is what quanteda’s tokens_ngrams produces for that example sentence (the sentence is only for demonstration and is not part of the sampled data):
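example_sentence <- "The big brown fox jumps over the lazy dog"
# quanteda joins the words of each n-gram with "_" by default
tokens_ngrams(tokens(example_sentence), n = 3)
Let’s look at the “blog” corpus first.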
blog_token <- tokens(blog_corpus, remove_punct=T)
blog_token <- tokens_remove(blog_token, pattern=stopwords('english'))
unigrams <- tokens_ngrams(blog_token, n=1)
top_unigrams <- topfeatures(dfm(unigrams), 20)
bigrams <- tokens_ngrams(blog_token, n=2)
top_bigrams <- topfeatures(dfm(bigrams), 20)
trigrams <- tokens_ngrams(blog_token, n=3)
top_trigrams <- topfeatures(dfm(trigrams), 20)
Here, I cap n at three. We will see why in the data analysis part.
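One quick way to see the trade-off is to compare how many distinct n-grams each order produces on the sampled blog tokens (illustrative only; the exact counts depend on which lines were sampled):
# Number of unique features at each n-gram order
sapply(list(unigrams, bigrams, trigrams), function(x) nfeat(dfm(x)))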
A word cloud is an easy way to see which words or word pairs appear frequently in the text. Another way is to look at a frequency bar plot. To avoid clutter, I only show the top 20 terms that appear in each text.
temp = data.frame(word=names(top_unigrams), freq=top_unigrams)
wordcloud2(temp)
par(mar=c(4,4,4,4))
barplot(height = top_unigrams, names.arg = names(top_unigrams),
las = 2, main = "Most Common Uni-gram Frequency on Blog files")
temp = data.frame(word=names(top_bigrams), freq=top_bigrams)
wordcloud2(temp)
par(mar=c(8,4,4,4))
barplot(height = top_bigrams, names.arg = names(top_bigrams),
las = 2, main = "Most Common Bi-gram Frequency on Blog files")
temp = data.frame(word=names(top_trigrams), freq=top_trigrams)
wordcloud2(temp)
par(mar=c(12,4,4,4))
barplot(height = top_trigrams, names.arg = names(top_trigrams),
las = 2, main = "Most Common Tri-gram Frequency on Blog files")
Now, let’s look at the “news” corpus.
news_token <- tokens(news_corpus, remove_punct=T)
news_token <- tokens_remove(news_token, pattern=stopwords('english'))
unigrams <- tokens_ngrams(news_token, n=1)
top_unigrams <- topfeatures(dfm(unigrams), 20)
bigrams <- tokens_ngrams(news_token, n=2)
top_bigrams <- topfeatures(dfm(bigrams), 20)
trigrams <- tokens_ngrams(news_token, n=3)
top_trigrams <- topfeatures(dfm(trigrams), 20)
temp = data.frame(word=names(top_unigrams), freq=top_unigrams)
wordcloud2(temp)
par(mar=c(4,4,4,4))
barplot(height = top_unigrams, names.arg = names(top_unigrams),
las = 2, main = "Most Common Uni-gram Frequency on News files")
temp = data.frame(word=names(top_bigrams), freq=top_bigrams)
wordcloud2(temp)
par(mar=c(8,4,4,4))
barplot(height = top_bigrams, names.arg = names(top_bigrams),
las = 2, main = "Most Common Bi-gram Frequency on News files")
temp = data.frame(word=names(top_trigrams), freq=top_trigrams)
wordcloud2(temp)
par(mar=c(12,4,4,4))
barplot(height = top_trigrams, names.arg = names(top_trigrams),
las = 2, main = "Most Common Tri-gram Frequency on News files")
Finally, the “twitter” corpus.
twitter_token <- tokens(twitter_corpus, remove_punct=T)
twitter_token <- tokens_remove(twitter_token, pattern=stopwords('english'))
unigrams <- tokens_ngrams(twitter_token, n=1)
top_unigrams <- topfeatures(dfm(unigrams), 20)
bigrams <- tokens_ngrams(twitter_token, n=2)
top_bigrams <- topfeatures(dfm(bigrams), 20)
trigrams <- tokens_ngrams(twitter_token, n=3)
top_trigrams <- topfeatures(dfm(trigrams), 20)
temp = data.frame(word=names(top_unigrams), freq=top_unigrams)
wordcloud2(temp)
par(mar=c(4,4,4,4))
barplot(height = top_unigrams, names.arg = names(top_unigrams),
las = 2, main = "Most Common Uni-gram Frequency on Twitter files")
temp = data.frame(word=names(top_bigrams), freq=top_bigrams)
wordcloud2(temp)
par(mar=c(8,4,4,4))
barplot(height = top_bigrams, names.arg = names(top_bigrams),
las = 2, main = "Most Common Bi-gram Frequency on Twitter files")
temp = data.frame(word=names(top_trigrams), freq=top_trigrams)
wordcloud2(temp)
par(mar=c(12,4,4,4))
barplot(height = top_trigrams, names.arg = names(top_trigrams),
las = 2, main = "Most Common Tri-gram Frequency on Twitter files")