Capstone Introduction

The goal of this capstone is to mimic the experience of being a data scientist. As a practicing data scientist, it is entirely common to get a messy data set, a vague question, and very little instruction on exactly how to analyze the data.

In this capstone we will apply data science in the area of natural language processing. The capstone is inspired by the SwiftKey app, which predicts the most probable next words as we type on our phones.

Capstone tasks

This course will be separated into 8 different tasks that cover the range of activities encountered by a practicing data scientist. The tasks are:

  • Understanding the problem

  • Data acquisition and cleaning

  • Exploratory analysis

  • Statistical modeling

  • Predictive modeling

  • Creative exploration

  • Creating a data product

  • Creating a short slide deck pitching your product

Capstone dataset

This training data, provided to get you started, will be the basis for most of the capstone.

In this exercise, we will use the English data set, but we may also consider the three other data sets in German, Russian, and Finnish.
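
A quick way to see which corpus files are available locally (the ../Data directory layout here is an assumption based on the paths used later in this report):

# list the corpus text files under the assumed ../Data directory;
# adjust the path to wherever the capstone data was unzipped
list.files("../Data", recursive = TRUE, pattern = "\\.txt$")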

Milestone Report

Note: This report mainly focuses on task 3 (Exploratory analysis) and, to a lesser extent, task 2 (Data acquisition and cleaning).

Getting and cleaning the data

Load the data

Load the English dataset:

blogs <- readLines("../Data/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("../Data/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("../Data/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")

Remove non-ASCII characters (a rough way to strip non-English text):

blogs <- iconv(blogs, "latin1", "ASCII", sub = "")
news <- iconv(news, "latin1", "ASCII", sub = "")
twitter <- iconv(twitter, "latin1", "ASCII", sub = "")
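
As an illustrative check (not part of the original pipeline), the conversion strips accented and other non-ASCII characters from a string:

# illustrative only: accented letters and other non-ASCII characters are
# dropped by the latin1-to-ASCII conversion with sub = ""
iconv("naïve café", "latin1", "ASCII", sub = "")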

Word counts and line counts

Word counts and line counts for each data set:

  • For the blogs data set:
library(stringr)

blogs.lines <- length(blogs)
blogs.words <- sum(str_count(string = blogs, pattern = '\\w+'))

  • For the news data set:
news.lines <- length(news)
news.words <- sum(str_count(string = news, pattern = '\\w+'))

  • For the twitter data set:
twitter.lines <- length(twitter)
twitter.words <- sum(str_count(string = twitter, pattern = '\\w+'))

Construct a data frame of the counts:

counting <- data.frame(LineCounts = c(blogs.lines, news.lines, twitter.lines),
                       WordCounts = c(blogs.words, news.words, twitter.words))
rownames(counting) <- c("blogs", "news", "twitter")

head(counting)
        LineCounts WordCounts
blogs       899288   37925076
news       1010242   35519674
twitter    2360148   30989107

Get sub-sample data

To build a model, we don’t need to use all of the data at once. A relatively small number of randomly selected lines can represent the entire data set. Since the data set is quite large, we use a sub-sample of only 1% of the whole data set.

Another reason not to use the entire data set is that the counts of very common words (e.g. the, and, …) grow as we add more data and come to dominate the n-gram vocabulary, lowering the chance that rarer words are included, whereas our goal is to keep the vocabulary as varied as possible.

set.seed(42)
SAMPLE.RATE = 0.01
data <- c(sample(blogs, length(blogs) * SAMPLE.RATE),
          sample(news, length(news) * SAMPLE.RATE),
          sample(twitter, length(twitter) * SAMPLE.RATE))
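
As a quick sanity check (illustrative, not part of the original analysis), the sub-sample should contain roughly 1% of the original lines:

# roughly 1% of the ~4.3 million original lines, i.e. about 43,000 lines
length(data)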

From now on, we’ll use the sub-sample data set (data) instead of the full data set.

Data preprocessing

Before performing EDA or building models, we should first do some preprocessing to clean the data. There are a few steps:

  • Create a corpus

  • Remove numbers

  • Remove punctuation

  • Strip whitespace

  • Convert to lowercase

library(tm)    # text mining
library(NLP)   # required by tm
library(dplyr) # provides the %>% pipe

corpus <- VCorpus(VectorSource(data)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>%
  tm_map(content_transformer(tolower))
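
To verify that the cleaning worked, we can look at one of the cleaned documents (an illustrative check; the exact text depends on the random sample):

# illustrative check: numbers, punctuation and extra whitespace should be
# gone, and the text should be lowercase
content(corpus[[1]])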

Exploratory Data Analysis

Understanding frequencies of words and word pairs using N-gram models

First, we define the n-gram tokenizers for uni-grams, bi-grams, and tri-grams:

library(RWeka)
uni.gram.tok <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bi.gram.tok <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
tri.gram.tok <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
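
A toy example (illustrative) of what the bi-gram tokenizer produces:

# illustrative: the bi-gram tokenizer splits a sentence into overlapping
# two-word sequences, e.g. "this is", "is a", "a short", "short example"
bi.gram.tok("this is a short example")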

Next, we’ll build the n-gram term-document matrices from our corpus:

uni.gram.matrix <- TermDocumentMatrix(corpus, control=list(tokenize=uni.gram.tok))
bi.gram.matrix <- TermDocumentMatrix(corpus, control=list(tokenize=bi.gram.tok))
tri.gram.matrix <- TermDocumentMatrix(corpus, control=list(tokenize=tri.gram.tok))

The n-gram matrices above are term-document matrices: each row corresponds to a word or sequence of words (an n-gram), and each column corresponds to a line of our sub-sample data. Each cell holds the number of times that n-gram appears in the corresponding line.
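
This structure can be confirmed by checking the dimensions and inspecting a small slice of the uni-gram matrix (illustrative; the exact terms depend on the sample):

# rows are terms (uni-grams), columns are documents (lines of the sub-sample)
dim(uni.gram.matrix)
inspect(uni.gram.matrix[1:5, 1:5])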

Finally, we build frequency tables of the words and word pairs and sort them by number of occurrences in descending order. Since there are many words and word pairs to process (which takes a long time), we only keep those that meet a frequency threshold; n-grams that occur less often are omitted:

  • Uni-gram matrix with a threshold of 100 occurrences:
uni.gram.threshold <- findFreqTerms(uni.gram.matrix, lowfreq=100)

# sum frequencies across all lines in the sub-sample
uni.gram <- rowSums(as.matrix(uni.gram.matrix[uni.gram.threshold,]))

# sort the frequency
uni.gram <- sort(uni.gram, decreasing = TRUE)

# create data frame
uni.gram <- data.frame(Word=names(uni.gram), frequency=uni.gram)

head(uni.gram)
     Word frequency
the   the     48341
and   and     24158
for   for     11061
that that     10322
you   you      9362
with with      7183
  • Bi-gram matrix with a threshold of 10 occurrences:
# only keep the word pairs that occur at least 10 times
bi.gram.threshold <- findFreqTerms(bi.gram.matrix, lowfreq=10)

# sum frequencies across all lines in the sub-sample
bi.gram <- rowSums(as.matrix(bi.gram.matrix[bi.gram.threshold,]))

# sort the frequency
bi.gram <- sort(bi.gram, decreasing = TRUE)

# create data frame
bi.gram <- data.frame(Word=names(bi.gram), frequency=bi.gram)

head(bi.gram)
           Word frequency
of the   of the      4293
in the   in the      4219
to the   to the      2193
for the for the      2018
on the   on the      1889
to be     to be      1662
  • Tri-gram matrix with a threshold of 10 occurrences:
# only keep the word triples that occur at least 10 times
tri.gram.threshold <- findFreqTerms(tri.gram.matrix, lowfreq=10)

# sum frequencies across all lines in the sub-sample
tri.gram <- rowSums(as.matrix(tri.gram.matrix[tri.gram.threshold,]))

# sort the frequency
tri.gram <- sort(tri.gram, decreasing = TRUE)

# create data frame
tri.gram <- data.frame(Word=names(tri.gram), frequency=tri.gram)

head(tri.gram)
                         Word frequency
one of the         one of the       323
a lot of             a lot of       300
thanks for the thanks for the       256
to be a               to be a       203
going to be       going to be       191
i want to           i want to       176

Visualizing the n-gram frequencies

  • Plot a word cloud of uni-grams (single words) with a minimum frequency of 500:
library(wordcloud)
wordcloud(words = uni.gram$Word, 
          freq = uni.gram$frequency, 
          min.freq = 500, 
          random.order = FALSE, 
          scale = c(3, 0.5), 
          colors = rainbow(3))

  • Plot the frequency distribution of uni-grams:
library(ggplot2)
p <- ggplot(uni.gram, aes(x=frequency)) + 
  geom_histogram()
p

  • Plot the frequency distribution of bi-grams:
p <- ggplot(bi.gram, aes(x=frequency)) + 
  geom_histogram()
p

  • Plot the frequency distribution of tri-grams:
p <- ggplot(tri.gram, aes(x=frequency)) + 
  geom_histogram()
p

Save the n-gram matrices

Save the uni-gram, bi-gram and tri-gram matrices:

write.csv(uni.gram, file="../Data/n-gram/uni-gram.csv", row.names=FALSE)
write.csv(bi.gram, file="../Data/n-gram/bi-gram.csv", row.names=FALSE)
write.csv(tri.gram, file="../Data/n-gram/tri-gram.csv", row.names=FALSE)
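
These CSV files can be reloaded in the later tasks. As a rough preview of the predictive modeling step (a minimal sketch, not part of this milestone's analysis), the saved bi-gram table can already be used to look up the most frequent continuations of a given word:

# minimal sketch: load the saved bi-gram table and return the most frequent
# bi-grams starting with a given word (the function name is illustrative)
bi.grams <- read.csv("../Data/n-gram/bi-gram.csv", stringsAsFactors = FALSE)

top.next.words <- function(word, n = 3) {
  matches <- bi.grams[grepl(paste0("^", word, " "), bi.grams$Word), ]
  head(matches[order(matches$frequency, decreasing = TRUE), ], n)
}

top.next.words("to")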