The goal of this capstone is to mimic the experience of being a data scientist. As a practicing data scientist, it is entirely common to receive a messy data set, a vague question, and very little instruction on exactly how to analyze the data.
In this capstone we apply data science to natural language processing. The capstone is inspired by the SwiftKey app, which predicts the most probable next word as we type a sentence on our phones.
This course is divided into 8 tasks that cover the range of activities a practicing data scientist encounters. The tasks are:
Understanding the problem
Data acquisition and cleaning
Exploratory analysis
Statistical modeling
Predictive modeling
Creative exploration
Creating a data product
Creating a short slide deck pitching your product
The training data that will be the basis for most of the capstone consists of text collected from blogs, news articles and Twitter, provided in four languages. In this exercise we use the English database, though the German, Russian and Finnish databases may also be considered later.
Note: this report mainly focuses on the exploratory analysis task, together with a small amount of data acquisition and cleaning.
Load the English dataset:
blogs <- readLines("../Data/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("../Data/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("../Data/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
Remove non-ASCII characters, as a rough way of filtering out non-English text:
blogs   <- iconv(blogs, "latin1", "ASCII", sub = "")
news    <- iconv(news, "latin1", "ASCII", sub = "")
twitter <- iconv(twitter, "latin1", "ASCII", sub = "")
Word counts and line counts for each data set:
library(stringr)
blogs.lines   <- length(blogs)
blogs.words   <- sum(str_count(string = blogs, pattern = '\\w+'))
news.lines    <- length(news)
news.words    <- sum(str_count(string = news, pattern = '\\w+'))
twitter.lines <- length(twitter)
twitter.words <- sum(str_count(string = twitter, pattern = '\\w+'))
Construct a data frame of the counts:
counting <- data.frame(LineCounts = c(blogs.lines, news.lines, twitter.lines),
WordCounts = c(blogs.words, news.words, twitter.words))
rownames(counting) <- c("blogs", "news", "twitter")
head(counting)
        LineCounts WordCounts
blogs       899288   37925076
news       1010242   35519674
twitter    2360148   30989107
To build a model, we do not need to use all of the data at once; a relatively small set of randomly selected lines can represent the entire data set. Since the data set is quite large, we work with a sub-sample of 1% of the whole data set.
Another reason not to use the entire data set is that the counts of very common words (e.g. "the", "and") grow as more data is added and come to dominate the n-gram vocabulary, lowering the chance that rarer words are included, whereas our goal is to keep the vocabulary as varied as possible.
set.seed(42)
SAMPLE.RATE = 0.01
data <- c(sample(blogs, length(blogs) * SAMPLE.RATE),
sample(news, length(news) * SAMPLE.RATE),
sample(twitter, length(twitter) * SAMPLE.RATE))
From now on, we work with the sub-sampled object data instead of the full data set.
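As a quick sanity check, the sub-sample should contain roughly 1% of the original lines:
length(data)  # roughly 0.01 * (899288 + 1010242 + 2360148), i.e. about 42,700 lines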
Before performing EDA or building models, we first preprocess the data to clean it up. The cleaning steps are:
Create a corpus
Remove numbers
Remove punctuation
Strip whitespace
Convert to lowercase
library(tm) # Text mining
library(NLP)
library(dplyr)
corpus <- VCorpus(VectorSource(data)) %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace) %>%
tm_map(content_transformer(tolower))
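To verify that the cleaning worked as intended, we can compare a raw line from the sub-sample with its cleaned counterpart (the first line is chosen arbitrarily):
substr(data[1], 1, 80)                    # raw text
substr(as.character(corpus[[1]]), 1, 80)  # the same line after cleaning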
First, we define n-gram tokenizers for uni-grams, bi-grams and tri-grams:
library(RWeka)
uni.gram.tok <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bi.gram.tok <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
tri.gram.tok <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
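For example, applying the bi-gram tokenizer to a short sentence should return the overlapping word pairs:
bi.gram.tok("this is a short example")
# expected: "this is"  "is a"  "a short"  "short example"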
Next, we build the n-gram term-document matrices from our corpus:
uni.gram.matrix <- TermDocumentMatrix(corpus, control=list(tokenize=uni.gram.tok))
bi.gram.matrix <- TermDocumentMatrix(corpus, control=list(tokenize=bi.gram.tok))
tri.gram.matrix <- TermDocumentMatrix(corpus, control=list(tokenize=tri.gram.tok))
In these n-gram matrices, the rows correspond to the words or word sequences, and the columns correspond to the lines of our sub-sample data. Each cell holds the number of times a word or word sequence appears in the corresponding line.
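To get a feel for this structure, we can look at the dimensions of the uni-gram matrix and a small, arbitrarily chosen slice of it:
dim(uni.gram.matrix)                # number of distinct uni-grams x number of lines
inspect(uni.gram.matrix[1:5, 1:5])  # counts of five terms in the first five lines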
Finally, we compute the total frequency of each n-gram and sort the counts in descending order. Since there are many words and word pairs to process (which would take a long time), we keep only the n-grams that meet a minimum frequency threshold and omit the rarer ones:
# keep only uni-grams that occur at least 100 times
uni.gram.threshold <- findFreqTerms(uni.gram.matrix, lowfreq=100)
# sum frequencies across all lines in the sub-sample
uni.gram <- rowSums(as.matrix(uni.gram.matrix[uni.gram.threshold,]))
# sort by frequency, descending
uni.gram <- sort(uni.gram, decreasing = TRUE)
# create a data frame
uni.gram <- data.frame(Word=names(uni.gram), frequency=uni.gram)
head(uni.gram)
Word frequency
the the 48341
and and 24158
for for 11061
that that 10322
you you 9362
with with 7183
# keep only bi-grams that occur at least 10 times
bi.gram.threshold <- findFreqTerms(bi.gram.matrix, lowfreq=10)
# sum frequencies across all lines in the sub-sample
bi.gram <- rowSums(as.matrix(bi.gram.matrix[bi.gram.threshold,]))
# sort by frequency, descending
bi.gram <- sort(bi.gram, decreasing = TRUE)
# create a data frame
bi.gram <- data.frame(Word=names(bi.gram), frequency=bi.gram)
head(bi.gram)
Word frequency
of the of the 4293
in the in the 4219
to the to the 2193
for the for the 2018
on the on the 1889
to be to be 1662
# keep only tri-grams that occur at least 10 times
tri.gram.threshold <- findFreqTerms(tri.gram.matrix, lowfreq=10)
# sum frequencies across all lines in the sub-sample
tri.gram <- rowSums(as.matrix(tri.gram.matrix[tri.gram.threshold,]))
# sort by frequency, descending
tri.gram <- sort(tri.gram, decreasing = TRUE)
# create a data frame
tri.gram <- data.frame(Word=names(tri.gram), frequency=tri.gram)
head(tri.gram)
Word frequency
one of the one of the 323
a lot of a lot of 300
thanks for the thanks for the 256
to be a to be a 203
going to be going to be 191
i want to i want to 176
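Visualize the most frequent uni-grams with a word cloud: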
library(wordcloud)
wordcloud(words = uni.gram$Word,
freq = uni.gram$frequency,
min.freq = 500,
random.order = FALSE,
scale = c(3, 0.5),
colors = rainbow(3))
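Plot histograms of the uni-gram, bi-gram and tri-gram frequency distributions: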
library(ggplot2)
p <- ggplot(uni.gram, aes(x=frequency)) +
geom_histogram()
p
p <- ggplot(bi.gram, aes(x=frequency)) +
geom_histogram()
p
p <- ggplot(tri.gram, aes(x=frequency)) +
geom_histogram()
p
Save the uni-gram, bi-gram and tri-gram frequency tables:
write.csv(uni.gram, file="../Data/n-gram/uni-gram.csv", row.names=FALSE)
write.csv(bi.gram, file="../Data/n-gram/bi-gram.csv", row.names=FALSE)
write.csv(tri.gram, file="../Data/n-gram/tri-gram.csv", row.names=FALSE)
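Looking ahead to the prediction task, the saved tri-gram table already supports a naive lookup: given the last two words typed, return the last word of the most frequent matching tri-gram. The sketch below is illustrative only, not the final model, and predict.next.word is a hypothetical helper:
# Naive next-word lookup from the tri-gram table (illustration only; a real model
# would need back-off to lower-order n-grams and smoothing for unseen prefixes)
predict.next.word <- function(last.two.words, tri.gram.table) {
  prefix <- paste0("^", tolower(last.two.words), " ")
  matches <- tri.gram.table[grepl(prefix, tri.gram.table$Word), ]
  if (nrow(matches) == 0) return(NA_character_)
  # the table is already sorted by frequency, so the first match is the best guess
  tail(strsplit(as.character(matches$Word[1]), " ")[[1]], 1)
}
predict.next.word("one of", tri.gram)  # should return "the" given the counts above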