This report serves as a milestone for the Data Science Capstone Project of Coursera’s Data Science specialization. The goal of this report is to display the process of working with the Swiftkey data to create a prediction algorithm. It explains the exploratory analysis and goals for the eventual app and algorithm. To keep the document concise, it only explains the major features of the data that have been identified and briefly summarizes the plans for creating the prediction algorithm and Shiny app in layman’s terms. There are tables and plots that illustrate important summaries of the data set. The information is organized in the four following sections:
These are all the libraries I used during this analysis.
Natural Language Processing
General Utilities
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RColorBrewer)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.2.5
I downloaded the data manually from this link and unzipped the folder containing the text files to a folder in my working directory (C:\data\SwiftKey\data).
For this exploratory analysis I will focus on the english text files. First I will set up the working directory and the variables with the path for the files.
wd <- "C:/data/SwiftKey"
setwd(wd)
path.data <- file.path(wd, "data")
path.data.en <- file.path(path.data, "en_US")
Next I looked at the files themselves. I created a table with the file size in gigabytes, the number of lines and number of words for each file:
df.files <- data.frame(
file = gsub(".*(?<=\\.)(\\w+)(?=\\.).*", "\\1", dir(path.data.en), perl = TRUE),
directory = dir(path.data.en, full.names = TRUE),
size = file.size(dir(path.data.en, full.names = TRUE))/1e9,
lines = as.numeric(sapply(dir(path.data.en, full.names = TRUE), function(x) { gsub('(\\d+).*','\\1', system(sprintf("wc -l %s", x), intern = TRUE)) })),
words = as.numeric(sapply(dir(path.data.en, full.names = TRUE), function(x) { gsub('(\\d+).*','\\1', system(sprintf("wc -w %s", x), intern = TRUE)) })),
row.names = NULL,
stringsAsFactors = FALSE
)
knitr::kable(df.files %>% select(-directory))
| file | size | lines | words |
|---|---|---|---|
| blogs | 0.2101600 | 899288 | 37272578 |
| news | 0.2058119 | 1010242 | 34309642 |
| 0.1671053 | 2360148 | 30341028 |
Since the files are too big to store them in memory, I needed to sample each file. To do this I selected randomly ten thousand lines in each file. I made the assumption that the lines in each file are random and not organized in any way that could skew the data.
sampleFile <- function(file, n) {
start <- Sys.time ()
con <- file(file, "r")
data <- c()
i <- 0
while(i < n){
line <- readLines(con, n = 1, warn = FALSE)
if(rbinom(1,1,0.1)) {
data <- append(data, line)
i <- i + 1
}
}
close(con)
end <- Sys.time() - start
print(end)
return(data)
}
sampleSize <- 10000
data.blogs <- sampleFile(df.files$directory[1], sampleSize)
## Time difference of 13.27132 secs
data.news <- sampleFile(df.files$directory[2], sampleSize)
## Time difference of 9.943623 secs
data.twitter <- sampleFile(df.files$directory[3], sampleSize)
## Time difference of 10.5122 secs
In order to summarize the data, it was better to first pre-process it by tokenizing the text and getting the n-grams for each of the files. For this application I decided to ignore punctuation and case. The drawbacks of these assumptions are that I won’t be able to model and predict things like possessives, abbreviations, acronyms, etc.
This is the function I used to preprocess the data and generate the corpus.
generateCorpus <- function(text, filters) {
start <- Sys.time()
vector <- VectorSource(paste(text, collapse=" "))
corpus <- Corpus(vector)
if (length(filters) > 0) {
for (filter in filters) {
corpus %<>% tm_map(eval(parse(text = filter)))
}
}
end <- Sys.time() - start
print(end)
return(corpus)
}
I generated the corpi by removing spaces, punctuation and setting everything to lower case.
filters <- c("content_transformer(tolower)", "stripWhitespace", "removePunctuation", "removeNumbers")
corpus.blogs <- generateCorpus(data.blogs, filters)
## Time difference of 1.973114 secs
corpus.news <- generateCorpus(data.news, filters)
## Time difference of 1.310073 secs
corpus.twitter <- generateCorpus(data.twitter, filters)
## Time difference of 0.518028 secs
This is the function I created to extract N-grams. I have an argument short that determines if all of the n-grams that appeared only once should be weeded out.
generateNgrams <- function(corpus, n, short = TRUE) {
start <- Sys.time()
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm.matrix <- as.matrix(tdm)
tdm.df <- data.frame(ngram = rownames(tdm.matrix), freq = tdm.matrix[,1], stringsAsFactors = FALSE)
rownames(tdm.df) <- NULL
if (short) {
tdm.df %>% filter(freq > 1)
}
end <- Sys.time() - start
print(end)
return(tdm.df %>% arrange(desc(freq)))
}
First I got the 1-grams for each of the samples without removing those n-grams that appear only once.
ngrams.1.blogs <- generateNgrams(corpus.blogs, 1, short = FALSE)
## Time difference of 8.51951 mins
ngrams.1.news <- generateNgrams(corpus.news, 1, short = FALSE)
## Time difference of 2.743273 mins
ngrams.1.twitter <- generateNgrams(corpus.twitter, 1, short = FALSE)
## Time difference of 30.3023 secs
Here’s a table with the number of unique words, total word count and ratio between unique and total words for each file.
knitr::kable(data.frame(file = df.files$file, uniques = sapply(c(ngrams.1.blogs,
ngrams.1.news, ngrams.1.twitter), length), words = df.files$words, uwR = sapply(c(ngrams.1.blogs,
ngrams.1.news, ngrams.1.twitter), length)/df.files$words))
| file | uniques | words | uwR |
|---|---|---|---|
| blogs | 33214 | 37272578 | 0.0008911 |
| news | 33214 | 34309642 | 0.0009681 |
| 27092 | 30341028 | 0.0008929 | |
| blogs | 27092 | 37272578 | 0.0007269 |
| news | 15302 | 34309642 | 0.0004460 |
| 15302 | 30341028 | 0.0005043 |
Here’s a table with the top 10 words by frequency for each file.
knitr::kable(data.frame(blogs.words = ngrams.1.blogs$ngram[1:10], news.words = ngrams.1.news$ngram[1:10],
twitter.words = ngrams.1.twitter$ngram[1:10]))
| blogs.words | news.words | twitter.words |
|---|---|---|
| the | the | the |
| and | and | you |
| that | for | and |
| for | that | for |
| you | with | that |
| with | said | with |
| was | was | your |
| this | his | are |
| have | but | this |
| but | from | just |
Now let’s remove the stop words (the most common words) and show the list.
knitr::kable(data.frame(blogs.words = head(ngrams.1.blogs$ngram[!(ngrams.1.blogs$ngram %in%
stopwords("english"))], 10), news.words = head(ngrams.1.news$ngram[!(ngrams.1.news$ngram %in%
stopwords("english"))], 10), twitter.words = head(ngrams.1.twitter$ngram[!(ngrams.1.twitter$ngram %in%
stopwords("english"))], 10)))
| blogs.words | news.words | twitter.words |
|---|---|---|
| one | said | just |
| will | will | like |
| like | one | love |
| just | new | get |
| can | also | good |
| time | two | will |
| get | just | now |
| know | can | day |
| now | time | can |
| people | year | thanks |
The following graph shows the percentage of those 1-grams that appear only once versus the total of 1-grams.
d <- data.frame(file = df.files$file, all = c(length(ngrams.1.blogs$ngram),
length(ngrams.1.news$ngram), length(ngrams.1.twitter$ngram)), repeated = c(sum(ngrams.1.blogs$freq >
1), sum(ngrams.1.news$freq > 1), sum(ngrams.1.twitter$freq > 1))) %>% gather(type,
count, 2:3)
ggplot(d, aes(file, count, fill = type)) + geom_bar(position = "dodge", stat = "identity") +
scale_fill_brewer(palette = "Pastel1")
We want to have the stop words for the prediction algorithm since people use these words when writing (e.g. the, for, you).
Next I got the 2-grams, 3-grams and 4-grams for all of the files.
ngrams.2.blogs <- generateNgrams(corpus.blogs, 2, short = FALSE)
## Time difference of 14.9426 mins
ngrams.2.news <- generateNgrams(corpus.news, 2, short = FALSE)
## Time difference of 15.31443 mins
ngrams.2.twitter <- generateNgrams(corpus.twitter, 2, short = FALSE)
## Time difference of 54.65114 secs
knitr::kable(data.frame(blogs.words = ngrams.2.blogs$ngram[1:10], news.words = ngrams.2.news$ngram[1:10],
twitter.words = ngrams.2.twitter$ngram[1:10]))
| blogs.words | news.words | twitter.words |
|---|---|---|
| of the | of the | in the |
| in the | in the | for the |
| to the | to the | of the |
| on the | on the | to be |
| to be | for the | on the |
| and the | at the | to the |
| for the | and the | thanks for |
| i was | in a | i love |
| i have | to be | going to |
| and i | from the | i am |
ngrams.3.blogs <- generateNgrams(corpus.blogs, 3, short = FALSE)
## Time difference of 5.435591 mins
ngrams.3.news <- generateNgrams(corpus.news, 3, short = FALSE)
## Time difference of 1.783212 mins
ngrams.3.twitter <- generateNgrams(corpus.twitter, 3, short = FALSE)
## Time difference of 36.64492 secs
knitr::kable(data.frame(blogs.words = ngrams.3.blogs$ngram[1:10], news.words = ngrams.3.news$ngram[1:10],
twitter.words = ngrams.3.twitter$ngram[1:10]))
| blogs.words | news.words | twitter.words |
|---|---|---|
| one of the | one of the | thanks for the |
| a lot of | a lot of | cant wait to |
| to be a | going to be | going to be |
| out of the | as well as | thank you for |
| the end of | some of the | i want to |
| a couple of | according to the | looking forward to |
| as well as | in the first | i need to |
| the fact that | out of the | for the follow |
| i want to | the end of | i love you |
| i have to | part of the | to see you |
ngrams.4.blogs <- generateNgrams(corpus.blogs, 4, short = FALSE)
## Time difference of 5.685228 mins
ngrams.4.news <- generateNgrams(corpus.news, 4, short = FALSE)
## Time difference of 1.858023 mins
ngrams.4.twitter <- generateNgrams(corpus.twitter, 4, short = FALSE)
## Time difference of 21.19924 secs
knitr::kable(data.frame(blogs.words = ngrams.4.blogs$ngram[1:10], news.words = ngrams.4.news$ngram[1:10],
twitter.words = ngrams.4.twitter$ngram[1:10]))
| blogs.words | news.words | twitter.words |
|---|---|---|
| the end of the | for the first time | thanks for the follow |
| at the end of | when it comes to | x x x x |
| the rest of the | is one of the | thank you for the |
| one of the most | the end of the | thanks for the rt |
| is one of the | at the end of | cant wait to see |
| at the same time | the rest of the | is going to be |
| to be able to | one of the most | are you going to |
| in the middle of | at the university of | for the first time |
| if you want to | one of the best | going to be a |
| for those of you | a member of the | hope to see you |
Since our app will aim to predict any kind of text and not only that for specific purposes (blogging, writing news, etc), we should aggregate all of the ngrams and frequencies.
ngrams.1.all <- full_join(ngrams.1.blogs, ngrams.1.news) %>%
full_join(ngrams.1.twitter) %>%
group_by(ngram) %>%
summarise(freq = sum(freq)) %>%
arrange(desc(freq))
## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")
ngrams.2.all <- full_join(ngrams.2.blogs, ngrams.2.news) %>%
full_join(ngrams.2.twitter) %>%
group_by(ngram) %>%
summarise(freq = sum(freq)) %>%
arrange(desc(freq))
## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")
ngrams.3.all <- full_join(ngrams.3.blogs, ngrams.3.news) %>%
full_join(ngrams.3.twitter) %>%
group_by(ngram) %>%
summarise(freq = sum(freq)) %>%
arrange(desc(freq))
## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")
ngrams.4.all <- full_join(ngrams.4.blogs, ngrams.4.news) %>%
full_join(ngrams.4.twitter) %>%
group_by(ngram) %>%
summarise(freq = sum(freq)) %>%
arrange(desc(freq))
## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")
We want to understand how many n-grams we need to make up a certain percentage of the text (coverage). For this I’ve created the following function:
coverPercentage <- function(d, p, percentage = FALSE) {
start <- Sys.time()
l <- dim(d)[1]
p <- sum(d[,2])*p
sum <- 0
for (i in 1:l) {
sum <- sum + d[i,2]
if (sum >= p) {
end <- Sys.time() - start
print(end)
if (percentage) {
return(i/l)
} else {
return(i)
}
}
}
}
The number of 1-grams we need to cover 50% of the text is 307.
I have also created a function to plot this:
plotCoverPercentage <- function(d, range = seq(0,1,0.1)) {
x = sapply(range, coverPercentage, d = d)
y = range
qplot(x = x, y = y, geom = "line", ylab = "Percentage of text covered", xlab = "Number of N-grams", main = "Percentage of text covered \n by most frequent N-grams")
}
Let’s plot the coverage curve for 1-grams:
plotCoverPercentage(ngrams.1.all, range = seq(0,1,0.1))
## Time difference of 0 secs
## Time difference of 0.003998995 secs
## Time difference of 0.01200104 secs
## Time difference of 0.03200006 secs
## Time difference of 0.091856 secs
## Time difference of 0.4341888 secs
## Time difference of 0.9336991 secs
## Time difference of 1.926254 secs
## Time difference of 4.65792 secs
## Time difference of 12.60809 secs
## Time difference of 51.18433 secs
Here’s a function to plot the Wordcloud of the top n n-grams by frequency:
plotWordCloud <- function(d, n = 10) {
d <- d[1:n,]
pal <- brewer.pal(8,"Accent")
wordcloud(words = d$ngram, freq = d$freq, random.order = FALSE, random.color = TRUE, colors = pal, scale = c(3,.8))
}
Let’s see the wordcloud for the top 50 1-grams of all the data.
plotWordCloud(ngrams.1.all, n = 100)
Here’s a function to plot a histogram of the top n-grams:
plotFrequency <- function(d, n = 10) {
require(ggplot2)
d = d[1:n,]
ggplot(data = d, aes(x = reorder(ngram, -freq), y = freq)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Ngram")
}
Finally let’s see the wordcloud for the top 10 1-grams, 2-grams, 3-grams and 4-grams of all the data:
plotFrequency(ngrams.1.all)
plotFrequency(ngrams.2.all)
plotFrequency(ngrams.3.all)
plotFrequency(ngrams.4.all)
There were no surprises, stop words are always the top 10 words. The vast majority of n-grams are used only one time. The dataset is too big.
I will need to scale the algorithm. For this I will have to get rid of all of those n-grams that are used less than X times (maybe 20 times). And to smooth the model I will have to add plus one to the frequency of each n-gram, even those that have never appeared. I will then have to use a markov-chain model with naive-bayes to make the predicitions.