SwiftKey - Coursera Project - Milestone Report

This report serves as a milestone for the Data Science Capstone Project of Coursera’s Data Science specialization. The goal of this report is to display the process of working with the Swiftkey data to create a prediction algorithm. It explains the exploratory analysis and goals for the eventual app and algorithm. To keep the document concise, it only explains the major features of the data that have been identified and briefly summarizes the plans for creating the prediction algorithm and Shiny app in layman’s terms. There are tables and plots that illustrate important summaries of the data set. The information is organized in the four following sections:

Libraries
Downloading and loading the data.
Summary statistics.
Interesting findings.
Prediction algorithm and Shiny app.

Libraries

These are all the libraries I used during this analysis.

Natural Language Processing

General Utilities

library(magrittr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

library(RColorBrewer)
library(wordcloud)

## Warning: package 'wordcloud' was built under R version 3.2.5

Downloading and loading the data

I downloaded the data manually from this link and unzipped the folder containing the text files to a folder in my working directory (C:\data\SwiftKey\data).

For this exploratory analysis I will focus on the english text files. First I will set up the working directory and the variables with the path for the files.

wd <- "C:/data/SwiftKey"
setwd(wd)
path.data <- file.path(wd, "data")
path.data.en <- file.path(path.data, "en_US")

Next I looked at the files themselves. I created a table with the file size in gigabytes, the number of lines and number of words for each file:

df.files <- data.frame(
  file = gsub(".*(?<=\\.)(\\w+)(?=\\.).*", "\\1", dir(path.data.en), perl = TRUE),
  directory = dir(path.data.en, full.names = TRUE),
  size = file.size(dir(path.data.en, full.names = TRUE))/1e9,
  lines = as.numeric(sapply(dir(path.data.en, full.names = TRUE), function(x) { gsub('(\\d+).*','\\1', system(sprintf("wc -l %s", x), intern = TRUE)) })),
  words = as.numeric(sapply(dir(path.data.en, full.names = TRUE), function(x) { gsub('(\\d+).*','\\1', system(sprintf("wc -w %s", x), intern = TRUE)) })),
  row.names = NULL,
  stringsAsFactors = FALSE
)

knitr::kable(df.files %>% select(-directory))

file	size	lines	words
blogs	0.2101600	899288	37272578
news	0.2058119	1010242	34309642
twitter	0.1671053	2360148	30341028

Since the files are too big to store them in memory, I needed to sample each file. To do this I selected randomly ten thousand lines in each file. I made the assumption that the lines in each file are random and not organized in any way that could skew the data.

sampleFile <- function(file, n) {
  start <- Sys.time ()
  con <- file(file, "r")
  data <- c()
  i <- 0
  while(i < n){
    line <- readLines(con, n = 1, warn = FALSE)
    
    if(rbinom(1,1,0.1)) {
      data <- append(data, line)
      i <- i + 1
    }
  }
  close(con)
  end <- Sys.time() - start
  print(end)
  return(data)
}

sampleSize <- 10000
data.blogs <- sampleFile(df.files$directory[1], sampleSize)

## Time difference of 13.27132 secs

data.news <- sampleFile(df.files$directory[2], sampleSize)

## Time difference of 9.943623 secs

data.twitter <- sampleFile(df.files$directory[3], sampleSize)

## Time difference of 10.5122 secs

Summary statistics

In order to summarize the data, it was better to first pre-process it by tokenizing the text and getting the n-grams for each of the files. For this application I decided to ignore punctuation and case. The drawbacks of these assumptions are that I won’t be able to model and predict things like possessives, abbreviations, acronyms, etc.

This is the function I used to preprocess the data and generate the corpus.

generateCorpus <- function(text, filters) {
    start <- Sys.time()
    
    vector <- VectorSource(paste(text, collapse=" "))
    corpus <- Corpus(vector)

    if (length(filters) > 0) {
      for (filter in filters) {
        corpus %<>% tm_map(eval(parse(text = filter)))
      }
    }
    
    end <- Sys.time() - start
    print(end)
    return(corpus)
}

I generated the corpi by removing spaces, punctuation and setting everything to lower case.

filters <- c("content_transformer(tolower)", "stripWhitespace", "removePunctuation", "removeNumbers")

corpus.blogs <- generateCorpus(data.blogs, filters)

## Time difference of 1.973114 secs

corpus.news <- generateCorpus(data.news, filters)

## Time difference of 1.310073 secs

corpus.twitter <- generateCorpus(data.twitter, filters)

## Time difference of 0.518028 secs

This is the function I created to extract N-grams. I have an argument short that determines if all of the n-grams that appeared only once should be weeded out.

generateNgrams <- function(corpus, n, short = TRUE) {
  start <- Sys.time()
  
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
  tdm.matrix <- as.matrix(tdm)
  tdm.df <- data.frame(ngram = rownames(tdm.matrix), freq = tdm.matrix[,1], stringsAsFactors = FALSE)
  rownames(tdm.df) <- NULL
  if (short) {
    tdm.df %>% filter(freq > 1)
  }
  
  end <- Sys.time() - start
  print(end)
  return(tdm.df %>% arrange(desc(freq)))
}

First I got the 1-grams for each of the samples without removing those n-grams that appear only once.

ngrams.1.blogs <- generateNgrams(corpus.blogs, 1, short = FALSE)

## Time difference of 8.51951 mins

ngrams.1.news <- generateNgrams(corpus.news, 1, short = FALSE)

## Time difference of 2.743273 mins

ngrams.1.twitter <- generateNgrams(corpus.twitter, 1, short = FALSE)

## Time difference of 30.3023 secs

Here’s a table with the number of unique words, total word count and ratio between unique and total words for each file.

knitr::kable(data.frame(file = df.files$file, uniques = sapply(c(ngrams.1.blogs, 
    ngrams.1.news, ngrams.1.twitter), length), words = df.files$words, uwR = sapply(c(ngrams.1.blogs, 
    ngrams.1.news, ngrams.1.twitter), length)/df.files$words))

file	uniques	words	uwR
blogs	33214	37272578	0.0008911
news	33214	34309642	0.0009681
twitter	27092	30341028	0.0008929
blogs	27092	37272578	0.0007269
news	15302	34309642	0.0004460
twitter	15302	30341028	0.0005043

Here’s a table with the top 10 words by frequency for each file.

knitr::kable(data.frame(blogs.words = ngrams.1.blogs$ngram[1:10], news.words = ngrams.1.news$ngram[1:10], 
    twitter.words = ngrams.1.twitter$ngram[1:10]))

blogs.words	news.words	twitter.words
the	the	the
and	and	you
that	for	and
for	that	for
you	with	that
with	said	with
was	was	your
this	his	are
have	but	this
but	from	just

Now let’s remove the stop words (the most common words) and show the list.

knitr::kable(data.frame(blogs.words = head(ngrams.1.blogs$ngram[!(ngrams.1.blogs$ngram %in% 
    stopwords("english"))], 10), news.words = head(ngrams.1.news$ngram[!(ngrams.1.news$ngram %in% 
    stopwords("english"))], 10), twitter.words = head(ngrams.1.twitter$ngram[!(ngrams.1.twitter$ngram %in% 
    stopwords("english"))], 10)))

blogs.words	news.words	twitter.words
one	said	just
will	will	like
like	one	love
just	new	get
can	also	good
time	two	will
get	just	now
know	can	day
now	time	can
people	year	thanks

The following graph shows the percentage of those 1-grams that appear only once versus the total of 1-grams.

d <- data.frame(file = df.files$file, all = c(length(ngrams.1.blogs$ngram), 
    length(ngrams.1.news$ngram), length(ngrams.1.twitter$ngram)), repeated = c(sum(ngrams.1.blogs$freq > 
    1), sum(ngrams.1.news$freq > 1), sum(ngrams.1.twitter$freq > 1))) %>% gather(type, 
    count, 2:3)

ggplot(d, aes(file, count, fill = type)) + geom_bar(position = "dodge", stat = "identity") + 
    scale_fill_brewer(palette = "Pastel1")

We want to have the stop words for the prediction algorithm since people use these words when writing (e.g. the, for, you).

Next I got the 2-grams, 3-grams and 4-grams for all of the files.

ngrams.2.blogs <- generateNgrams(corpus.blogs, 2, short = FALSE)

## Time difference of 14.9426 mins

ngrams.2.news <- generateNgrams(corpus.news, 2, short = FALSE)

## Time difference of 15.31443 mins

ngrams.2.twitter <- generateNgrams(corpus.twitter, 2, short = FALSE)

## Time difference of 54.65114 secs

knitr::kable(data.frame(blogs.words = ngrams.2.blogs$ngram[1:10], news.words = ngrams.2.news$ngram[1:10], 
    twitter.words = ngrams.2.twitter$ngram[1:10]))

blogs.words	news.words	twitter.words
of the	of the	in the
in the	in the	for the
to the	to the	of the
on the	on the	to be
to be	for the	on the
and the	at the	to the
for the	and the	thanks for
i was	in a	i love
i have	to be	going to
and i	from the	i am

ngrams.3.blogs <- generateNgrams(corpus.blogs, 3, short = FALSE)

## Time difference of 5.435591 mins

ngrams.3.news <- generateNgrams(corpus.news, 3, short = FALSE)

## Time difference of 1.783212 mins

ngrams.3.twitter <- generateNgrams(corpus.twitter, 3, short = FALSE)

## Time difference of 36.64492 secs

knitr::kable(data.frame(blogs.words = ngrams.3.blogs$ngram[1:10], news.words = ngrams.3.news$ngram[1:10], 
    twitter.words = ngrams.3.twitter$ngram[1:10]))

blogs.words	news.words	twitter.words
one of the	one of the	thanks for the
a lot of	a lot of	cant wait to
to be a	going to be	going to be
out of the	as well as	thank you for
the end of	some of the	i want to
a couple of	according to the	looking forward to
as well as	in the first	i need to
the fact that	out of the	for the follow
i want to	the end of	i love you
i have to	part of the	to see you

ngrams.4.blogs <- generateNgrams(corpus.blogs, 4, short = FALSE)

## Time difference of 5.685228 mins

ngrams.4.news <- generateNgrams(corpus.news, 4, short = FALSE)

## Time difference of 1.858023 mins

ngrams.4.twitter <- generateNgrams(corpus.twitter, 4, short = FALSE)

## Time difference of 21.19924 secs

knitr::kable(data.frame(blogs.words = ngrams.4.blogs$ngram[1:10], news.words = ngrams.4.news$ngram[1:10], 
    twitter.words = ngrams.4.twitter$ngram[1:10]))

blogs.words	news.words	twitter.words
the end of the	for the first time	thanks for the follow
at the end of	when it comes to	x x x x
the rest of the	is one of the	thank you for the
one of the most	the end of the	thanks for the rt
is one of the	at the end of	cant wait to see
at the same time	the rest of the	is going to be
to be able to	one of the most	are you going to
in the middle of	at the university of	for the first time
if you want to	one of the best	going to be a
for those of you	a member of the	hope to see you

Since our app will aim to predict any kind of text and not only that for specific purposes (blogging, writing news, etc), we should aggregate all of the ngrams and frequencies.

ngrams.1.all <- full_join(ngrams.1.blogs, ngrams.1.news) %>% 
  full_join(ngrams.1.twitter) %>% 
  group_by(ngram) %>% 
  summarise(freq = sum(freq)) %>% 
  arrange(desc(freq))

## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")

ngrams.2.all <- full_join(ngrams.2.blogs, ngrams.2.news) %>% 
  full_join(ngrams.2.twitter) %>% 
  group_by(ngram) %>% 
  summarise(freq = sum(freq)) %>% 
  arrange(desc(freq))

## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")

ngrams.3.all <- full_join(ngrams.3.blogs, ngrams.3.news) %>% 
  full_join(ngrams.3.twitter) %>% 
  group_by(ngram) %>% 
  summarise(freq = sum(freq)) %>% 
  arrange(desc(freq))

## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")

ngrams.4.all <- full_join(ngrams.4.blogs, ngrams.4.news) %>% 
  full_join(ngrams.4.twitter) %>% 
  group_by(ngram) %>% 
  summarise(freq = sum(freq)) %>% 
  arrange(desc(freq))

## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")

We want to understand how many n-grams we need to make up a certain percentage of the text (coverage). For this I’ve created the following function:

coverPercentage <- function(d, p, percentage = FALSE) {
  start <- Sys.time()
  
  l <- dim(d)[1]
  p <- sum(d[,2])*p
  
  sum <- 0
  
  for (i in 1:l) {
    sum <- sum + d[i,2] 
    if (sum >= p) {
      end <- Sys.time() - start
      print(end)
      
      if (percentage) {
        return(i/l)
      } else {
        return(i) 
      }
    }
  }
}

The number of 1-grams we need to cover 50% of the text is 307.

I have also created a function to plot this:

plotCoverPercentage <- function(d, range = seq(0,1,0.1)) {
  x = sapply(range, coverPercentage, d = d)
  y = range
  qplot(x = x, y = y, geom = "line", ylab = "Percentage of text covered", xlab = "Number of N-grams", main = "Percentage of text covered \n by most frequent N-grams")
}

Let’s plot the coverage curve for 1-grams:

plotCoverPercentage(ngrams.1.all, range = seq(0,1,0.1))

## Time difference of 0 secs
## Time difference of 0.003998995 secs
## Time difference of 0.01200104 secs
## Time difference of 0.03200006 secs
## Time difference of 0.091856 secs
## Time difference of 0.4341888 secs
## Time difference of 0.9336991 secs
## Time difference of 1.926254 secs
## Time difference of 4.65792 secs
## Time difference of 12.60809 secs
## Time difference of 51.18433 secs

Here’s a function to plot the Wordcloud of the top n n-grams by frequency:

plotWordCloud <- function(d, n = 10) {
  d <- d[1:n,]
  pal <- brewer.pal(8,"Accent")
  wordcloud(words = d$ngram, freq = d$freq, random.order = FALSE, random.color = TRUE, colors = pal, scale = c(3,.8))
}

Let’s see the wordcloud for the top 50 1-grams of all the data.

plotWordCloud(ngrams.1.all, n = 100)

Here’s a function to plot a histogram of the top n-grams:

plotFrequency <- function(d, n = 10) {
  require(ggplot2)
  d = d[1:n,]
  ggplot(data = d, aes(x = reorder(ngram, -freq), y = freq)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Ngram")
}

Finally let’s see the wordcloud for the top 10 1-grams, 2-grams, 3-grams and 4-grams of all the data:

plotFrequency(ngrams.1.all)

plotFrequency(ngrams.2.all)

plotFrequency(ngrams.3.all)

plotFrequency(ngrams.4.all)

Interesting findings

There were no surprises, stop words are always the top 10 words. The vast majority of n-grams are used only one time. The dataset is too big.

Prediction algorithm and Shiny app

I will need to scale the algorithm. For this I will have to get rid of all of those n-grams that are used less than X times (maybe 20 times). And to smooth the model I will have to add plus one to the frequency of each n-gram, even those that have never appeared. I will then have to use a markov-chain model with naive-bayes to make the predicitions.