This report serves as a milestone for the Data Science Capstone Project of Coursera’s Data Science specialization. The goal of this report is to display the process of working with the Swiftkey data to create a prediction algorithm. It explains the exploratory analysis and goals for the eventual app and algorithm. To keep the document concise, it only explains the major features of the data that have been identified and briefly summarizes the plans for creating the prediction algorithm and Shiny app in layman’s terms. There are tables and plots that illustrate important summaries of the data set. The information is organized in the four following sections:

  1. Libraries
  2. Downloading and loading the data.
  3. Summary statistics.
  4. Interesting findings.
  5. Prediction algorithm and Shiny app.

Libraries

These are all the libraries I used during this analysis.

Natural Language Processing

General Utilities

library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(RColorBrewer)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.2.5

Downloading and loading the data

I downloaded the data manually from this link and unzipped the folder containing the text files to a folder in my working directory (C:\data\SwiftKey\data).

For this exploratory analysis I will focus on the english text files. First I will set up the working directory and the variables with the path for the files.

wd <- "C:/data/SwiftKey"
setwd(wd)
path.data <- file.path(wd, "data")
path.data.en <- file.path(path.data, "en_US")

Next I looked at the files themselves. I created a table with the file size in gigabytes, the number of lines and number of words for each file:

df.files <- data.frame(
  file = gsub(".*(?<=\\.)(\\w+)(?=\\.).*", "\\1", dir(path.data.en), perl = TRUE),
  directory = dir(path.data.en, full.names = TRUE),
  size = file.size(dir(path.data.en, full.names = TRUE))/1e9,
  lines = as.numeric(sapply(dir(path.data.en, full.names = TRUE), function(x) { gsub('(\\d+).*','\\1', system(sprintf("wc -l %s", x), intern = TRUE)) })),
  words = as.numeric(sapply(dir(path.data.en, full.names = TRUE), function(x) { gsub('(\\d+).*','\\1', system(sprintf("wc -w %s", x), intern = TRUE)) })),
  row.names = NULL,
  stringsAsFactors = FALSE
)
knitr::kable(df.files %>% select(-directory))
file size lines words
blogs 0.2101600 899288 37272578
news 0.2058119 1010242 34309642
twitter 0.1671053 2360148 30341028

Since the files are too big to store them in memory, I needed to sample each file. To do this I selected randomly ten thousand lines in each file. I made the assumption that the lines in each file are random and not organized in any way that could skew the data.

sampleFile <- function(file, n) {
  start <- Sys.time ()
  con <- file(file, "r")
  data <- c()
  i <- 0
  while(i < n){
    line <- readLines(con, n = 1, warn = FALSE)
    
    if(rbinom(1,1,0.1)) {
      data <- append(data, line)
      i <- i + 1
    }
  }
  close(con)
  end <- Sys.time() - start
  print(end)
  return(data)
}

sampleSize <- 10000
data.blogs <- sampleFile(df.files$directory[1], sampleSize)
## Time difference of 13.27132 secs
data.news <- sampleFile(df.files$directory[2], sampleSize)
## Time difference of 9.943623 secs
data.twitter <- sampleFile(df.files$directory[3], sampleSize)
## Time difference of 10.5122 secs

Summary statistics

In order to summarize the data, it was better to first pre-process it by tokenizing the text and getting the n-grams for each of the files. For this application I decided to ignore punctuation and case. The drawbacks of these assumptions are that I won’t be able to model and predict things like possessives, abbreviations, acronyms, etc.

This is the function I used to preprocess the data and generate the corpus.

generateCorpus <- function(text, filters) {
    start <- Sys.time()
    
    vector <- VectorSource(paste(text, collapse=" "))
    corpus <- Corpus(vector)

    if (length(filters) > 0) {
      for (filter in filters) {
        corpus %<>% tm_map(eval(parse(text = filter)))
      }
    }
    
    end <- Sys.time() - start
    print(end)
    return(corpus)
}

I generated the corpi by removing spaces, punctuation and setting everything to lower case.

filters <- c("content_transformer(tolower)", "stripWhitespace", "removePunctuation", "removeNumbers")

corpus.blogs <- generateCorpus(data.blogs, filters)
## Time difference of 1.973114 secs
corpus.news <- generateCorpus(data.news, filters)
## Time difference of 1.310073 secs
corpus.twitter <- generateCorpus(data.twitter, filters)
## Time difference of 0.518028 secs

This is the function I created to extract N-grams. I have an argument short that determines if all of the n-grams that appeared only once should be weeded out.

generateNgrams <- function(corpus, n, short = TRUE) {
  start <- Sys.time()
  
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
  tdm.matrix <- as.matrix(tdm)
  tdm.df <- data.frame(ngram = rownames(tdm.matrix), freq = tdm.matrix[,1], stringsAsFactors = FALSE)
  rownames(tdm.df) <- NULL
  if (short) {
    tdm.df %>% filter(freq > 1)
  }
  
  end <- Sys.time() - start
  print(end)
  return(tdm.df %>% arrange(desc(freq)))
}

First I got the 1-grams for each of the samples without removing those n-grams that appear only once.

ngrams.1.blogs <- generateNgrams(corpus.blogs, 1, short = FALSE)
## Time difference of 8.51951 mins
ngrams.1.news <- generateNgrams(corpus.news, 1, short = FALSE)
## Time difference of 2.743273 mins
ngrams.1.twitter <- generateNgrams(corpus.twitter, 1, short = FALSE)
## Time difference of 30.3023 secs

Here’s a table with the number of unique words, total word count and ratio between unique and total words for each file.

knitr::kable(data.frame(file = df.files$file, uniques = sapply(c(ngrams.1.blogs, 
    ngrams.1.news, ngrams.1.twitter), length), words = df.files$words, uwR = sapply(c(ngrams.1.blogs, 
    ngrams.1.news, ngrams.1.twitter), length)/df.files$words))
file uniques words uwR
blogs 33214 37272578 0.0008911
news 33214 34309642 0.0009681
twitter 27092 30341028 0.0008929
blogs 27092 37272578 0.0007269
news 15302 34309642 0.0004460
twitter 15302 30341028 0.0005043

Here’s a table with the top 10 words by frequency for each file.

knitr::kable(data.frame(blogs.words = ngrams.1.blogs$ngram[1:10], news.words = ngrams.1.news$ngram[1:10], 
    twitter.words = ngrams.1.twitter$ngram[1:10]))
blogs.words news.words twitter.words
the the the
and and you
that for and
for that for
you with that
with said with
was was your
this his are
have but this
but from just

Now let’s remove the stop words (the most common words) and show the list.

knitr::kable(data.frame(blogs.words = head(ngrams.1.blogs$ngram[!(ngrams.1.blogs$ngram %in% 
    stopwords("english"))], 10), news.words = head(ngrams.1.news$ngram[!(ngrams.1.news$ngram %in% 
    stopwords("english"))], 10), twitter.words = head(ngrams.1.twitter$ngram[!(ngrams.1.twitter$ngram %in% 
    stopwords("english"))], 10)))
blogs.words news.words twitter.words
one said just
will will like
like one love
just new get
can also good
time two will
get just now
know can day
now time can
people year thanks

The following graph shows the percentage of those 1-grams that appear only once versus the total of 1-grams.

d <- data.frame(file = df.files$file, all = c(length(ngrams.1.blogs$ngram), 
    length(ngrams.1.news$ngram), length(ngrams.1.twitter$ngram)), repeated = c(sum(ngrams.1.blogs$freq > 
    1), sum(ngrams.1.news$freq > 1), sum(ngrams.1.twitter$freq > 1))) %>% gather(type, 
    count, 2:3)

ggplot(d, aes(file, count, fill = type)) + geom_bar(position = "dodge", stat = "identity") + 
    scale_fill_brewer(palette = "Pastel1")

We want to have the stop words for the prediction algorithm since people use these words when writing (e.g. the, for, you).

Next I got the 2-grams, 3-grams and 4-grams for all of the files.

ngrams.2.blogs <- generateNgrams(corpus.blogs, 2, short = FALSE)
## Time difference of 14.9426 mins
ngrams.2.news <- generateNgrams(corpus.news, 2, short = FALSE)
## Time difference of 15.31443 mins
ngrams.2.twitter <- generateNgrams(corpus.twitter, 2, short = FALSE)
## Time difference of 54.65114 secs
knitr::kable(data.frame(blogs.words = ngrams.2.blogs$ngram[1:10], news.words = ngrams.2.news$ngram[1:10], 
    twitter.words = ngrams.2.twitter$ngram[1:10]))
blogs.words news.words twitter.words
of the of the in the
in the in the for the
to the to the of the
on the on the to be
to be for the on the
and the at the to the
for the and the thanks for
i was in a i love
i have to be going to
and i from the i am
ngrams.3.blogs <- generateNgrams(corpus.blogs, 3, short = FALSE)
## Time difference of 5.435591 mins
ngrams.3.news <- generateNgrams(corpus.news, 3, short = FALSE)
## Time difference of 1.783212 mins
ngrams.3.twitter <- generateNgrams(corpus.twitter, 3, short = FALSE)
## Time difference of 36.64492 secs
knitr::kable(data.frame(blogs.words = ngrams.3.blogs$ngram[1:10], news.words = ngrams.3.news$ngram[1:10], 
    twitter.words = ngrams.3.twitter$ngram[1:10]))
blogs.words news.words twitter.words
one of the one of the thanks for the
a lot of a lot of cant wait to
to be a going to be going to be
out of the as well as thank you for
the end of some of the i want to
a couple of according to the looking forward to
as well as in the first i need to
the fact that out of the for the follow
i want to the end of i love you
i have to part of the to see you
ngrams.4.blogs <- generateNgrams(corpus.blogs, 4, short = FALSE)
## Time difference of 5.685228 mins
ngrams.4.news <- generateNgrams(corpus.news, 4, short = FALSE)
## Time difference of 1.858023 mins
ngrams.4.twitter <- generateNgrams(corpus.twitter, 4, short = FALSE)
## Time difference of 21.19924 secs
knitr::kable(data.frame(blogs.words = ngrams.4.blogs$ngram[1:10], news.words = ngrams.4.news$ngram[1:10], 
    twitter.words = ngrams.4.twitter$ngram[1:10]))
blogs.words news.words twitter.words
the end of the for the first time thanks for the follow
at the end of when it comes to x x x x
the rest of the is one of the thank you for the
one of the most the end of the thanks for the rt
is one of the at the end of cant wait to see
at the same time the rest of the is going to be
to be able to one of the most are you going to
in the middle of at the university of for the first time
if you want to one of the best going to be a
for those of you a member of the hope to see you

Since our app will aim to predict any kind of text and not only that for specific purposes (blogging, writing news, etc), we should aggregate all of the ngrams and frequencies.

ngrams.1.all <- full_join(ngrams.1.blogs, ngrams.1.news) %>% 
  full_join(ngrams.1.twitter) %>% 
  group_by(ngram) %>% 
  summarise(freq = sum(freq)) %>% 
  arrange(desc(freq))
## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")
ngrams.2.all <- full_join(ngrams.2.blogs, ngrams.2.news) %>% 
  full_join(ngrams.2.twitter) %>% 
  group_by(ngram) %>% 
  summarise(freq = sum(freq)) %>% 
  arrange(desc(freq))
## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")
ngrams.3.all <- full_join(ngrams.3.blogs, ngrams.3.news) %>% 
  full_join(ngrams.3.twitter) %>% 
  group_by(ngram) %>% 
  summarise(freq = sum(freq)) %>% 
  arrange(desc(freq))
## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")
ngrams.4.all <- full_join(ngrams.4.blogs, ngrams.4.news) %>% 
  full_join(ngrams.4.twitter) %>% 
  group_by(ngram) %>% 
  summarise(freq = sum(freq)) %>% 
  arrange(desc(freq))
## Joining by: c("ngram", "freq")
## Joining by: c("ngram", "freq")

We want to understand how many n-grams we need to make up a certain percentage of the text (coverage). For this I’ve created the following function:

coverPercentage <- function(d, p, percentage = FALSE) {
  start <- Sys.time()
  
  l <- dim(d)[1]
  p <- sum(d[,2])*p
  
  sum <- 0
  
  for (i in 1:l) {
    sum <- sum + d[i,2] 
    if (sum >= p) {
      end <- Sys.time() - start
      print(end)
      
      if (percentage) {
        return(i/l)
      } else {
        return(i) 
      }
    }
  }
}

The number of 1-grams we need to cover 50% of the text is 307.

I have also created a function to plot this:

plotCoverPercentage <- function(d, range = seq(0,1,0.1)) {
  x = sapply(range, coverPercentage, d = d)
  y = range
  qplot(x = x, y = y, geom = "line", ylab = "Percentage of text covered", xlab = "Number of N-grams", main = "Percentage of text covered \n by most frequent N-grams")
}

Let’s plot the coverage curve for 1-grams:

plotCoverPercentage(ngrams.1.all, range = seq(0,1,0.1))
## Time difference of 0 secs
## Time difference of 0.003998995 secs
## Time difference of 0.01200104 secs
## Time difference of 0.03200006 secs
## Time difference of 0.091856 secs
## Time difference of 0.4341888 secs
## Time difference of 0.9336991 secs
## Time difference of 1.926254 secs
## Time difference of 4.65792 secs
## Time difference of 12.60809 secs
## Time difference of 51.18433 secs

Here’s a function to plot the Wordcloud of the top n n-grams by frequency:

plotWordCloud <- function(d, n = 10) {
  d <- d[1:n,]
  pal <- brewer.pal(8,"Accent")
  wordcloud(words = d$ngram, freq = d$freq, random.order = FALSE, random.color = TRUE, colors = pal, scale = c(3,.8))
}

Let’s see the wordcloud for the top 50 1-grams of all the data.

plotWordCloud(ngrams.1.all, n = 100)

Here’s a function to plot a histogram of the top n-grams:

plotFrequency <- function(d, n = 10) {
  require(ggplot2)
  d = d[1:n,]
  ggplot(data = d, aes(x = reorder(ngram, -freq), y = freq)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Ngram")
}

Finally let’s see the wordcloud for the top 10 1-grams, 2-grams, 3-grams and 4-grams of all the data:

plotFrequency(ngrams.1.all)

plotFrequency(ngrams.2.all)

plotFrequency(ngrams.3.all)

plotFrequency(ngrams.4.all)

Interesting findings

There were no surprises, stop words are always the top 10 words. The vast majority of n-grams are used only one time. The dataset is too big.

Prediction algorithm and Shiny app

I will need to scale the algorithm. For this I will have to get rid of all of those n-grams that are used less than X times (maybe 20 times). And to smooth the model I will have to add plus one to the frequency of each n-gram, even those that have never appeared. I will then have to use a markov-chain model with naive-bayes to make the predicitions.