1 Executive Summary

The goal of this report is to show that I have become comfortable working with the SwiftKey, Inc. data and that I am on track to create the prediction algorithm. I will explain my exploratory analysis and my goals for the eventual app and algorithm.

The objectives of this project are to: (1) demonstrate that the data have been downloaded and successfully loaded in, (2) create a basic report of summary statistics about the data sets, (3) report any interesting findings amassed so far, and (4) get feedback on my plans for creating a prediction algorithm and Shiny app.

2 Data Acquisition

I acquired the data from the given URL as Coursera-SwiftKey.zip and unpacked it into the data directory with R's unzip() function, which opens the archive: unzip("./data/Coursera-SwiftKey.zip").
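
For completeness, here is a minimal sketch of the acquisition step. The download URL below is only a placeholder (the course-provided link is not reproduced here), and the sketch assumes the archive unpacks into the final/ directory used later in this report.

# Sketch of data acquisition (the URL below is a placeholder, not the real link).
if(!file.exists("./data/Coursera-SwiftKey.zip")) {
  dir.create("./data", showWarnings = FALSE)
  download.file("https://example.com/Coursera-SwiftKey.zip",
                destfile = "./data/Coursera-SwiftKey.zip", mode = "wb")
}
# Unpack the archive into the data directory (creates ./data/final/en_US/...).
unzip("./data/Coursera-SwiftKey.zip", exdir = "./data")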

# Remove Objects from a Specified Environment.
rm(list = ls())

3 Reading the SwiftKey Data

I use readLines() with the encoding = "UTF-8" and skipNul = TRUE parameters to read the en_US data sets, i.e., en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

enBlogs   <- readLines("./data/final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
enNews    <- readLines("./data/final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
enTwitter <- readLines("./data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

4 Cleaning the SwiftKey Data

After loading the data, I cleaned the acquired text. For this work, I used the stringi R library.

Note: Upon examination, some non-English characters need to be removed to make the data more usable.

# working with stringi:
# http://www.rexamine.com/resources/stringi/
# Author: Marek Gagolewski, gagolews@rexamine.com/
library(stringi)
## Warning: package 'stringi' was built under R version 3.1.3
# Remove non-English characters to make the data more usable:
# drop characters that cannot be converted to UTF-8.
blogs   <- iconv(enBlogs,   from = "latin1", to = "UTF-8", sub = "")
news    <- iconv(enNews,    from = "latin1", to = "UTF-8", sub = "")
twitter <- iconv(enTwitter, from = "latin1", to = "UTF-8", sub = "")

# Normalize curly quotes and backticks to straight ASCII quotes.
twitter <- stri_replace_all_regex(twitter, "\u2019|`", "'")
twitter <- stri_replace_all_regex(twitter, "\u201c|\u201d|\u201f|``", '"')

# Save a single object to file.
saveRDS(blogs,   "./enBlogs.rds")
saveRDS(news,    "./enNews.rds")
saveRDS(twitter, "./enTwitter.rds")

# Number of text chunks (lines) in each data set.
data.frame(blogs = length(blogs), news = length(news), twitter = length(twitter))
##    blogs    news twitter
## 1 899288 1010242 2360148

5 Summary Statistics

I have computed summary statistics for the three English files only: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

# Summary Statistics for the 3 files - en_US.blogs/en_US.news/en_US.twitter
blogwords    <- sum(stri_count_words(blogs))
newswords    <- sum(stri_count_words(news))
twitterwords <- sum(stri_count_words(twitter))
# Combine the word counts into a single vector.
words        <- c(blogwords, newswords, twitterwords)

bloglines    <- length(blogs)
newslines    <- length(news)
twitterlines <- length(twitter)
# Combine the line counts into a single vector.
lines        <- c(bloglines, newslines, twitterlines)

blogmaxc     <- max(nchar(blogs))
newsmaxc     <- max(nchar(news))
twittermaxc  <- max(nchar(twitter))
# Combine the maximum character counts into a single vector.
maxchars     <- c(blogmaxc, newsmaxc, twittermaxc)

blogmaxw     <- max(stri_count_words(blogs))
newsmaxw     <- max(stri_count_words(news))
twittermaxw  <- max(stri_count_words(twitter))
# Combine the maximum word counts into a single vector.
maxwords     <- c(blogmaxw, newsmaxw, twittermaxw)

FileSumm     <- data.frame("File Name" = c("en_US.blogs", "en_US.news", "en_US.twitter"),
                           NumberLines = lines,
                           NumberWords = words,
                           MaxChars = maxchars,
                           MaxWords = maxwords)

I also computed some basic descriptive measures of the size of the text.

# Basic descriptive measures of the size of the text.
# working with knitr:
library(knitr)
## Warning: package 'knitr' was built under R version 3.1.3
basic.measures <- cbind(c("Text Chunks", "Characters"),
                  rbind(c(length(blogs),length(news),length(twitter)),
                        c(sum(nchar(blogs)),sum(nchar(news)),sum(nchar(twitter)))))
colnames(basic.measures) <- c("Measure", "Blogs", "News", "Twitter")

kable(basic.measures)
Measure        Blogs        News         Twitter
Text Chunks    899288       1010242      2360148
Characters     208361438    203791405    162385035

6 Data Sampling

Because the files are so large, I will take only a random 5% sample of the data from all three data sets. This will decrease overall computation time and allow for quicker experimentation with the data.

# Sampling the SwiftKey Data.
# Setting the seed for reproducibility.
set.seed(100)

# This function takes a random sample of the data (percent is given as a proportion, e.g. .05).
sampler <- function(chunk, percent) {
  sample.size  <- round(length(chunk) * percent)
  sample.index <- sample(1:length(chunk), sample.size)
  
  return(chunk[sample.index])
}

# Let's start off with a 5% sample of each data set.
# //1-US.blogs//
sampleBlogs <- sampler(blogs, .05)
# Write Lines to a Connection.
writeLines(c(sampleBlogs), "./sampleBlogs.txt")

# //2-US.news//
sampleNews <- sampler(news, .05)
# Write Lines to a Connection.
writeLines(c(sampleNews), "./sampleNews.txt")

# //3-US.twitter//
sampleTwitter <- sampler(twitter, .05)
# Write Lines to a Connection.
writeLines(c(sampleTwitter), "./sampleTwitter.txt")

Note: Let’s take a look at some basic descriptive measures of the size of the sample text.

# Basic descriptive measures of the size of sample text.
# working with knitr:
library(knitr)

basic.measures.1 <- cbind(c("Text Chunks", "Characters"),
                    rbind(c(length(sampleBlogs),length(sampleNews),length(sampleTwitter)),
                          c(sum(nchar(sampleBlogs)),sum(nchar(sampleNews)),sum(nchar(sampleTwitter)))))
colnames(basic.measures.1) <- c("Measure", "Blogs", "News", "Twitter")

kable(basic.measures.1)
Measure        Blogs        News        Twitter
Text Chunks    44964        50512       118007
Characters     10355850     10141823    8124800

7 Profanity Filtering

Now, this sampled data set is a good start, but it likely still contains a few things I can take out. For starters, profanity filtering is a good idea. I will first reformat and combine two profanity lists, and then remove any text chunks that contain a word from the combined list.

Note: The lists used can be found here - 1 & 2.

# Profanity Filtering of SwiftKey Sample Data.
# First list of bad words:------------------------
bad.words.1 <- readLines("./data/Bad_Words_1.txt")
# Drop the last entry of the list.
bad.words.1 <- bad.words.1[-length(bad.words.1)]

# Second list of bad words:-----------------------
bad.words.2 <- readLines("./data/Bad_Words_2.txt")
# Drop the first entry of the list.
bad.words.2 <- bad.words.2[-1]

# Strip the last three characters of each entry (list-specific formatting).
bad.words.2 <- substr(x = bad.words.2, start = 1, stop = nchar(bad.words.2)-3)
# Find the entries that are wrapped in double quotes.
double.quote.index <- grep(pattern = "\"", x = bad.words.2)

# Remove the surrounding double quotes from those entries.
bad.words.2[double.quote.index] <- substr(x = bad.words.2[double.quote.index], start = 2, 
                                          stop = nchar(bad.words.2[double.quote.index])-1)

# Combine the two lists, remove duplicates, and collapse them into one regex pattern.
all.bad.words <- c(bad.words.1, bad.words.2)
all.bad.words <- unique(all.bad.words)
all.bad.words <- paste(all.bad.words, collapse="|")
# Trim the final character of the pattern.
all.bad.words <- substr(x = all.bad.words, start = 1, stop = nchar(all.bad.words)-1)

# Index the sample chunks that contain at least one bad word.
bad.words.blogs   <- grep(all.bad.words, sampleBlogs)
bad.words.news    <- grep(all.bad.words, sampleNews)
bad.words.twitter <- grep(all.bad.words, sampleTwitter)

# Proportion of chunks containing profanity, per data set.
bad.prop.blogs    <- length(bad.words.blogs)/length(sampleBlogs)
bad.prop.news     <- length(bad.words.news)/length(sampleNews)
bad.prop.twitter  <- length(bad.words.twitter)/length(sampleTwitter)

# Remove the flagged chunks from each sample.
sampleBlogs   <- sampleBlogs[-bad.words.blogs]
sampleNews    <- sampleNews[-bad.words.news]
sampleTwitter <- sampleTwitter[-bad.words.twitter]

Note: Now, let’s take a look at the size of the new data set and what proportion of the chunks contained profanities.

# What proportion of the chunks contained profanities?
basic.measures.2 <- cbind(c("Text Chunks", "Characters", "Profanity Proportion"),
                    rbind(c(length(sampleBlogs),length(sampleNews),length(sampleTwitter)),
                          c(sum(nchar(sampleBlogs)),sum(nchar(sampleNews)),sum(nchar(sampleTwitter))),
                          round(c(bad.prop.blogs,bad.prop.news,bad.prop.twitter), 4)))
colnames(basic.measures.2) <- c("Measure", "Blogs", "News", "Twitter")

kable(basic.measures.2)
Measure                Blogs      News       Twitter
Text Chunks            34456      38758      104273
Characters             6025558    6952821    6980216
Profanity Proportion   0.2337     0.2327     0.1164

8 Homogeneity in the Data

With the reduced, clean data set, I can now focus on dealing with numbers and individual sentences. I decided to replace all of the numbers in the data with generic “xnumber” and “xdollars” strings. By trading the unique numbers for standard markers, I have created some homogeneity in the data.

Note: This will make it easier to count and analyze terms right before and after the markers.

The text chunks each contain multiple sentences and distinct word combinations. Simply taking out punctuation and merging the chunks will distort the true order of the words. Ideas are typically separated with periods, commas, semicolons, and colons. I decided to split the data on these to preserve the associations between groups of words and prevent unrelated words from merging.

The following code chunk creates functions that deal with generic numbers and sentence splitting.

Note: These are then combined and applied to the data [output size: ~19.3 MB].

# Creating Homogeneity in the Data.
# Replace numbers and currency with generic "xnumber" and "xdollars" markers.
generic.numbers <- function(chunk) {
  chunk <- gsub(pattern = "[$](([0-9]+)|([,]*))+ +", replacement = "xdollars ", x = chunk)
  chunk <- gsub(pattern = "\\d+", replacement = "xnumber", x = chunk)
  chunk <- gsub(pattern = "xnumber,xnumber", replacement = "xnumber", x = chunk)
  
  return(chunk)
}

## Separate text into distinct sentences based on (. , ; :).
sentence.splitter <- function(chunk) {
  if(grepl("(*)[.]|[,]|[;]|[:] [A-Z](*)", chunk))
    return(strsplit(chunk, "(*)[.]|[,]|[;]|[:] (*)"))
  else return(chunk)
}

## Combining above functions to get cleaner lists of words.
cleaner <- function(chunk) {
  clean.chunks <- vector(mode = "character")
  for(i in 1:length(chunk)) {
    clean.chunks <- c(clean.chunks, sentence.splitter(generic.numbers(chunk[i]))[[1]])
  }
  # Remove leading and trailing whitespace.
  clean.chunks <- gsub("^\\s+|\\s+$", "", clean.chunks)
  
  return(clean.chunks)
}

# Apply the cleaner function to get cleaner lists of words.
enBlogsList   <- cleaner(sampleBlogs);   rm(sampleBlogs);
enNewsList    <- cleaner(sampleNews);    rm(sampleNews);
enTwitterList <- cleaner(sampleTwitter); rm(sampleTwitter);

# Removing empty entries from the list.
enBlogsList     <- enBlogsList[enBlogsList != ""]
enNewsList      <- enNewsList[enNewsList != ""]
enTwitterList   <- enTwitterList[enTwitterList != ""]

# Write the cleaned sample data to disk [output size: ~19.3 MB].
writeLines(c(enBlogsList, enNewsList, enTwitterList), "./clean_sample_data.txt")

9 Computational Linguistics

Now, with this reduced and consistent data structure, I can look at how often groups of words show up. I will focus on frequencies of individual terms, two terms, and three terms. These are known as n-grams and they are a common tool in computational linguistics.

The following two functions create the n-grams and their frequency counts. They are applied to the

  1. blogs data,
  2. news data, and
  3. twitter data

to produce the data used for plotting in the n-gram analysis below.

# Computational Linguistics.
# A function that creates n-grams.
# ================================
# An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
# An n-gram of size 1 is referred to as a "unigram";
# size 2 is a "bigram" (or, less commonly, a "digram");
# size 3 is a "trigram".
n.gram <- function(sentence, n) {
  sent <- strsplit(sentence, split = " ")
  
  if(length(sent[[1]]) < n)
    return()
  ns <- vector(mode = "character", length = length(sent[[1]])-n+1)
  for(i in 1:(length(sent[[1]])-n+1)) {
    ns[i] <- paste((sent[[1]][i:(i+n-1)]), collapse = " ")
  }
  
  return(ns)
}

## n-gram Example.
n.gram(enTwitterList[100], 2)
##  [1] "RT :"          ": FILL"        "FILL IN"       "IN THE"       
##  [5] "THE BLANKS!"   "BLANKS! Right" "Right now"     "now my"       
##  [9] "my finger"     "finger smells" "smells like"   "like Newport" 
## [13] "Newport lol"
## A function returning a table of frequency counts and proportions for n-grams.
n.gram.table <- function(word_list, n) {
  n.gram.medium <- sapply(word_list, n.gram, n = n)
  n.gram.medium <- table(unlist(n.gram.medium))
  
  props <- n.gram.medium/sum(n.gram.medium)
  
  # Build a data frame of n-grams, raw frequencies, and relative proportions.
  n.gram.medium <- data.frame(N.Gram = names(n.gram.medium),
                              Freq   = as.vector(n.gram.medium),
                              Prop   = as.vector(props))
  n.gram.medium <- n.gram.medium[order(-n.gram.medium$Prop),]
  
  #--// Setup for plotting //--
  n.gram.medium$N.Gram <- factor(n.gram.medium$N.Gram, levels = n.gram.medium$N.Gram[order(n.gram.medium$Freq)])
  
  return(n.gram.medium)
}
#--// Setup for blogs n-grams //-
blogs.1.gram <- n.gram.table(enBlogsList, n = 1)
blogs.2.gram <- n.gram.table(enBlogsList, n = 2)
blogs.3.gram <- n.gram.table(enBlogsList, n = 3)
#--// Setup for news n-grams //-
news.1.gram <- n.gram.table(enNewsList, n = 1)
news.2.gram <- n.gram.table(enNewsList, n = 2)
news.3.gram <- n.gram.table(enNewsList, n = 3)
#--// Setup for twitter n-grams //-
twitter.1.gram <- n.gram.table(enTwitterList, n = 1)
twitter.2.gram <- n.gram.table(enTwitterList, n = 2)
twitter.3.gram <- n.gram.table(enTwitterList, n = 3)

10 Plotting of the SwiftKey Data

Finally, I can plot the relative proportions of the n-grams. I chose to plot proportions because they highlight the popularity of the n-grams relative to each other and relative to the whole data set. The top ten most frequent n-grams are displayed.

Note: The twitter data are shown here, but the same plots can easily be produced for the other data sets as well.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
# http://ggplot2.org/
library(RColorBrewer)
# http://colorbrewer2.org/

# A function that plots the top n-grams.
n.gram.plot <- function(n.gram.df, topn, name) {
  custom.pal <- colorRampPalette(brewer.pal(6,"Blues"))(15)
  
  ggplot(n.gram.df[1:topn,], aes(x = N.Gram, y = Prop, fill = Prop)) + 
    geom_bar(stat = "identity") + 
    scale_fill_gradient(low = custom.pal[5], high = custom.pal[10]) + 
    ggtitle(name) + 
    coord_flip() + 
    theme(legend.position = "none")
}
n.gram.plot(twitter.1.gram, 10, "Twitter unigram Relative Proportions.")

n.gram.plot(twitter.2.gram, 10, "Twitter bigram Relative Proportions.")

n.gram.plot(twitter.3.gram, 10, "Twitter trigram Relative Proportions.")

11 Interesting Findings

Looking at the graphs above, we can clearly see that as the “n” in n-gram increases, the absolute proportions fall precipitously. Also, as “n” increases the proportions seem to flatten out quickly. The top two 3-gram entries are actually saying the same thing, but the capital “T” leads to two different counts. This is something to keep in mind going forward; a quick sketch of the fix follows.
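
As a quick illustration of the capitalization issue (the trigrams below are made up, not the actual top entries), lower-casing the text before counting would merge such duplicate n-grams:

# Hypothetical trigrams that differ only in capitalization.
trigrams <- c("Thanks for the", "thanks for the", "thanks for the")
# Counted as-is, the capitalized variant gets its own entry.
table(trigrams)
# Lower-casing first merges them into a single count of three.
table(tolower(trigrams))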

12 Plans for Prediction Algorithm

Good Sense: Clean and reliable data are key to building a good model.

Scrubbing the data meticulously should be the first step in building the algorithm. The code above is a good start to cleaning the data but it still needs some work. Some issues to consider going forward are:

  1. Capital and lowercase letters/terms in n-grams,
  2. Better structures to hold and manipulate the data,
  3. Dealing with currency, numbers, times, and dates correctly, and
  4. Spellchecking and handling misspellings.

Note: I will also explore four-word combinations (4-grams), which may make the predictions even more accurate at that level; see the sketch below.
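
Because n.gram.table() is already parameterized by n, extending the analysis to 4-grams should only require changing that argument. The lines below are illustrative and were not run for this report.

# Sketch: 4-gram frequency tables, reusing the functions defined above.
blogs.4.gram   <- n.gram.table(enBlogsList,   n = 4)
news.4.gram    <- n.gram.table(enNewsList,    n = 4)
twitter.4.gram <- n.gram.table(enTwitterList, n = 4)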

I am exploring several options for the model’s structure. A simple starting point is to predict using the last word from observed n-grams. Several predictions could be ranked in order of their observed past frequencies. The most frequently observed last word from the n-gram will be the first prediction. This model is pretty simple when sticking to n-grams of a particular size, like two.
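
As a rough sketch of this ranking idea (not the final implementation), the bigram tables built above could be queried for the most frequent continuations of a given word. The function name predict.next.word and the example call are illustrative assumptions.

# Sketch: rank candidate next words by observed bigram frequency.
# 'n.gram.df' is a data frame like twitter.2.gram, with N.Gram and Freq columns.
predict.next.word <- function(n.gram.df, previous.word, topn = 3) {
  # Keep only bigrams that start with the previous word.
  candidates <- n.gram.df[grepl(paste0("^", previous.word, " "), n.gram.df$N.Gram), ]
  # Rank by observed frequency and return the last word of the top matches.
  candidates <- candidates[order(-candidates$Freq), ]
  sapply(strsplit(as.character(head(candidates$N.Gram, topn)), " "), tail, 1)
}

# Example usage (illustrative): top three candidates to follow "thank".
# predict.next.word(twitter.2.gram, "thank")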

However, larger n-grams still need to be considered along with their interactions with smaller n-grams. Overall differences between the text sources need to be analyzed before the data are lumped together into the model. These include differences in style, sentence structure, term frequencies, spelling, and n-gram distributions. This analysis is beyond the scope of this milestone report, but it will be considered during subsequent steps in the project.

Created by: Prabhat Kumar, 18-March-2016.

Thanks.