Introduction

This is a milestone report for the Coursera Data Science Capstone Project. The goal of this project is to read a training data set in order to predict and suggest words based on the user's previous input.

In this report, I will demonstrate how the training data from SwiftKey are downloaded and preprocessed for later use. Brief summaries of each file, exploratory analysis, and future implementation plans are also presented.

Preparing Data

First, I downloaded the data from the Coursera website and unzipped it.

Download and Unzip Data Files

if (!file.exists("Coursera-SwiftKey.zip")){
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
                destfile = "Coursera-SwiftKey.zip", method="curl")
  unzip("Coursera-SwiftKey.zip")
}

After extracting the data, I profiled each file, including the number of lines and words. See the summaries below.

Basic Summaries of Files

In this section, I would like to present my exploratory analysis. Before jumping into Natural Language Processing libraries such as RWeka or the tm package, I used a simple regex to split tokens.
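
The summaries below were computed with a small helper along these lines. This is a minimal sketch assuming whitespace-regex tokenization; the function name profileFile is my own illustration, not necessarily the exact code I ran.

# Profile one file: line count, word counts (split on a whitespace regex) and character counts
profileFile <- function(path) {
  lines <- readLines(path, skipNul = TRUE, encoding = "UTF-8")
  words <- sapply(strsplit(lines, "\\s+"), length)
  chars <- nchar(lines)
  data.frame(Lines = length(lines),
             TotalWords = sum(words),
             SizeMB = file.info(path)$size / 1024^2,
             MinWord = min(words), MedianWord = median(words),
             MeanWord = mean(words), MaxWord = max(words),
             MinCh = min(chars), MedianCh = median(chars),
             MeanCh = mean(chars), MaxCh = max(chars))
}

profileFile('final/en_US/en_US.blogs.txt')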

Blogs

   Lines  TotalWords  SizeMB
  899288    39120549  200.42

  Words per line:       min 1   median 29    mean 42.50    max 6851
  Characters per line:  min 1   median 156   mean 230.00   max 40830

News

    Lines  TotalWords  SizeMB
  1010242    36721087  196.28

  Words per line:       min 1   median 32    mean 35.35    max 1928
  Characters per line:  min 1   median 185   mean 201.20   max 11380

Twitter

    Lines  TotalWords  SizeMB
  2360148    32793443  159.36

  Words per line:       min 1   median 12   mean 12.89   max 46
  Characters per line:  min 2   median 64   mean 68.68   max 140

Profanity Words

In order to filter profanity words, I selected a list from Front Gate Media's blog article because of its large number of words. The profanity file also has to be cleaned because it contains website information and commas within cells. The code below extracts the list as a character vector.

if (!file.exists("Terms-to-Block.csv")){
  download.file("http://www.frontgatemedia.com/new/wp-content/uploads/2014/03/Terms-to-Block.csv", 
                destfile = "Terms-to-Block.csv", method="curl")  
}
# Drop the first three preamble rows, take the second column, and strip commas from the terms
profanity <- gsub(",", "", read.csv("Terms-to-Block.csv")[-(1:3), 2])

Here is the number of profanity words:

## [1] 723

Cleaning and Sampling Dataset

In this section, I run a script to generate sample/training sets from the original data. The function createTrainingSet generates 5% and 0.5% samples of the data for later use in this report.

createTrainingSet <- function(src_path, target_path, filter_words, ratio) {
  lines <- readLines(src_path, skipNul = TRUE, encoding = "UTF-8")
  total <- length(lines)
  # Combine the profanity terms into a single alternation regex
  filter_regex <- paste(filter_words, collapse = "|")
  # Randomly sample the requested fraction of lines, then strip profanity
  lines <- sample(lines, total * ratio)
  lines <- gsub(filter_regex, "", lines)
  write(lines, file = target_path, sep = "\t")
  rm(lines)
}

# Sample Set: 5%
if (!file.exists("data/blog.csv")){
  createTrainingSet('final/en_US/en_US.blogs.txt','data/blog.csv',profanity,0.05)
  createTrainingSet('final/en_US/en_US.news.txt','data/news.csv',profanity,0.05)
  createTrainingSet('final/en_US/en_US.twitter.txt','data/twitter.csv',profanity,0.05)
  
  # Second Sample Set: 0.5%
  # Create Small Set for Wordcloud in Interesting Finding Section
  createTrainingSet('final/en_US/en_US.blogs.txt','data/blog_sm.csv',profanity,0.005)
  createTrainingSet('final/en_US/en_US.news.txt','data/news_sm.csv',profanity,0.005)
  createTrainingSet('final/en_US/en_US.twitter.txt','data/twitter_sm.csv',profanity,0.005)
}

Features of Data

I obtained clean and handy sample datasets for blogs, news, and Twitter. Below are basic tokenizers for unigrams, bigrams, and trigrams so that we can build a corpus for each dataset.

library(tm)
library(RWeka)

options(mc.cores=1)
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

Next, let's build a function to read the data files and generate a histogram for each n-gram.
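
The chunk that generates these histograms is not shown here; a minimal sketch of such a function, assuming base-R barplot and the tokenizers defined above (the name plotNgramHist is my own), could look like this:

# Read a sample file, tokenize it with the given n-gram tokenizer,
# and plot the most frequent terms as a bar chart
plotNgramHist <- function(path, tokenizer, top = 20, title = "") {
  text <- readLines(path, skipNul = TRUE, encoding = "UTF-8")
  tokens <- tokenizer(paste(text, collapse = " "))
  freq <- sort(table(tokens), decreasing = TRUE)[1:top]
  barplot(freq, las = 2, cex.names = 0.7, main = title)
}

# Example: top 20 bigrams in the small Twitter sample
plotNgramHist('data/twitter_sm.csv', BigramTokenizer, title = "Twitter bigrams")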

After displaying the histograms, a Corpus object is built for each dataset, and a summary of the trigram Term Document Matrix is shown below.

Note: TermDocumentMatrix errored in the Rmd generator, so I am pasting the same results from my console.
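
For reference, a term-document matrix like the ones summarized below can be built roughly as follows. This is a sketch using the 5% samples created above; the exact preprocessing I ran in the console may differ.

# Build a corpus from a sample file and a trigram term-document matrix from it
buildTrigramTDM <- function(path) {
  lines <- readLines(path, skipNul = TRUE, encoding = "UTF-8")
  cps <- Corpus(VectorSource(lines))
  cps <- tm_map(cps, removePunctuation)
  TermDocumentMatrix(cps, control = list(tokenize = TrigramTokenizer))
}

tdm_blog <- buildTrigramTDM('data/blog.csv')
tdm_blog   # printing it shows non-/sparse entries, sparsity and maximal term length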

Blog

Term Document Matrix

Non-/sparse entries: 103264/189682686
Sparsity           : 100%
Maximal term length: 143

News

Term Document Matrix

Non-/sparse entries: 101329/221472110
Sparsity           : 100%
Maximal term length: 54

Twitter

Term Document Matrix

Non-/sparse entries: 86865/444600595
Sparsity           : 100%
Maximal term length: 70

Interesting Findings

A word cloud is a visualization technique that emphasizes frequently occurring words with larger font sizes. It is another intuitive way to interpret how words are used in context.
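
The word clouds shown under the headings below were drawn from the 0.5% samples. Here is a minimal sketch assuming the wordcloud package; the package choice and parameters are my assumption.

library(wordcloud)

# Draw a word cloud from one of the small sample files
drawCloud <- function(path) {
  lines <- readLines(path, skipNul = TRUE, encoding = "UTF-8")
  cps <- Corpus(VectorSource(lines))
  cps <- tm_map(cps, removePunctuation)
  cps <- tm_map(cps, function(x) removeWords(x, stopwords("english")))
  wordcloud(cps, max.words = 100, random.order = FALSE)
}

drawCloud('data/twitter_sm.csv')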

As you can see below, the news corpus repeatedly uses the word "said", while Twitter scatters across a wide range of vocabulary. In particular, I see positive emotional words such as "good", "love", and "like".

Blog

News

Twitter

Future Plans and Final Product

In production, 70% of the original dataset will be used. After building a Corpus and Term Document Matrix from those files, I will check associated words using the NGramTokenizer defined above.

Here is a demo of the top 10 words associated with "today" in the Twitter corpus.

Note: The TDM crashes in the Rmd format, so I am pasting the result after this code segment.

path <- 'data/twitter_sm.csv'
# Build a corpus from the small Twitter sample, then strip punctuation and stop words
cps <- Corpus(DataframeSource(read.csv(path, sep = '\t')))
cps <- tm_map(cps, removePunctuation)
cps <- tm_map(cps, function(x) removeWords(x, stopwords("english")))
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm <- TermDocumentMatrix(cps, control = list(tokenize = UnigramTokenizer))
# Words correlated with "today" at 0.5 or above
head(findAssocs(tdm, "today", .5), 10)
       today
day    0.95
like   0.95
dont   0.94
love   0.94
now    0.94
will   0.94
call   0.93
even   0.93
going  0.93
great  0.93

In order to improve the accuracy, the following smoothing techniques will be examined:

  1. Laplace smoothing: simply add one (or a normalized value between 0 and 1) to each count so that unseen combinations do not receive zero probability (a short sketch follows this list).
  2. Removing sparse data: rarely occurring word combinations tend to have zero probability in training, so they will be removed.
  3. Backoff models: e.g., if a trigram is not found, back off to the bigram, assuming it appears with a similar frequency.
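
To make items 1 and 3 concrete, here is a minimal sketch of add-one smoothing and a count-based backoff. The names laplaceProb, predictNext, unigram_counts, bigram_counts and trigram_counts are hypothetical; they assume n-gram counts stored as named vectors (e.g. from table() over the tokenizer output), not my final implementation.

# Add-one (Laplace) smoothing for a bigram "w1 w2"; V is the vocabulary size
laplaceProb <- function(bigram, bigram_counts, unigram_counts, V) {
  w1 <- strsplit(bigram, " ")[[1]][1]
  c_bi  <- if (bigram %in% names(bigram_counts)) bigram_counts[[bigram]] else 0
  c_uni <- if (w1 %in% names(unigram_counts)) unigram_counts[[w1]] else 0
  (c_bi + 1) / (c_uni + V)
}

# Backoff: use the trigram counts if the context "w1 w2" was observed,
# otherwise fall back to the bigram counts for "w2"
predictNext <- function(trigram_counts, bigram_counts, w1, w2) {
  pre3 <- paste0("^", w1, " ", w2, " ")
  tri <- trigram_counts[grepl(pre3, names(trigram_counts))]
  if (length(tri) > 0) return(sub(pre3, "", names(which.max(tri))))
  pre2 <- paste0("^", w2, " ")
  bi <- bigram_counts[grepl(pre2, names(bigram_counts))]
  if (length(bi) > 0) return(sub(pre2, "", names(which.max(bi))))
  NA_character_
}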

The final product will be uploaded as a Shiny app. I will provide a user interface where the user can type in text, and predicted words will be listed next to the textbox.
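
A minimal skeleton of such a Shiny app could look like the sketch below; predictWords is a placeholder for the n-gram model described above, and all names here are illustrative rather than the final implementation.

library(shiny)

# Placeholder prediction function; the real app will call the n-gram model
predictWords <- function(text) c("the", "to", "a")

ui <- fluidPage(
  textInput("phrase", "Type your text:"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderText({
    paste(predictWords(input$phrase), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)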