This is a milestone report for the Coursera Data Science Capstone Project. The goal of the project is to read a training data set in order to predict suggested words based on previous input.
In this report, I will show how the training data from SwiftKey is downloaded and preprocessed for later use. Brief summaries of each file, exploratory analysis, and future implementation plans are also presented.
First, I downloaded the data from the Coursera website and unzipped it.
if (!file.exists("Coursera-SwiftKey.zip")){
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
destfile = "Coursera-SwiftKey.zip", method="curl")
unzip("Coursera-SwiftKey.zip")
}
After extracting the data, I profiled each file, collecting statistics such as the number of lines and words. See the summaries below.
In this section, I present my exploratory analysis. Before jumping into Natural Language Processing libraries such as RWeka or the tm package, I used a simple regex to split lines into tokens.
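The profiling code is not reproduced in this report; the snippet below is a minimal sketch, assuming a hypothetical helper `profileFile()` and the simple regex splitting just described, of how such per-file summaries can be computed.

```r
# Minimal profiling sketch; profileFile() is a hypothetical helper, not the
# exact code used to produce the tables below.
profileFile <- function(path) {
  lines <- readLines(path, skipNul = TRUE, encoding = "UTF-8")
  # Simple regex tokenization: split each line on whitespace
  words_per_line <- sapply(strsplit(lines, "\\s+"), length)
  chars_per_line <- nchar(lines)
  data.frame(
    Lines      = length(lines),
    TotalWords = sum(words_per_line),
    SizeMB     = round(file.info(path)$size / 1024^2, 2),
    MinWord    = min(words_per_line),
    MedianWord = median(words_per_line),
    MeanWord   = mean(words_per_line),
    MaxWord    = max(words_per_line),
    MinCh      = min(chars_per_line),
    MedianCh   = median(chars_per_line),
    MeanCh     = mean(chars_per_line),
    MaxCh      = max(chars_per_line)
  )
}

profileFile("final/en_US/en_US.blogs.txt")
```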
File-level summaries:

| File | Lines | Total Words | Size (MB) |
|---|---|---|---|
| en_US.blogs.txt | 899288 | 39120549 | 200.42 |
| en_US.news.txt | 1010242 | 36721087 | 196.28 |
| en_US.twitter.txt | 2360148 | 32793443 | 159.36 |

Per-line word and character counts:

| File | Min Words | Median Words | Mean Words | Max Words | Min Chars | Median Chars | Mean Chars | Max Chars |
|---|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 1 | 29 | 42.50 | 6851 | 1 | 156 | 230.00 | 40830 |
| en_US.news.txt | 1 | 32 | 35.35 | 1928 | 1 | 185 | 201.20 | 11380 |
| en_US.twitter.txt | 1 | 12 | 12.89 | 46 | 2 | 64 | 68.68 | 140 |
To filter profanity, I selected a word list from a Front Gate Media blog article because of its large number of entries. The profanity file itself has to be cleaned as well, because it contains website information and commas within cells. The code below extracts the list as a character vector.
if (!file.exists("Terms-to-Block.csv")){
download.file("http://www.frontgatemedia.com/new/wp-content/uploads/2014/03/Terms-to-Block.csv",
destfile = "Terms-to-Block.csv", method="curl")
}
profanity<-gsub(",","",read.csv("Terms-to-Block.csv")[-(1:3),2])
Here is the number of profanity words:
## [1] 723
In this section, I run a script to generate sample/training sets from the original data. The function `createTrainingSet` samples 5% and 0.5% of the data for later use in this report.
createTrainingSet <- function(src_path, target_path, filter_words, ratio){
  lines <- readLines(src_path, skipNul = TRUE, encoding = "UTF-8")
  total <- length(lines)
  # Combine the profanity list into a single alternation regex
  filter_regex <- paste(filter_words, collapse = "|")
  # Randomly sample the requested fraction of lines
  lines <- sample(lines, total * ratio)
  # Strip any profanity matches from the sampled lines
  lines <- gsub(filter_regex, "", lines)
  write(lines, file = target_path, sep = "\t")
  rm(lines)
}
# Sample Set: 5%
if (!file.exists("data/blog.csv")){
  # Make sure the output directory exists
  if (!dir.exists("data")) dir.create("data")
  createTrainingSet('final/en_US/en_US.blogs.txt','data/blog.csv',profanity,0.05)
  createTrainingSet('final/en_US/en_US.news.txt','data/news.csv',profanity,0.05)
  createTrainingSet('final/en_US/en_US.twitter.txt','data/twitter.csv',profanity,0.05)
  # Second Sample Set: 0.5%
  # Create Small Set for Wordcloud in Interesting Finding Section
  createTrainingSet('final/en_US/en_US.blogs.txt','data/blog_sm.csv',profanity,0.005)
  createTrainingSet('final/en_US/en_US.news.txt','data/news_sm.csv',profanity,0.005)
  createTrainingSet('final/en_US/en_US.twitter.txt','data/twitter_sm.csv',profanity,0.005)
}
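One caveat: `createTrainingSet` draws a random `sample()`, so the generated files differ from run to run. A minimal sketch, assuming reproducible samples are wanted (the seed value below is an arbitrary choice of mine), is to fix the seed before generating a sample:

```r
# Arbitrary seed, chosen only for illustration; fixing it makes the sampled files reproducible
set.seed(12345)
createTrainingSet('final/en_US/en_US.twitter.txt', 'data/twitter_sm.csv', profanity, 0.005)
```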
I now have clean, manageable sample data sets for the blog, news, and Twitter text. Below are basic tokenizers for unigrams, bigrams, and trigrams, which we can use to build a corpus for each data set.
library(tm)
library(RWeka)
# Restrict tm to a single core; parallel processing is known to conflict with RWeka/rJava
options(mc.cores=1)
# n-gram tokenizers used when building Term Document Matrices
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
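As a quick illustration on a toy sentence of my own (not taken from the corpus), the bigram tokenizer splits text into overlapping two-word sequences:

```r
# Toy example; the expected tokens are roughly "thanks for", "for the", "the follow"
BigramTokenizer("thanks for the follow")
```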
Next, let’s build a function that reads the sampled data files and generates a histogram for each n-gram.
After displaying the histograms, a Corpus object is built, and the Term Document Matrix of trigrams is shown below.
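That helper is not reproduced here; the sketch below, assuming a hypothetical function name `plotNGramHistogram` and the sample files generated earlier, shows the general shape of the step: build a corpus, compute a Term Document Matrix with one of the tokenizers above, and plot the most frequent terms.

```r
# Hypothetical sketch of the per-file n-gram histogram step
plotNGramHistogram <- function(path, tokenizer, top_n = 20) {
  cps <- Corpus(VectorSource(readLines(path, skipNul = TRUE)))
  cps <- tm_map(cps, removePunctuation)
  tdm <- TermDocumentMatrix(cps, control = list(tokenize = tokenizer))
  # Sum term frequencies across documents and keep the most frequent terms
  freq <- sort(slam::row_sums(tdm), decreasing = TRUE)[1:top_n]
  barplot(freq, las = 2, cex.names = 0.7,
          main = paste("Top", top_n, "n-grams in", basename(path)))
}

plotNGramHistogram('data/twitter_sm.csv', TrigramTokenizer)
```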
Note: TermDocumentMatrix errored in the Rmd generator, so I am pasting the same result from my console.
Term Document Matrix
Non-/sparse entries: 103264/189682686
Sparsity : 100%
Maximal term length: 143
Term Document Matrix
Non-/sparse entries: 101329/221472110
Sparsity : 100%
Maximal term length: 54
Term Document Matrix
Non-/sparse entries: 86865/444600595
Sparsity : 100%
Maximal term length: 70
A wordcloud is a visualization technique that scales the font size of words by how frequently they occur. It is another intuitive way to interpret how words are used in context.
As you can see below, the news corpus repeatedly uses the word “said”, while Twitter scatters across a wide range of vocabulary. In particular, I see positive emotional words such as “good”, “love”, and “like”.
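The wordclouds are built from the 0.5% samples. A minimal sketch, assuming the `wordcloud` package and the small Twitter sample from the previous section, looks like this:

```r
library(wordcloud)

# Unigram frequencies from the small Twitter sample, drawn as a wordcloud
lines <- readLines('data/twitter_sm.csv', skipNul = TRUE)
cps   <- Corpus(VectorSource(lines))
cps   <- tm_map(cps, content_transformer(tolower))
cps   <- tm_map(cps, removePunctuation)
cps   <- tm_map(cps, removeWords, stopwords("english"))
tdm   <- TermDocumentMatrix(cps)
freq  <- sort(slam::row_sums(tdm), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)
```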
In production, 70% of the original data set will be used. After building the Corpus and Term Document Matrix from those files, I will check associated words using the NGramTokenizer above.
Here is a demo of the top 10 words associated with “today” in the Twitter corpus.
Note: the Term Document Matrix step crashes in the Rmd format, so I am pasting the result after this segment.
path <- 'data/twitter_sm.csv'
# Build a corpus from the small Twitter sample
cps <- Corpus(DataframeSource(read.csv(path, sep='\t')))
# Strip punctuation and common English stopwords
cps <- tm_map(cps, removePunctuation)
cps <- tm_map(cps, function(x) removeWords(x, stopwords("english")))
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
# Unigram Term Document Matrix; findAssocs lists terms correlated with "today" at r >= 0.5
tdm <- TermDocumentMatrix(cps, control = list(tokenize = UnigramTokenizer))
head(findAssocs(tdm, "today", .5), 10)
today
day 0.95
like 0.95
dont 0.94
love 0.94
now 0.94
will 0.94
call 0.93
even 0.93
going 0.93
great 0.93
To improve prediction accuracy, smoothing techniques for the n-gram model will be examined.
The final product will be deployed as a Shiny app. I will provide a user interface where the user can type in text, and predicted words will be listed next to the text box.
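A minimal sketch of that interface, assuming a placeholder `predictNextWords()` function standing in for the eventual n-gram model, might look like this:

```r
library(shiny)

# Placeholder for the eventual n-gram prediction model
predictNextWords <- function(text) c("the", "to", "and")

ui <- fluidPage(
  textInput("phrase", "Type your text:"),
  # Predicted words are listed next to the text box
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    predictNextWords(input$phrase)
  })
}

shinyApp(ui = ui, server = server)
```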