Birgit Kiesewetter, August 2016
This is the Milestone Report for the Data Science Capstone Project. The goal of the project is to build a predictive text model that suggests 3 candidate next words, similar to the keyboard of SwiftKey, the corporate partner for this Capstone Project. For example, if the user types “I went to the” on a smartphone, the model should present 3 suggestions for the word the user most likely wants to type next; in this case these could be gym, restaurant and store.
The data sets to build the model on are available in the archive file Coursera-SwiftKey.zip (size: 550 MB).
The archive contains, for each of 4 languages (American English, Finnish, Russian and German), 3 files from different sources (Twitter, blogs and news). The data analysis below focuses only on the American English files found in final/en_US: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
# libraries needed
library(ggplot2); library(stringi);
library(stringr); library(caTools);
library(tm); library(RWeka); library(plyr);
# this dir includes the relevant files on my desktop
home <- file.path("C:", "Users/Birgit/Desktop/Swift")
setwd(home)
# loading
twitter_US <- readLines("./OriginalUS/en_US.twitter.txt", encoding="UTF-8")
blogs_US <- readLines("./OriginalUS/en_US.blogs.txt", encoding="UTF-8")
news_US <- readLines("./OriginalUS/en_US.news.txt", encoding="UTF-8")
The line counts, word counts and file sizes of the three files are:
en_US.twitter.txt: 2360148 lines / 30353372 words / size 159 MB
en_US.blogs.txt: 899288 lines / 37334131 words / size 200 MB
en_US.news.txt: 77259 lines / 2643969 words / size 196 MB
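These figures can be reproduced with stringi and file.info(), for example as in the sketch below; file_stats() is only an illustrative helper, not part of the original analysis code.
# sketch: line count, word count and file size (in MB) for one source
file_stats <- function(x, path) {
  c(lines = length(x),
    words = sum(stri_count_words(x)),
    size_MB = round(file.info(path)$size / 1024^2))
}
file_stats(twitter_US, "./OriginalUS/en_US.twitter.txt")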
Because the line and word counts differ, we sample 2% of the twitter lines, 3% of the blog lines and 5% of the news lines to obtain a balanced combination of all three sources. From each sample, roughly 10% is split off as a test set, and 10% of the remaining training data is split off as a dev set. A seed is set to keep the random sampling reproducible:
#getting samples of 2% of twitter, 3% blogs and 5% news
set.seed(1555) # for reproducibility
sample_twitter <- twitter_US[rbinom(length(twitter_US), 1, 0.02) == 1]
sample_blogs<- blogs_US[rbinom(length(blogs_US), 1, 0.03) == 1]
sample_news <- news_US[rbinom(length(news_US), 1, 0.05) == 1]
#10% for test
set.seed(1001)
#twitter
sample = sample.split(sample_twitter, SplitRatio = .9)
twittertrain = subset(sample_twitter , sample == TRUE)
twittertest = subset(sample_twitter, sample == FALSE)
#blogs
sample = sample.split(sample_blogs, SplitRatio = .9)
blogstrain = subset(sample_blogs , sample == TRUE)
blogstest = subset(sample_blogs, sample == FALSE)
#news
sample = sample.split(sample_news, SplitRatio = .9)
newstrain = subset(sample_news , sample == TRUE)
newstest = subset(sample_news, sample == FALSE)
#10% of remaining train for dev
#twitter
sample = sample.split(twittertrain, SplitRatio = .9)
twitterdev = subset(twittertrain, sample == FALSE)   # take the dev split before the train set is overwritten
twittertrain = subset(twittertrain, sample == TRUE)
#blogs
sample = sample.split(blogstrain, SplitRatio = .9)
blogsdev = subset(blogstrain, sample == FALSE)
blogstrain = subset(blogstrain, sample == TRUE)
#news
sample = sample.split(newstrain, SplitRatio = .9)
newsdev = subset(newstrain, sample == FALSE)
newstrain = subset(newstrain, sample == TRUE)
Below are the line counts of the four resulting data sets per source: sample, train, dev and test.
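A summary like the one shown below can be assembled from the objects created above, for example as in this sketch:
# sketch: number of lines per source and data set
data.frame(source = c("twitter", "blogs", "news"),
           sample = c(length(sample_twitter), length(sample_blogs), length(sample_news)),
           train  = c(length(twittertrain), length(blogstrain), length(newstrain)),
           dev    = c(length(twitterdev), length(blogsdev), length(newsdev)),
           test   = c(length(twittertest), length(blogstest), length(newstest)))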
## source sample train dev test
## 1 twitter 46900 37989 4221 4690
## 2 blogs 27128 21973 2442 2713
## 3 news 3880 3142 350 388
From now on, only the train data set is used to analyse the data and, later on, to build the predictive algorithm. The dev and test sets are held out for cross-validation and final testing on unseen data.
The code below performs the following cleaning steps on the corpus: numbers are removed, non-printable characters are replaced by spaces, the text is converted to ASCII, punctuation is removed, all text is converted to lower case, profanity is filtered out using the swear-word list from bannedwordlist.com, and redundant whitespace is stripped.
# reading swear words from http://www.bannedwordlist.com/lists/swearWords.txt
toberemoved <- readLines("./other/swearWords.txt", encoding="UTF-8")
# helper: replace non-printable characters with a space
removeGraphs <- function(x) {
  str_replace_all(x, "[^[:graph:]]", " ")
}
# cleaning function for the corpus; non-tm functions are wrapped in
# content_transformer() so the documents keep their PlainTextDocument class
cleanup <- function(x) {
  clean <- tm_map(x, removeNumbers)
  clean <- tm_map(clean, content_transformer(removeGraphs))
  clean <- tm_map(clean, content_transformer(iconv), to="ASCII", sub=" ")
  clean <- tm_map(clean, removePunctuation)
  clean <- tm_map(clean, content_transformer(tolower))
  clean <- tm_map(clean, removeWords, toberemoved) # profanity filtering
  clean <- tm_map(clean, stripWhitespace)
  clean
}
After sampling, the training data was saved into the directory train/formodel; the corpus is built from these files only. The final training corpus, called “clean”, consists of the 3 documents and has a size of 11.8 MB.
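The saving step itself is not shown above; it can be done along the following lines (a sketch with assumed file names):
# sketch: write the three training samples to train/formodel for corpus creation
dir.create("./train/formodel", recursive = TRUE, showWarnings = FALSE)
writeLines(twittertrain, "./train/formodel/twitter_train.txt")
writeLines(blogstrain, "./train/formodel/blogs_train.txt")
writeLines(newstrain, "./train/formodel/news_train.txt")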
#creating train corpus
data <- VCorpus(DirSource("./train/formodel", encoding = "UTF-8"),
readerControl = list(language = "en")) # 11.8 Mb
# cleaning train corpus
clean <- cleanup(data) # see function above
# document-term matrix and word frequencies, sorted in descending order
word_matrix <- DocumentTermMatrix(clean, control=list(wordLengths=c(1,Inf)))
wordFreq <- colSums(as.matrix(word_matrix))
wf <- data.frame(word = names(wordFreq), freq = wordFreq)
wf <- wf[order(-wf$freq), ]
## word freq
## 1 the 66624
## 2 to 41902
## 3 and 36254
## 4 a 34255
## 5 i 30970
## 6 of 29597
The frequency analysis and the histogram show that a few words have a very high frequency:
the = 66624, to = 41902, and = 36254.
Of the 69,915 unique words, 38,819 appear only once and 60,582 have a frequency below 10.
Just 120 unique words are needed to cover 50% of all word occurrences in this set; 6,921 unique words are needed to cover 90%.
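These coverage figures can be computed from the sorted frequency table wf created above, for example as in this sketch:
# cumulative share of all word occurrences covered by the most frequent words
coverage <- cumsum(wf$freq) / sum(wf$freq)
min(which(coverage >= 0.5))  # unique words needed to cover 50%
min(which(coverage >= 0.9))  # unique words needed to cover 90%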
From the corpus, bigrams (2-word sequences) and trigrams (3-word sequences) are now generated using the NGramTokenizer from the RWeka package.
# helper: turn a TermDocumentMatrix into a data frame of term frequencies, sorted descending
processing <- function(x){
  token <- rowSums(as.matrix(x))
  token <- data.frame(term = names(token), freq = token)
  token[order(-token$freq), ]
}
options(mc.cores=1) # avoid parallel tokenization problems between RWeka (rJava) and tm
# Bigrams
BigramToken <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bitoken <- TermDocumentMatrix(clean, control = list(tokenize = BigramToken))
bf <- processing(bitoken)
#Trigrams
TrigramToken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tritoken <- TermDocumentMatrix(clean, control = list(tokenize = TrigramToken))
tf <- processing(tritoken)
Looking at the bigram frequencies below, a few terms occur very often, e.g. ‘of the’ and ‘in the’ with more than 5000 occurrences each. Further analysis shows that there are around 60000 unique bigram phrases in total, of which around 45000 (79%) occur only once.
The trigram distribution gives a similar picture: phrases such as “one of the” and “a lot of” appear more than 400 times, while of the roughly 1000000 unique trigrams around 990000 (92%) occur only once.
I also analysed fourgrams, where 97% of all phrases have a frequency of 1. The corresponding plot is not included in this report.
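The six most frequent bigrams and trigrams shown below can be printed along these lines (a sketch; the exact display code is not part of this report):
# six most frequent bigrams and trigrams; row names reset for readability
head(data.frame(bf, row.names = NULL), 6)
head(data.frame(tf, row.names = NULL), 6)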
## term freq
## 1 of the 6108
## 2 in the 5579
## 3 to the 3120
## 4 on the 2849
## 5 for the 2736
## 6 to be 2528
## term freq
## 1 one of the 476
## 2 a lot of 437
## 3 thanks for the 365
## 4 i want to 287
## 5 going to be 256
## 6 to be a 250
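The singleton shares quoted above can be checked directly on the frequency tables, for example with this sketch:
# share of bigrams / trigrams that occur exactly once
sum(bf$freq == 1) / nrow(bf)
sum(tf$freq == 1) / nrow(tf)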
Based on the above analysis of the word and phrase frequency distributions, my first model approach will use only the trigrams, the bigrams and, as a last resort, the 3 most common words: the, to, and.
These are combined using a “Stupid Backoff” model. According to the literature, this simple and inexpensive model seems to give quite good prediction results.
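A minimal sketch of the planned lookup is given below; predictNext() is only an illustrative helper (not the final implementation) and assumes the frequency tables wf, bf and tf created above, each sorted by decreasing frequency.
# sketch of the planned prediction: trigram match -> back off to bigram -> top unigrams
predictNext <- function(phrase, n = 3) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  if (length(words) == 2) {
    # trigrams whose first two words match the last two typed words
    hits <- tf[grepl(paste0("^", words[1], " ", words[2], " "), tf$term), ]
    if (nrow(hits) > 0)
      return(head(sapply(strsplit(as.character(hits$term), " "), `[`, 3), n))
  }
  # back off: bigrams starting with the last typed word
  last <- tail(words, 1)
  hits <- bf[grepl(paste0("^", last, " "), bf$term), ]
  if (nrow(hits) > 0)
    return(head(sapply(strsplit(as.character(hits$term), " "), `[`, 2), n))
  # final fallback: the most common unigrams
  head(as.character(wf$word), n)
}
predictNext("I went to the")   # returns the three most frequent continuations found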
Testing still needs to be done on whether a better way of handling “unseen” phrases than simply backing off to a lower-order n-gram can be implemented; smoothing and/or interpolation methods could be considered here.
The final Shiny application will have a user-friendly interface where the user can type a phrase and is shown 3 suggested words, which can be appended to the text by clicking on them.
References: