Birgit Kiesewetter, August 2016
This is the Milestone Report for the Data Science Capstone Project. The goal of the project is to build a predictive text model that suggests 3 candidate next words, similar to the keyboard of SwiftKey, the corporate partner for this Capstone Project. For example, if the user types “I went to the” on a smartphone, the model should present 3 suggestions for the word the user most likely wants to type next; in this case these could be gym, restaurant and store.
The data sets to build the model on are available in the archive file Coursera-SwiftKey.zip (size: 550 MB).
The archive contains, for each of 4 languages (American English, Finnish, Russian and German), 3 files from different sources (Twitter, blogs and news). The data analysis below focuses only on the American English files found in final/en_US: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
# libraries needed
library(ggplot2); library(stringi);
library(stringr); library(caTools);
library(tm); library(RWeka); library(plyr);
# this dir includes the relevant files on my desktop
home <- file.path("C:", "Users/Birgit/Desktop/Swift")
setwd(home)
# loading
twitter_US <- readLines("./OriginalUS/en_US.twitter.txt", encoding="UTF-8")
blogs_US <- readLines("./OriginalUS/en_US.blogs.txt", encoding="UTF-8")
news_US <- readLines("./OriginalUS/en_US.news.txt", encoding="UTF-8")
The line counts, word counts and file sizes of the three files are:
en_US.twitter.txt: 2360148 lines / 30353372 words / size 159 MB
en_US.blogs.txt: 899288 lines / 37334131 words / size 200 MB
en_US.news.txt: 77259 lines / 2643969 words / size 196 MB
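These figures can be reproduced with stringi and file.info(), for example as in the sketch below; file_stats() is only an illustrative helper, not part of the original analysis code.
# sketch: line count, word count and file size (in MB) for one source
file_stats <- function(x, path) {
  c(lines = length(x),
    words = sum(stri_count_words(x)),
    size_MB = round(file.info(path)$size / 1024^2))
}
file_stats(twitter_US, "./OriginalUS/en_US.twitter.txt")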
Because the line and word counts differ, we sample 2% of the twitter lines, 3% of the blog lines and 5% of the news lines to obtain a balanced combination of all three sources. From each sample, roughly 10% is split off as a test set, and 10% of the remaining training data is split off as a dev set. A seed is set to keep the random sampling reproducible:
#getting samples of 2% of twitter, 3% blogs and 5% news
set.seed(1555) # for reproducibility
sample_twitter <- twitter_US[rbinom(length(twitter_US), 1, 0.02) == 1]
sample_blogs<- blogs_US[rbinom(length(blogs_US), 1, 0.03) == 1]
sample_news <- news_US[rbinom(length(news_US), 1, 0.05) == 1]
#10% for test
set.seed(1001)
#twitter
sample = sample.split(sample_twitter, SplitRatio = .9)
twittertrain = subset(sample_twitter , sample == TRUE)
twittertest = subset(sample_twitter, sample == FALSE)
#blogs
sample = sample.split(sample_blogs, SplitRatio = .9)
blogstrain = subset(sample_blogs , sample == TRUE)
blogstest = subset(sample_blogs, sample == FALSE)
#news
sample = sample.split(sample_news, SplitRatio = .9)
newstrain = subset(sample_news , sample == TRUE)
newstest = subset(sample_news, sample == FALSE)
#10% of remaining train for dev
#twitter
sample = sample.split(twittertrain, SplitRatio = .9)
twitterdev = subset(twittertrain, sample == FALSE)   # take the dev split before the train set is overwritten
twittertrain = subset(twittertrain, sample == TRUE)
#blogs
sample = sample.split(blogstrain, SplitRatio = .9)
blogsdev = subset(blogstrain, sample == FALSE)
blogstrain = subset(blogstrain, sample == TRUE)
#news
sample = sample.split(newstrain, SplitRatio = .9)
newsdev = subset(newstrain, sample == FALSE)
newstrain = subset(newstrain, sample == TRUE)
Below are the line counts of the four resulting data sets per source: sample, train, dev and test.
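A summary like the one shown below can be assembled from the objects created above, for example as in this sketch:
# sketch: number of lines per source and data set
data.frame(source = c("twitter", "blogs", "news"),
           sample = c(length(sample_twitter), length(sample_blogs), length(sample_news)),
           train  = c(length(twittertrain), length(blogstrain), length(newstrain)),
           dev    = c(length(twitterdev), length(blogsdev), length(newsdev)),
           test   = c(length(twittertest), length(blogstest), length(newstest)))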
## source sample train dev test
## 1 twitter 46900 37989 4221 4690
## 2 blogs 27128 21973 2442 2713
## 3 news 3880 3142 350 388
From now on, only the train data set is used to analyse the data and, later on, to build the predictive algorithm. The dev and test sets are held out for cross-validation and final testing on unseen data.
The code below performs the following cleaning steps on the corpus: numbers are removed, non-printable characters are replaced by spaces, the text is converted to ASCII, punctuation is removed, all text is converted to lower case, profanity is filtered out using the swear-word list from bannedwordlist.com, and redundant whitespace is stripped.
# reading swear words from http://www.bannedwordlist.com/lists/swearWords.txt
toberemoved <- readLines("./other/swearWords.txt", encoding="UTF-8")
# helper: replace non-printable characters with a space
removeGraphs <- function(x) {
  str_replace_all(x, "[^[:graph:]]", " ")
}
# cleaning function for the corpus; non-tm functions are wrapped in
# content_transformer() so the documents keep their PlainTextDocument class
cleanup <- function(x) {
  clean <- tm_map(x, removeNumbers)
  clean <- tm_map(clean, content_transformer(removeGraphs))
  clean <- tm_map(clean, content_transformer(iconv), to="ASCII", sub=" ")
  clean <- tm_map(clean, removePunctuation)
  clean <- tm_map(clean, content_transformer(tolower))
  clean <- tm_map(clean, removeWords, toberemoved) # profanity filtering
  clean <- tm_map(clean, stripWhitespace)
  clean
}
After sampling, the training data was saved into the directory train/formodel; the corpus is built from these files only. The final training corpus, called “clean”, consists of the 3 documents and has a size of 11.8 MB.
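The saving step itself is not shown above; it can be done along the following lines (a sketch with assumed file names):
# sketch: write the three training samples to train/formodel for corpus creation
dir.create("./train/formodel", recursive = TRUE, showWarnings = FALSE)
writeLines(twittertrain, "./train/formodel/twitter_train.txt")
writeLines(blogstrain, "./train/formodel/blogs_train.txt")
writeLines(newstrain, "./train/formodel/news_train.txt")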
#creating train corpus
data <- VCorpus(DirSource("./train/formodel", encoding = "UTF-8"),
readerControl = list(language = "en")) # 11.8 Mb
# cleaning train corpus
clean <- cleanup(data) # see function above
# document-term matrix and word frequencies, sorted in descending order
word_matrix <- DocumentTermMatrix(clean, control=list(wordLengths=c(1,Inf)))
wordFreq <- colSums(as.matrix(word_matrix))
wf <- data.frame(word = names(wordFreq), freq = wordFreq)
wf <- wf[order(-wf$freq), ]
## word freq
## 1 the 66624
## 2 to 41902
## 3 and 36254
## 4 a 34255
## 5 i 30970
## 6 of 29597
The frequency analysis and the histogram show that a few words have a very high frequency:
the = 66624, to = 41902, and = 36254.
Of the 69,915 unique words, 38,819 appear only once and 60,582 have a frequency below 10.
Just 120 unique words are needed to cover 50% of all word occurrences in this set; 6,921 unique words are needed to cover 90%.
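These coverage figures can be computed from the sorted frequency table wf created above, for example as in this sketch:
# cumulative share of all word occurrences covered by the most frequent words
coverage <- cumsum(wf$freq) / sum(wf$freq)
min(which(coverage >= 0.5))  # unique words needed to cover 50%
min(which(coverage >= 0.9))  # unique words needed to cover 90%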
From the corpus, bigrams (2-word sequences) and trigrams (3-word sequences) are now generated using the NGramTokenizer from the RWeka package.
# helper: turn a TermDocumentMatrix into a data frame of term frequencies, sorted descending
processing <- function(x){
  token <- rowSums(as.matrix(x))
  token <- data.frame(term = names(token), freq = token)
  token[order(-token$freq), ]
}
options(mc.cores=1) # avoid parallel tokenization problems between RWeka (rJava) and tm
# Bigrams
BigramToken <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bitoken <- TermDocumentMatrix(clean, control = list(tokenize = BigramToken))
bf <- processing(bitoken)
#Trigrams
TrigramToken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tritoken <- TermDocumentMatrix(clean, control = list(tokenize = TrigramToken))
tf <- processing(tritoken)
Looking at the bigram frequencies below, a few terms occur very often, e.g. ‘of the’ and ‘in the’ with more than 5000 occurrences each. Further analysis shows that there are around 60000 unique bigram phrases in total, of which around 45000 (79%) occur only once.
The trigram distribution gives a similar picture: phrases such as “one of the” and “a lot of” appear more than 400 times, while of the roughly 1000000 unique trigrams around 990000 (92%) occur only once.
I also analysed fourgrams, where 97% of all phrases have a frequency of 1. The corresponding plot is not included in this report.
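The six most frequent bigrams and trigrams shown below can be printed along these lines (a sketch; the exact display code is not part of this report):
# six most frequent bigrams and trigrams; row names reset for readability
head(data.frame(bf, row.names = NULL), 6)
head(data.frame(tf, row.names = NULL), 6)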
## term freq
## 1 of the 6108
## 2 in the 5579
## 3 to the 3120
## 4 on the 2849
## 5 for the 2736
## 6 to be 2528
## term freq
## 1 one of the 476
## 2 a lot of 437
## 3 thanks for the 365
## 4 i want to 287
## 5 going to be 256
## 6 to be a 250
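The singleton shares quoted above can be checked directly on the frequency tables, for example with this sketch:
# share of bigrams / trigrams that occur exactly once
sum(bf$freq == 1) / nrow(bf)
sum(tf$freq == 1) / nrow(tf)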
Based on the above analysis of the word and phrase frequency distributions, my first model approach will use only the trigrams, the bigrams and, as a last resort, the 3 most common words: the, to, and.
These are combined using a “Stupid Backoff” model. According to the literature, this simple and inexpensive model seems to give quite good prediction results.
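A minimal sketch of the planned lookup is given below; predictNext() is only an illustrative helper (not the final implementation) and assumes the frequency tables wf, bf and tf created above, each sorted by decreasing frequency.
# sketch of the planned prediction: trigram match -> back off to bigram -> top unigrams
predictNext <- function(phrase, n = 3) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  if (length(words) == 2) {
    # trigrams whose first two words match the last two typed words
    hits <- tf[grepl(paste0("^", words[1], " ", words[2], " "), tf$term), ]
    if (nrow(hits) > 0)
      return(head(sapply(strsplit(as.character(hits$term), " "), `[`, 3), n))
  }
  # back off: bigrams starting with the last typed word
  last <- tail(words, 1)
  hits <- bf[grepl(paste0("^", last, " "), bf$term), ]
  if (nrow(hits) > 0)
    return(head(sapply(strsplit(as.character(hits$term), " "), `[`, 2), n))
  # final fallback: the most common unigrams
  head(as.character(wf$word), n)
}
predictNext("I went to the")   # returns the three most frequent continuations found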
Testing still needs to be done on whether a better way of handling “unseen” phrases than simply backing off to a lower-order n-gram can be implemented; smoothing and/or interpolation methods could be considered here.
The final Shiny application will have a user-friendly interface where the user can type a phrase and is shown 3 suggested words, which can be appended to the text by clicking on them.
References: