Introduction

This document is a milestone report for the Coursera Data Science Capstone project. The goal of the project is to build a model that predicts the next word(s) based on the previous input, and eventually to create a demonstration application.

At this stage, a corpus dataset from SwiftKey has been loaded and explored. Further plans for the prediction algorithm and the final product are outlined in a way that a non-data-scientist manager can appreciate.

Load the dataset

The given data at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip has been downloaded and unzipped to the working directory. Parallel file collections in four languages are provided. Here I focus only on the “en_US” files.
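
For reference, a rough sketch of the download and extraction step (run once; the zip is assumed to unpack into the "final/" directory used below):

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("final")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")  # unpacks the "final/" directory with one subfolder per language
}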

list.files("final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
file.size(paste0("final/en_US/",list.files("final/en_US")))
## [1] 210160014 205811889 167105338

There are 3 text files containing blog, news and Twitter collections. Each is roughly 200 MB, which is small enough to be loaded fully into memory.

library(readr)
blogs <- read_lines("final/en_US/en_US.blogs.txt")
news <- read_lines("final/en_US/en_US.news.txt")
twitter <- read_lines("final/en_US/en_US.twitter.txt")

Here is a summary (character and word counts) of each file:

library(stringi)
stri_stats_latex(blogs)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##     162464653             9      42636700      37570839             3 
##        Envirs 
##             0
stri_stats_latex(news)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##     162227130             2      40263955      34494539             1 
##        Envirs 
##             0
stri_stats_latex(twitter)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##     125570616          3032      35958481      30451128           963 
##        Envirs 
##             0

Data cleaning

To speed up the analysis, a subset of the corpus is created.

library(NLP)
library(tm)
blogs <- read_lines("final/en_US/en_US.blogs.txt", n_max = 10000)
news <- read_lines("final/en_US/en_US.news.txt", n_max = 10000)
twitter <- read_lines("final/en_US/en_US.twitter.txt", n_max = 10000)
mycorpus <- VCorpus(VectorSource(paste(blogs, news, twitter)))

Several standard text-cleaning steps, such as lowercasing all words and removing numbers, punctuation and extra white space, are done with tools from the tm package:

mycorpus <- tm_map(mycorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, stripWhitespace)

Other typical text-cleaning procedures, such as stop-word removal and stemming, are skipped here, because I don’t think they are appropriate for a typing assistant. They are only needed when examining the meaning of the text.
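
For reference only (not run in this analysis), those steps would look like this in tm; stemming additionally needs the SnowballC package:

# mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))  # stop-word removal
# mycorpus <- tm_map(mycorpus, stemDocument)                       # stemming (requires SnowballC)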

N-Gram analysis

The most frequent words are listed as follows:

dtm <- DocumentTermMatrix(mycorpus)
findFreqTerms(dtm, 5000)
## [1] "and"  "for"  "that" "the"  "was"  "with" "you"

But next-word prediction is based not on single words but on word combinations, so N-gram analysis is introduced.

Here we build a bi-gram list from the corpus subset and plot the top 20.

library(slam)
library(RWeka)
getN2 <- function(x){NGramTokenizer(x, Weka_control(min=2, max=2))}
dtmN2 <- DocumentTermMatrix(mycorpus, control = list(tokenize=getN2))
dtmN2list <- col_sums(dtmN2)
dtmN2list <- sort(dtmN2list, decreasing = TRUE)
barplot(head(dtmN2list,20), 
        horiz = TRUE, las = 1, 
        xlab = "Count",
        main = "Top 20 Bi-grams")

Plotting the whole bi-gram frequency distribution, we can see that frequently used word combinations are actually quite few (Zipf’s law): 80.7% of bi-grams occur only once.

plot(dtmN2list)

length(dtmN2list)
## [1] 417317
length(dtmN2list[dtmN2list==1])
## [1] 336913
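
The 80.7% figure quoted above is simply the ratio of these two counts:

round(100 * sum(dtmN2list == 1) / length(dtmN2list), 1)
## [1] 80.7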

A similar analysis is done for tri-grams. Here we visualise them with a word cloud.

getN3 <- function(x){NGramTokenizer(x, Weka_control(min=3, max=3))}
dtmN3 <- DocumentTermMatrix(mycorpus, control = list(tokenize=getN3))
dtmN3list <- col_sums(dtmN3)
dtmN3list <- sort(dtmN3list, decreasing = TRUE)
library(RColorBrewer)
library(wordcloud)
wordcloud(names(dtmN3list), freq = dtmN3list,
          max.words = 30, random.order = F,
          rot.per = 0.3, colors = brewer.pal(12, "Set3"))

Prediction strategy and plans for the product

The prediction strategy will be finalised in the next few weeks. Based on the N-gram analysis, the predictive algorithm could be built upon smoothed tri-gram and bi-gram results.
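
As an illustration of the intended direction (not the final algorithm), a naive back-off lookup over the tri-gram and bi-gram tables built above might look like the following; predictNext is a hypothetical helper name:

# Rough back-off sketch: try tri-grams first, then fall back to bi-grams.
# The frequency tables are already sorted, so matches come out ordered by count.
predictNext <- function(input, n = 5) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 2)
  cand <- numeric(0)
  if (length(words) == 2) {
    prefix <- paste0("^", paste(words, collapse = " "), " ")
    cand <- dtmN3list[grepl(prefix, names(dtmN3list))]        # tri-gram matches
  }
  if (length(cand) == 0) {
    prefix <- paste0("^", tail(words, 1), " ")
    cand <- dtmN2list[grepl(prefix, names(dtmN2list))]        # bi-gram back-off
  }
  # predicted word = last token of each matching n-gram
  sapply(strsplit(names(head(cand, n)), " "), tail, 1)
}
predictNext("one of")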

To keep the final application’s footprint as small as possible, only high-frequency n-grams will be retained as a first step. The smoothing method will be tested afterwards.
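
For instance, a possible pruning step (threshold still to be tuned) would drop n-grams seen only once, which already shrinks the bi-gram table to roughly 19% of its size:

dtmN2small <- dtmN2list[dtmN2list > 1]   # keep bi-grams seen more than once (~19% of the table)
dtmN3small <- dtmN3list[dtmN3list > 1]   # same for tri-grams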

User input is intended to be learned automatically. For example, if the user types his full name twice, then his last name will be predicted when he types his first name again.
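
A hypothetical way to fold user input back into the model would be to update the bi-gram counts after each submission, for example:

# Hypothetical sketch: add the user's own bi-grams to the frequency table.
updateWithInput <- function(text, counts) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(tokens) < 2) return(counts)
  bigrams <- paste(head(tokens, -1), tail(tokens, -1))
  for (bg in bigrams) {
    if (bg %in% names(counts)) counts[bg] <- counts[bg] + 1 else counts[bg] <- 1
  }
  counts
}
dtmN2list <- updateWithInput("John Smith", dtmN2list)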

The application will use the Shiny framework. It will take the user’s input words, but only the last two words will be taken into account. When the Enter key is pressed, up to five candidate words will be displayed, ordered by estimated probability. I will try to code with base R for better performance.
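
A minimal Shiny skeleton along these lines (assuming the predictNext sketch above; widget names are placeholders, and an action button stands in for the Enter key) could be:

library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Type a phrase:"),
  actionButton("go", "Predict"),            # stands in for the Enter key in this sketch
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    input$go                                # re-run only when the button is pressed
    isolate(predictNext(input$phrase, n = 5))  # show up to 5 candidates
  })
}

shinyApp(ui, server)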

A test corpus will be prepared to evaluate different model configurations.
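
One simple evaluation would be to hold out lines not used for training and measure top-5 next-word accuracy, e.g.:

# Rough evaluation sketch: top-5 accuracy on held-out lines (skipping the training lines).
testLines <- read_lines("final/en_US/en_US.blogs.txt", skip = 10000, n_max = 1000)
hits <- 0; total <- 0
for (line in testLines) {
  tokens <- strsplit(tolower(line), "\\s+")[[1]]
  if (length(tokens) < 3) next
  pred <- predictNext(paste(tokens[1], tokens[2]), n = 5)  # predict the 3rd word from the first two
  total <- total + 1
  if (tokens[3] %in% pred) hits <- hits + 1
}
hits / total                                               # top-5 accuracy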