Synopsis

This is a milestone report for the Coursera/JHU Capstone project on building an algorithm that predicts the next word typed in a mobile application. The steps described in this paper are meant to explain, to a person with a limited technical background, the following major steps leading towards building a real app:
- loading data into R
- building a corpus
- exploratory analysis of the training data set.
Feedback from fellow students is highly appreciated as well!

Download and preliminary data cleaning

The training data for building the model comes from HC Corpora (http://www.corpora.heliohost.org). The data for the Capstone project is downloadable as a .zip file from this location. The zipped file has been downloaded and unzipped into R’s default working directory, so that the three English-language files of interest can be accessed from the following locations:
- "./final/en_US/en_US.twitter.txt"
- "./final/en_US/en_US.news.txt"
- "./final/en_US/en_US.blogs.txt"

Before reading the data into R, let’s first check the encoding of the downloaded files:

# check file encodings (tau's is.utf8()/is.ascii() operate on character
# vectors, so inspect a sample of lines rather than the file path itself)
library(tau)
twSample <- readLines("./final/en_US/en_US.twitter.txt", n = 10000, skipNul = TRUE)
all(is.utf8(twSample))
all(is.ascii(twSample))

As it turns out, the data is mostly ASCII but contains a number of non-printable/unicode characters that are assumed not to be very useful for text prediction and should therefore be cleaned. After reading the data into R, I replace the non-printable characters with spaces.

# read data into R
require(knitr)
twitter <- readLines("./final/en_US/en_US.twitter.txt", skipNul = TRUE)
blog    <- readLines("./final/en_US/en_US.blogs.txt"  , skipNul = TRUE)
news    <- readLines("./final/en_US/en_US.news.txt"   , skipNul = TRUE)
closeAllConnections()

# leave alphanumerics, punctuation and spaces only
tw <- gsub("[^[:print:]]", "  ", twitter)
bl <- gsub("[^[:print:]]", "  ", blog)
ns <- gsub("[^[:print:]]", "  ", news)

As the data set is rather large, let’s reduce it by sampling 10-20% of the available data. This should speed up processing at the model-building stage.

# Sample subset to build and explore model
set.seed(1)
twitterSmpl <- tw[as.logical(rbinom(length(tw), 1, prob=.1))]
newsSmpl    <- ns[as.logical(rbinom(length(ns), 1, prob=.2))]
blogSmpl    <- bl[as.logical(rbinom(length(bl), 1, prob=.1))]

Finally, to prepare for building the corpus, let’s concatenate the lines so that each separate line is not treated as a separate document.

# Collapse each sample into one document (to avoid numerous metadata entries)
library(qdap)
twitterOne  <- paste2(twitterSmpl, " ")
newsOne     <- paste2(newsSmpl, " ")
blogOne     <- paste2(blogSmpl, " ")

Build Corpus

A standard way to manipulate unstructured data like text is to load it into a corpus. In R this is done with the VCorpus() function from the tm package:

library(tm)
options(mc.cores=1) # makes the tm package run more reliably
rawCorpus <- VCorpus(VectorSource(list(twitterOne, newsOne, blogOne)),
                     readerControl = list(language="english"))

To prepare for tokenization, let’s clean rawCorpus with the facilities available in the tm package:
- transform to lower case
- stem (questionable; needs further testing during model tuning)
- remove numbers and
- strip white space

At this stage I decided not to remove any stop words.

# Clean Corpus
library(SnowballC)
clean <- function(y) {
        rawCorpus <- tm_map(y, content_transformer(tolower))
        rawCorpus <- tm_map(rawCorpus, stemDocument)
        rawCorpus <- tm_map(rawCorpus, removeNumbers)
        tm_map(rawCorpus, stripWhitespace)
}
cleanCorpus <- clean(rawCorpus)
rm(rawCorpus)

Explore data

To explore the data, let’s tokenize cleanCorpus. Tokenization with R’s tm package is done by applying the TermDocumentMatrix() function. The result is a sparse matrix with all the unique terms as row names and the number of mentions of each term per document in the columns.

tdm <- TermDocumentMatrix(cleanCorpus)

The first elements of the matrix show that the data sample was perhaps under-cleaned:

> inspect(tdm[1:10,1:3])

Non-/sparse entries: 13/17
Sparsity           : 57%
Maximal term length: 4
Weighting          : term frequency (tf)

      Docs
Terms   1 2 3
  `_`   1 0 0
  ^^^   1 0 0
  ^^^^  1 0 0
  ^^;;  2 0 1
  ^^.   1 0 1
  ^^"   0 0 1
  ^~^   1 0 0
  ^~)   1 0 0
  ^_^  84 0 7
  ^_^,  1 0 0
"

Words that are encountered 20,000 times or more:

> findFreqTerms(tdm, lowfreq=20000)
 [1] "about" "all"   "and"   "are"   "been"  "but"   "can"   "for"   "from" 
[10] "get"   "had"   "has"   "have"  "her"   "his"   "how"   "just"  "like" 
[19] "make"  "more"  "new"   "not"   "one"   "our"   "out"   "said"  "she"  
[28] "some"  "that"  "the"   "their" "there" "they"  "this"  "time"  "was"  
[37] "were"  "what"  "when"  "who"   "will"  "with"  "would" "you"   "your" 

This list of words encountered 20,000 times or more coincides for the most part with standard stop words, which perhaps warrants revisiting the cleaning step and removing stop words there.
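If stop-word removal turns out to be useful at the model-tuning stage, it could be added with tm’s removeWords(); a minimal sketch, not applied in this report:

# possible future cleaning step: drop standard English stop words
noStopCorpus <- tm_map(cleanCorpus, removeWords, stopwords("english"))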

Count the frequencies of these most common words:
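The index vector ind used below was not defined in the chunks above; a plausible way to construct it (an assumption on my part) is:

# index of the frequent terms found by findFreqTerms() above
# (assumed; the original definition of 'ind' is not shown)
freqTerms <- findFreqTerms(tdm, lowfreq = 20000)
ind <- which(Terms(tdm) %in% freqTerms)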

> inspect(tdm[ind,])

Non-/sparse entries: 135/0
Sparsity           : 0%
Maximal term length: 5
Weighting          : term frequency (tf)

       Docs
Terms       1      2      3
  about  8852  17517  10911
  all   11053  11831  13089
  and   43039 173610 107731
  are   15503  27101  18912
  been   4228  13233   7556
  but   12119  28654  19673
  can    8571  11483   9703
  for   38103  69576  36086
  from   8117  30123  14719
  get   14277  11753   9383
  had    4166  16373  10495
  has    4477  24031   9410
  have  17900  30122  23617
  her    3021  13042   9946
  his    3299  30958  10746
  how    7533   6735   6183
  just  14921  10507   9889
  like  12426  11379  10462
  make   7174  10315   7893
  more   5713  16731   8597
  new    6763  13584   5322
  not   11709  21391  16661
  one    7266  15687  11720
  our    6320   7507   8477
  out    9494  12654   9275
  said   1553  28341   2395
  she    3448  15523   9412
  some   6160   9983   8785
  that  25364  70157  44993
  the   92069 387003 185000
  their  2758  17211   9151
  there  5544  10528   8718
  they   6727  21622  13889
  this  15122  21925  24232
  time   6246   9172   7651
  was   11434  44848  27521
  were   2539  14282   7636
  what  12344  11343  10726
  when   8021  14901  10290
  who    5894  21992   8303
  will   9236  22162  11572
  with  17061  50387  28160
  would  5111  14203   7996
  you   47289  16842  27062
  your  17252   6242  10133

Reduction in data size due to sampling (counted as number of lines):

# # of lines in full, raw data inputs, as downloaded from web
> sapply(list(twitter, blog, news), length)
[1] 2360148  899288 1010242
# # of lines in sampled data: 10%, 10%, 20% via rbinom()
> sapply(list(twitterSmpl, blogSmpl, newsSmpl), length)
[1] 235891  89610 201870
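
As a quick sanity check of the sampling fractions (roughly 10%, 10% and 20%), the retained share of lines can be computed directly:

# share of lines retained in each sample
round(sapply(list(twitterSmpl, blogSmpl, newsSmpl), length) /
      sapply(list(twitter, blog, news), length), 3)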