This is a milestone paper for the Coursera/JHU Capstone project on building an algorithm to predict the next word to be typed in a mobile application. The steps described in this paper are meant to explain, to a reader with a limited technical background, the following major steps towards building a real app:
- loading data into R
- building Corpus
- exploratory analysis of the training data set.
Feedback from fellow students is highly appreciated as well!
The training data for building the model is from HC Corpora http://www.corpora.heliohost.org. The data for the Capstone Project is downloadable as a ‘.zip’ file from that location (a scripted download sketch follows the list below). The zipped file has been downloaded and unzipped into R’s default working directory, so that the three English-language files of interest can be accessed from the following locations:
- "./final/en_US/en_US.twitter.txt"
- "./final/en_US/en_US.news.txt"
- "./final/en_US/en_US.blogs.txt"
Before reading the data into R, let’s first check the encoding of the downloaded files:
# check file encodings (tau's is.utf8/is.ascii operate on text, so check a sample of lines)
library(tau)
twitterSample <- readLines("./final/en_US/en_US.twitter.txt", n = 10000, skipNul = TRUE)
all(is.utf8(twitterSample))   # are all sampled lines valid UTF-8?
all(is.ascii(twitterSample))  # are all sampled lines plain ASCII?
As it turns out, the data is largely plain ASCII, but it also contains Unicode and other non-printable characters that are assumed not to be very useful for text prediction and therefore need to be cleaned. After reading the data into R, I replace the non-printable characters with spaces:
# read data into R
require(knitr)
twitter <- readLines("./final/en_US/en_US.twitter.txt", skipNul = TRUE)
blog <- readLines("./final/en_US/en_US.blogs.txt" , skipNul = TRUE)
news <- readLines("./final/en_US/en_US.news.txt" , skipNul = TRUE)
closeAllConnections()
# leave alphanumerics, punctuation and spaces only
tw <- gsub("[^[:print:]]", " ", twitter)
bl <- gsub("[^[:print:]]", " ", blog)
ns <- gsub("[^[:print:]]", " ", news)
Since the data set is quite large, let’s reduce it by sampling 10-20% of each file. This should speed things up at the model building stage.
# Sample subset to build and explore model
set.seed(1)
twitterSmpl <- tw[as.logical(rbinom(length(tw), 1, prob=.1))]
newsSmpl <- ns[as.logical(rbinom(length(ns), 1, prob=.2))]
blogSmpl <- bl[as.logical(rbinom(length(bl), 1, prob=.1))]
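A quick check that the realized sample sizes are close to the intended fractions:
# realized sampling fractions (expected: ~0.10, ~0.20, ~0.10)
length(twitterSmpl) / length(tw)
length(newsSmpl) / length(ns)
length(blogSmpl) / length(bl)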
Finally, to prepare for building the corpus, let’s concatenate the sampled lines so that each source becomes a single document, rather than every line being treated as a separate document.
# Collapse each source into one document (to avoid one metadata entry per line)
library(qdap)
twitterOne <- paste2(twitterSmpl, " ")
newsOne <- paste2(newsSmpl, " ")
blogOne <- paste2(blogSmpl, " ")
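The same collapsing can be done with base R only (assuming paste2 is used here purely to join all lines into a single string), which avoids the qdap dependency:
# base R alternative: join all sampled lines into one long string per source
twitterOne <- paste(twitterSmpl, collapse = " ")
newsOne <- paste(newsSmpl, collapse = " ")
blogOne <- paste(blogSmpl, collapse = " ")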
A standard way to manipulate unstructured data like text is to load it into a corpus. In R this is done with the VCorpus() function from the tm package:
library(tm)
options(mc.cores=1) # make the tm package run more reliably (avoid parallel-processing issues)
rawCorpus <- VCorpus(VectorSource(list(twitterOne, newsOne, blogOne)),
readerControl = list(language="english"))
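A quick sanity check on the resulting corpus (a sketch; the exact output is not shown here):
# basic checks: number of documents, metadata and a peek at the first document
print(rawCorpus)
meta(rawCorpus[[1]])
substr(as.character(rawCorpus[[1]]), 1, 100)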
To prepare for tokenization, let’s clean rawCorpus with the facilities available in the tm package:
- transform to lower case
- stem the documents (questionable; needs to be tested further at model tuning)
- remove numbers, and
- strip extra white space
At this stage I decided not to remove any stop words.
# Clean Corpus
library(SnowballC)  # supplies the stemmer used by stemDocument
clean <- function(x) {
  x <- tm_map(x, content_transformer(tolower))  # lower case
  x <- tm_map(x, stemDocument)                  # stem words (to be re-evaluated)
  x <- tm_map(x, removeNumbers)                 # drop digits
  tm_map(x, stripWhitespace)                    # collapse repeated spaces
}
cleanCorpus <- clean(rawCorpus)
rm(rawCorpus)
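A short peek at the cleaned text helps verify that the transformations took effect (lower case, no digits):
# first 100 characters of the cleaned twitter document
substr(as.character(cleanCorpus[[1]]), 1, 100)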
To explore the data, let’s tokenize cleanCorpus. Tokenization with R’s tm package is done by applying the TermDocumentMatrix() function. The result is a sparse matrix with all the unique terms as row names and the per-document term counts in the columns.
tdm <- TermDocumentMatrix(cleanCorpus)
The first elements of the matrix show that the data sample was perhaps under-cleaned:
> inspect(tdm[1:10,1:3])
Non-/sparse entries: 13/17
Sparsity : 57%
Maximal term length: 4
Weighting : term frequency (tf)
Docs
Terms 1 2 3
`_` 1 0 0
^^^ 1 0 0
^^^^ 1 0 0
^^;; 2 0 1
^^. 1 0 1
^^" 0 0 1
^~^ 1 0 0
^~) 1 0 0
^_^ 84 0 7
^_^, 1 0 0
"
Words that are encountered 20,000 times or more:
> findFreqTerms(tdm, lowfreq=20000)
[1] "about" "all" "and" "are" "been" "but" "can" "for" "from"
[10] "get" "had" "has" "have" "her" "his" "how" "just" "like"
[19] "make" "more" "new" "not" "one" "our" "out" "said" "she"
[28] "some" "that" "the" "their" "there" "they" "this" "time" "was"
[37] "were" "what" "when" "who" "will" "with" "would" "you" "your"
This list coincides for the most part with standard English stop words, which perhaps warrants revisiting the cleaning step and removing stop words there (see the sketch below).
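A sketch of that step, using tm’s built-in English stop word list (noStops is a name introduced here; whether removing stop words actually helps next-word prediction still needs testing):
# possible extra step: remove standard English stop words, then re-strip spaces
noStops <- tm_map(cleanCorpus, removeWords, stopwords("english"))
noStops <- tm_map(noStops, stripWhitespace)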
Counts of these frequent words in each of the three documents:
> ind <- findFreqTerms(tdm, lowfreq=20000)
> inspect(tdm[ind,])
Non-/sparse entries: 135/0
Sparsity : 0%
Maximal term length: 5
Weighting : term frequency (tf)
Docs
Terms 1 2 3
about 8852 17517 10911
all 11053 11831 13089
and 43039 173610 107731
are 15503 27101 18912
been 4228 13233 7556
but 12119 28654 19673
can 8571 11483 9703
for 38103 69576 36086
from 8117 30123 14719
get 14277 11753 9383
had 4166 16373 10495
has 4477 24031 9410
have 17900 30122 23617
her 3021 13042 9946
his 3299 30958 10746
how 7533 6735 6183
just 14921 10507 9889
like 12426 11379 10462
make 7174 10315 7893
more 5713 16731 8597
new 6763 13584 5322
not 11709 21391 16661
one 7266 15687 11720
our 6320 7507 8477
out 9494 12654 9275
said 1553 28341 2395
she 3448 15523 9412
some 6160 9983 8785
that 25364 70157 44993
the 92069 387003 185000
their 2758 17211 9151
there 5544 10528 8718
they 6727 21622 13889
this 15122 21925 24232
time 6246 9172 7651
was 11434 44848 27521
were 2539 14282 7636
what 12344 11343 10726
when 8021 14901 10290
who 5894 21992 8303
will 9236 22162 11572
with 17061 50387 28160
would 5111 14203 7996
you 47289 16842 27062
your 17252 6242 10133
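To compare the sources in a single view, the counts of these frequent terms can be summed over the three documents and sorted; a sketch (freqTerms and termTotals are names introduced here):
# total counts of the frequent terms across all three documents
freqTerms <- findFreqTerms(tdm, lowfreq = 20000)
termTotals <- sort(rowSums(as.matrix(tdm[freqTerms, ])), decreasing = TRUE)
head(termTotals, 10)
barplot(head(termTotals, 10), las = 2, main = "Most frequent terms in the sample")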
Reduction in data size due to sampling (counted as the number of lines):
# number of lines in the full, raw data inputs, as downloaded from the web
> sapply(list(twitter, blog, news), length)
[1] 2360148 899288 1010242
# number of lines in the sampled data: 10%, 10%, 20% via rbinom()
> sapply(list(twitterSmpl, blogSmpl, newsSmpl), length)
[1] 235891 89610 201870
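The same numbers can be collected into a small summary table (a sketch using the objects above):
# summary of the reduction in size due to sampling
data.frame(source = c("twitter", "blogs", "news"),
           linesFull = sapply(list(twitter, blog, news), length),
           linesSampled = sapply(list(twitterSmpl, blogSmpl, newsSmpl), length))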