Introduction

In this report we discuss the preliminary steps in building a text prediction model. We are given three data sets consisting of blog posts, news articles and tweets. A subset of each data set is sampled and read into memory, and a text corpus is generated from the sampled data. Thereafter, we examine the unigram, bigram and trigram frequencies. Lastly, we discuss the next steps towards building a language model.

Raw Data

The raw text data provided are large files composed of blog posts, news articles and tweets. A sampled subset of each file is read into memory and formatted.

# Read a 5,000-line sample from each file
textTwitter <- readLines("./data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", n = 5000, skipNul = TRUE)
textBlogs   <- readLines("./data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", n = 5000, skipNul = TRUE)
textNews    <- readLines("./data/final/en_US/en_US.news.txt", encoding = "UTF-8", n = 5000, skipNul = TRUE)

# Convert to latin1, replacing characters that cannot be mapped with a space
textTwitter <- iconv(textTwitter, from = "UTF-8", to = "latin1", sub = " ")
textBlogs   <- iconv(textBlogs, from = "UTF-8", to = "latin1", sub = " ")
textNews    <- iconv(textNews, from = "UTF-8", to = "latin1", sub = " ")

Below we consider some basic attributes of our raw data.

File Name          File Size (MB)  Description     No. of Characters Sampled  No. of Words Sampled  No. of Sentences Sampled
en_US.blogs.txt    210.16          Blog posts      1,148,089                  215,696               12,372
en_US.news.txt     205.81          News articles   1,015,886                  183,014               10,837
en_US.twitter.txt  167.11          Twitter tweets  340,766                    68,832                904
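
These figures can be reproduced roughly as follows; the word- and sentence-splitting patterns below are illustrative assumptions, so the exact counts may differ slightly.

# Sketch of the summary statistics above; the splitting rules are approximate
sampleStats <- function(x, path) {
  c(sizeMB     = round(file.size(path) / 1024^2, 2),
    characters = sum(nchar(x)),
    words      = sum(lengths(strsplit(x, "\\s+"))),
    sentences  = sum(lengths(strsplit(x, "[.!?]+"))))
}

rbind(en_US.blogs.txt   = sampleStats(textBlogs, "./data/final/en_US/en_US.blogs.txt"),
      en_US.news.txt    = sampleStats(textNews, "./data/final/en_US/en_US.news.txt"),
      en_US.twitter.txt = sampleStats(textTwitter, "./data/final/en_US/en_US.twitter.txt"))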

Corpus Creation and Processing

The sampled data is merged to create a corpus.

library(tm)
# Merge the three samples into a single tm corpus; paste() joins the i-th blog,
# news and twitter lines into one combined document
docs <- VCorpus(VectorSource(paste(textBlogs, textNews, textTwitter)))

Thereafter, we clean and process the corpus to make it ready for further analysis. For example, the text is converted to lowercase; numbers, punctuation, English stop words and profanity are removed; and the remaining words are stemmed.

# Helper transformers: replace a pattern with a space, or with arbitrary text
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
toOther <- content_transformer(function(x, pattern, y) gsub(pattern, y, x))

docs <- tm_map(docs, content_transformer(tolower))        # convert to lowercase

docs <- tm_map(docs, toSpace, "/|@|\\|")                  # replace /, @ and | with spaces

docs <- tm_map(docs, removeNumbers)                       # remove numbers
docs <- tm_map(docs, removePunctuation)                   # remove punctuation
docs <- tm_map(docs, removeWords, stopwords("english"))   # remove English stop words
docs <- tm_map(docs, stripWhitespace)                     # collapse repeated whitespace


swearWords <- readLines("./data/final/en_US/swearWords.txt", encoding = "UTF-8")  # read in list of swear words

docs <- tm_map(docs, removeWords, swearWords)   # remove profanity

docs <- tm_map(docs, stemDocument)              # stem the remaining words

# Rejoin contraction fragments (e.g. "don t") left behind by the cleaning steps
docs <- tm_map(docs, toOther, "don t", "dont")
docs <- tm_map(docs, toOther, "didn t", "didnt")
docs <- tm_map(docs, toOther, "can t", "cant")
docs <- tm_map(docs, toOther, "isn t", "isnt")
docs <- tm_map(docs, toOther, "won t", "wont")
docs <- tm_map(docs, toOther, "let s", "lets")
docs <- tm_map(docs, toOther, "doesn t", "doesnt")
docs <- tm_map(docs, toOther, "u u u", "")      # drop the recurring "u u u" sequence

Tokenization

Having processed our corpus, we now break the text into the desired chunks. For example, tokenizing into unigrams, bigrams and trigrams groups the data into sequences of one, two and three words, respectively.

library(RWeka)

# n-gram tokenizers for unigrams, bigrams and trigrams
tokenizer.u <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tokenizer.b <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tokenizer.t <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

options(mc.cores = 1)  # use a single core; RWeka tokenizers can fail under tm's parallel processing

dtm.u <- DocumentTermMatrix(docs, control = list(tokenize = tokenizer.u))
dtm.b <- DocumentTermMatrix(docs, control = list(tokenize = tokenizer.b))
dtm.t <- DocumentTermMatrix(docs, control = list(tokenize = tokenizer.t))

After tokenizing, we compute and plot the frequencies of unigrams, bigrams and trigrams.

# Aggregate n-gram counts across documents and sort in decreasing order of frequency
freq.u <- sort(colSums(as.matrix(dtm.u)), decreasing = TRUE)
freq.b <- sort(colSums(as.matrix(dtm.b)), decreasing = TRUE)
freq.t <- sort(colSums(as.matrix(dtm.t)), decreasing = TRUE)

# Build word/frequency data frames for plotting
wf.u <- data.frame(words = names(freq.u), freq = freq.u)
wf.b <- data.frame(words = names(freq.b), freq = freq.b)
wf.t <- data.frame(words = names(freq.t), freq = freq.t)
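
A minimal sketch of one of the frequency plots mentioned above, assuming the ggplot2 package, for the 20 most frequent unigrams (the bigram and trigram plots follow the same pattern):

library(ggplot2)

# Bar chart of the 20 most frequent unigrams; wf.b and wf.t can be plotted the same way
ggplot(head(wf.u, 20), aes(x = reorder(words, -freq), y = freq)) +
  geom_col() +
  labs(x = "Unigram", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))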

Modeling

Theoretically, we can think of the probability of the next word as conditioned on all of the words in the sentence leading up to it.

\[P(word_i|word_1 word_2...word_{i-1})\]

However, this approach is too computationally expensive. A practical solution is to make a Markov assumption, in which the probability is conditioned only on the last k words. For example, for a trigram model we assume the following.

\[P(word_i|word_1 word_2...word_{i-1})\approx P(word_i|word_{i-2} word_{i-1})\]

To compute the n-gram model probabilities, we can use maximum likelihood estimation (MLE). In the case of a trigram model, we would compute as follows:

\[P(word_3|word_1 word_2)=\frac{Count(word_1,word_2,word_3)}{Count(word_1,word_2)}\]
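
Using the frequency vectors computed above, a hedged sketch of this estimate in R might look as follows; the function name and arguments are illustrative, and the words passed in must match the processed corpus (lowercased and stemmed).

# MLE estimate of P(w3 | w1 w2) from the trigram and bigram counts computed above;
# returns NA when the bigram (w1, w2) was never observed in the sample
trigramProb <- function(w1, w2, w3) {
  triCount <- freq.t[paste(w1, w2, w3)]
  biCount  <- freq.b[paste(w1, w2)]
  if (is.na(biCount) || biCount == 0) return(NA)
  unname(ifelse(is.na(triCount), 0, triCount) / biCount)
}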

Thereafter, we’ll have to contend with rare/unseen words and spell correction.
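
One standard option for handling unseen n-grams, for example, is add-one (Laplace) smoothing, where V denotes the vocabulary size:

\[P_{Laplace}(word_3|word_1 word_2)=\frac{Count(word_1,word_2,word_3)+1}{Count(word_1,word_2)+V}\]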