Introduction

In this report we discuss the preliminary steps in building a text prediction model. We are given three data sets consisting of blog posts, news articles and tweets. A subset of each data set is sampled and read into memory, and a text corpus is generated from the sampled data. Thereafter, we examine the unigram, bigram and trigram frequencies. Lastly, we discuss the next steps towards building a language model.

Raw Data

The raw text data provided are large files composed of blog posts, news articles and tweets. A sampled subset of each file is read into memory and formatted.

# Read a 5,000-line sample from each file
textTwitter <- readLines("./data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", n = 5000, skipNul = TRUE)
textBlogs   <- readLines("./data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", n = 5000, skipNul = TRUE)
textNews    <- readLines("./data/final/en_US/en_US.news.txt", encoding = "UTF-8", n = 5000, skipNul = TRUE)

# Convert to latin1, replacing characters that cannot be mapped with a space
textTwitter <- iconv(textTwitter, from = "UTF-8", to = "latin1", sub = " ")
textBlogs   <- iconv(textBlogs, from = "UTF-8", to = "latin1", sub = " ")
textNews    <- iconv(textNews, from = "UTF-8", to = "latin1", sub = " ")

Below we consider some basic attributes of our raw data.

File Name          File Size (MB)  Description     No. of Characters Sampled  No. of Words Sampled  No. of Sentences Sampled
en_US.blogs.txt    210.16          Blog posts      1,148,089                  215,696               12,372
en_US.news.txt     205.81          News articles   1,015,886                  183,014               10,837
en_US.twitter.txt  167.11          Twitter tweets  340,766                    68,832                904
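
These figures can be reproduced roughly as follows; the word- and sentence-splitting patterns below are illustrative assumptions, so the exact counts may differ slightly.

# Sketch of the summary statistics above; the splitting rules are approximate
sampleStats <- function(x, path) {
  c(sizeMB     = round(file.size(path) / 1024^2, 2),
    characters = sum(nchar(x)),
    words      = sum(lengths(strsplit(x, "\\s+"))),
    sentences  = sum(lengths(strsplit(x, "[.!?]+"))))
}

rbind(en_US.blogs.txt   = sampleStats(textBlogs, "./data/final/en_US/en_US.blogs.txt"),
      en_US.news.txt    = sampleStats(textNews, "./data/final/en_US/en_US.news.txt"),
      en_US.twitter.txt = sampleStats(textTwitter, "./data/final/en_US/en_US.twitter.txt"))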

Corpus Creation and Processing

The sampled data is merged to create a corpus.

library(tm)
# Merge the three samples into a single tm corpus; paste() joins the i-th blog,
# news and twitter lines into one combined document
docs <- VCorpus(VectorSource(paste(textBlogs, textNews, textTwitter)))

Thereafter, we clean and process the corpus to make it ready for further analysis. For example, the text is converted to lowercase; numbers, punctuation, English stop words and profanity are removed; and the remaining words are stemmed.

# Helper transformers: replace a pattern with a space, or with arbitrary text
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
toOther <- content_transformer(function(x, pattern, y) gsub(pattern, y, x))

docs <- tm_map(docs, content_transformer(tolower))        # convert to lowercase

docs <- tm_map(docs, toSpace, "/|@|\\|")                  # replace /, @ and | with spaces

docs <- tm_map(docs, removeNumbers)                       # remove numbers
docs <- tm_map(docs, removePunctuation)                   # remove punctuation
docs <- tm_map(docs, removeWords, stopwords("english"))   # remove English stop words
docs <- tm_map(docs, stripWhitespace)                     # collapse repeated whitespace


swearWords <- readLines("./data/final/en_US/swearWords.txt", encoding = "UTF-8")  # read in list of swear words

docs <- tm_map(docs, removeWords, swearWords)   # remove profanity

docs <- tm_map(docs, stemDocument)              # stem the remaining words

# Rejoin contraction fragments (e.g. "don t") left behind by the cleaning steps
docs <- tm_map(docs, toOther, "don t", "dont")
docs <- tm_map(docs, toOther, "didn t", "didnt")
docs <- tm_map(docs, toOther, "can t", "cant")
docs <- tm_map(docs, toOther, "isn t", "isnt")
docs <- tm_map(docs, toOther, "won t", "wont")
docs <- tm_map(docs, toOther, "let s", "lets")
docs <- tm_map(docs, toOther, "doesn t", "doesnt")
docs <- tm_map(docs, toOther, "u u u", "")      # drop the recurring "u u u" sequence

Tokenization

Having processed our corpus, we now break the text into the desired chunks. For example, tokenizing into unigrams, bigrams and trigrams groups the data into sequences of one, two and three words, respectively.

library(RWeka)

# n-gram tokenizers for unigrams, bigrams and trigrams
tokenizer.u <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tokenizer.b <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tokenizer.t <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

options(mc.cores = 1)  # use a single core; RWeka tokenizers can fail under tm's parallel processing

dtm.u <- DocumentTermMatrix(docs, control = list(tokenize = tokenizer.u))
dtm.b <- DocumentTermMatrix(docs, control = list(tokenize = tokenizer.b))
dtm.t <- DocumentTermMatrix(docs, control = list(tokenize = tokenizer.t))

After tokenizing, we compute and plot the frequencies of unigrams, bigrams and trigrams.

# Aggregate n-gram counts across documents and sort in decreasing order of frequency
freq.u <- sort(colSums(as.matrix(dtm.u)), decreasing = TRUE)
freq.b <- sort(colSums(as.matrix(dtm.b)), decreasing = TRUE)
freq.t <- sort(colSums(as.matrix(dtm.t)), decreasing = TRUE)

# Build word/frequency data frames for plotting
wf.u <- data.frame(words = names(freq.u), freq = freq.u)
wf.b <- data.frame(words = names(freq.b), freq = freq.b)
wf.t <- data.frame(words = names(freq.t), freq = freq.t)
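
A minimal sketch of one of the frequency plots mentioned above, assuming the ggplot2 package, for the 20 most frequent unigrams (the bigram and trigram plots follow the same pattern):

library(ggplot2)

# Bar chart of the 20 most frequent unigrams; wf.b and wf.t can be plotted the same way
ggplot(head(wf.u, 20), aes(x = reorder(words, -freq), y = freq)) +
  geom_col() +
  labs(x = "Unigram", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))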

Modeling

Theoretically, we can think of the probability of the next word as conditioned on all of the words in the sentence leading up to it.

\[P(word_i|word_1 word_2...word_{i-1})\]

However, this approach is too computationally expensive. A practical solution is to make a Markov assumption, in which the probability is conditioned only on the last k words. For example, for a trigram model we assume the following.

\[P(word_i|word_1 word_2...word_{i-1})\approx P(word_i|word_{i-2} word_{i-1})\]

To compute the n-gram model probabilities, we can use maximum likelihood estimation (MLE). In the case of a trigram model, we would compute as follows:

\[P(word_3|word_1 word_2)=\frac{Count(word_1,word_2,word_3)}{Count(word_1,word_2)}\]
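
Using the frequency vectors computed above, a hedged sketch of this estimate in R might look as follows; the function name and arguments are illustrative, and the words passed in must match the processed corpus (lowercased and stemmed).

# MLE estimate of P(w3 | w1 w2) from the trigram and bigram counts computed above;
# returns NA when the bigram (w1, w2) was never observed in the sample
trigramProb <- function(w1, w2, w3) {
  triCount <- freq.t[paste(w1, w2, w3)]
  biCount  <- freq.b[paste(w1, w2)]
  if (is.na(biCount) || biCount == 0) return(NA)
  unname(ifelse(is.na(triCount), 0, triCount) / biCount)
}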

Thereafter, we’ll have to contend with rare/unseen words and spell correction.
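
One standard option for handling unseen n-grams, for example, is add-one (Laplace) smoothing, where V denotes the vocabulary size:

\[P_{Laplace}(word_3|word_1 word_2)=\frac{Count(word_1,word_2,word_3)+1}{Count(word_1,word_2)+V}\]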