In this report we discuss the preliminary steps in building a text prediction model. We are given three data sets consisting of blog posts, news articles and tweets. A sample of each data set is read into memory and combined into a text corpus. We then examine unigram, bigram and trigram frequencies. Lastly, we outline the next steps towards building a language model.
The raw text data provided are large files of blog posts, news articles and tweets. A sampled subset of each file is read into memory and formatted.
# Sample the first 5,000 lines of each file, skipping embedded nul characters
textTwitter <- readLines("./data/final/en_US/en_US.twitter.txt", encoding="UTF-8", n=5000, skipNul=TRUE)
textBlogs   <- readLines("./data/final/en_US/en_US.blogs.txt", encoding="UTF-8", n=5000, skipNul=TRUE)
textNews    <- readLines("./data/final/en_US/en_US.news.txt", encoding="UTF-8", n=5000, skipNul=TRUE)
# Convert to Latin-1, substituting a space for any character that cannot be mapped
textTwitter <- iconv(textTwitter, from="UTF-8", to="latin1", sub=" ")
textBlogs   <- iconv(textBlogs, from="UTF-8", to="latin1", sub=" ")
textNews    <- iconv(textNews, from="UTF-8", to="latin1", sub=" ")
Below we consider some basic attributes of our raw data.
| File Name | File Size (MB) | Description | No. of Characters Sampled | No. of Words Sampled | No. of Sentences Sampled |
|---|---|---|---|---|---|
| en_US.blogs.txt | 210.16 | Blog posts | 1,148,089 | 215,696 | 12,372 |
| en_US.news.txt | 205.81 | News articles | 1,015,886 | 183,014 | 10,837 |
| en_US.twitter.txt | 167.11 | Twitter tweets | 340,766 | 68,832 | 904 |
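As a rough sketch of how such summary figures could be reproduced for one of the samples (file size taken from the full file on disk; character and word counts from the sampled lines; the whitespace split used for word counting is an assumption and only approximate):

fileSizeMB <- file.size("./data/final/en_US/en_US.blogs.txt") / 1024^2  # size of the full file in MB
nChars <- sum(nchar(textBlogs))                                         # characters in the sampled lines
nWords <- sum(lengths(strsplit(textBlogs, "\\s+")))                     # approximate word count (split on whitespace)
round(c(sizeMB = fileSizeMB, chars = nChars, words = nWords), 2)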
The sampled data is merged to create a corpus.
library(tm)
# Combine the three samples into a single character vector and build a volatile corpus
docs <- VCorpus(VectorSource(c(textBlogs, textNews, textTwitter)))
Thereafter, we clean and process the corpus to make it ready for further analysis. For example, the text is converted to lowercase; profanity, punctuation, numbers and English stop words are removed; and the remaining words are stemmed.
# Helper transformers: replace a pattern with a space, or with arbitrary replacement text
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
toOther <- content_transformer(function(x, pattern, y) gsub(pattern, y, x))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, toSpace, "/|@|\\|")   # replace /, @ and | with spaces
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
swearWords <- readLines("./data/final/en_US/swearWords.txt", encoding="UTF-8")  # read in list of swear words
docs <- tm_map(docs, removeWords, swearWords)
docs <- tm_map(docs, stemDocument)
# Repair contractions whose apostrophes were replaced by spaces during the encoding conversion
docs <- tm_map(docs, toOther, "don t", "dont")
docs <- tm_map(docs, toOther, "didn t", "didnt")
docs <- tm_map(docs, toOther, "can t", "cant")
docs <- tm_map(docs, toOther, "isn t", "isnt")
docs <- tm_map(docs, toOther, "won t", "wont")
docs <- tm_map(docs, toOther, "let s", "lets")
docs <- tm_map(docs, toOther, "doesn t", "doesnt")
docs <- tm_map(docs, toOther, "u u u", "")  # remove a recurring artifact left by the encoding conversion
Having processed our corpus, we tokenize the text into n-grams. Tokenizing into unigrams, bigrams and trigrams groups the text into sequences of one, two and three words, respectively.
library(RWeka)
# N-gram tokenizers for unigrams, bigrams and trigrams
tokenizer.u <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tokenizer.b <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tokenizer.t <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
options(mc.cores = 1)  # work-around: RWeka tokenizers can fail with tm's parallel processing on some platforms
dtm.u <- DocumentTermMatrix(docs, control = list(tokenize = tokenizer.u))
dtm.b <- DocumentTermMatrix(docs, control = list(tokenize = tokenizer.b))
dtm.t <- DocumentTermMatrix(docs, control = list(tokenize = tokenizer.t))
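As a quick check of the tokenization (a sketch; the threshold of 50 is arbitrary), tm's findFreqTerms() lists the terms occurring at least a given number of times:

findFreqTerms(dtm.b, lowfreq = 50)  # bigrams appearing at least 50 times in the sample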
After tokenizing, we compute and plot the frequencies of unigrams, bigrams and trigrams.
# Term frequencies for each n-gram size, sorted in decreasing order
freq.u <- sort(colSums(as.matrix(dtm.u)), decreasing = TRUE)
freq.b <- sort(colSums(as.matrix(dtm.b)), decreasing = TRUE)
freq.t <- sort(colSums(as.matrix(dtm.t)), decreasing = TRUE)
# Data frames pairing each n-gram with its frequency
wf.u <- data.frame(words = names(freq.u), freq = freq.u)
wf.b <- data.frame(words = names(freq.b), freq = freq.b)
wf.t <- data.frame(words = names(freq.t), freq = freq.t)
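One possible way to plot the frequencies (a sketch using ggplot2; the choice of the top 20 unigrams is arbitrary):

library(ggplot2)
top.u <- head(wf.u, 20)  # 20 most frequent unigrams
ggplot(top.u, aes(x = reorder(words, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency", title = "Most frequent unigrams in the sampled corpus")

Analogous plots can be produced from wf.b and wf.t for the bigram and trigram frequencies.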
In theory, the probability of the next word can be conditioned on all of the words in the sentence leading up to it.
\[P(word_i|word_1 word_2...word_{i-1})\]
However, this approach is too computationally expensive. A practical alternative is to make a Markov assumption, in which the probability is conditioned only on the last k words. For example, for a trigram model we can make the approximation below.
\[P(word_i|word_1 word_2...word_{i-1})\approx P(word_i|word_{i-2} word_{i-1})\]
To compute the n-gram probabilities, we can use maximum likelihood estimation (MLE). In the case of a trigram model, we would compute as follows:
\[P(word_3|word_1 word_2)=\frac{Count(word_1,word_2,word_3)}{Count(word_1,word_2)}\]
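As a hedged sketch of how this MLE could be computed from the counts above (trigramProb() is a hypothetical helper; it assumes the named frequency vectors freq.b and freq.t built earlier, and that the query words have been processed the same way as the corpus, e.g. lowercased and stemmed):

# MLE estimate of P(w3 | w1 w2) from the bigram and trigram counts
trigramProb <- function(w1, w2, w3) {
  tri <- paste(w1, w2, w3)
  bi  <- paste(w1, w2)
  triCount <- if (tri %in% names(freq.t)) freq.t[[tri]] else 0
  biCount  <- if (bi  %in% names(freq.b)) freq.b[[bi]]  else 0
  if (biCount == 0) return(NA)  # unseen context; will require smoothing or back-off
  triCount / biCount
}
# Example with hypothetical (stemmed) words:
# trigramProb("happi", "new", "year")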
Thereafter, we will have to contend with rare and unseen words (for example, through smoothing or back-off) and with spelling correction.