Word Prediction - A first data analysis

Synopsis

The purpose is to implement an algorithm that predict an upcoming word given the first words of a sentence.
The algorithm will predict a list of potential words based on a corpus.
This corpus is built with 3 inputs files.
The contents come from twitter, blogs or news.

A first exploration of the corpus

The files are huge :

Twitter : 159 Mo
News : 196 Mo
Blogs : 200 Mo

A first step is loading the raw data and count the words and the lines.
This count is made with raw data.

fileTwitter <- "D:/Rprog/11 Capstone Project/final/en_US/en_US.twitter.txt"
fileNews <- "D:/Rprog/11 Capstone Project/final/en_US/en_US.news.txt"
fileBlogs <- "D:/Rprog/11 Capstone Project/final/en_US/en_US.blogs.txt"

conn <- file(fileTwitter,open="r")
twitter <- readLines(conn)
close(conn)

conn <- file(fileNews,open="r")
news <- readLines(conn)
close(conn)

conn <- file(fileBlogs,open="r")
blogs <- readLines(conn)
close(conn)

grossWordCount <- c(sum(str_count(twitter,"\\s+")),sum(str_count(news,"\\s+")),sum(str_count(blogs,"\\s+")))
grossLineCount <- c(length(twitter),length(news),length(blogs))
rawDataSummary <- data.frame(grossWordCount,grossLineCount)
row.names(rawDataSummary) <- c("twitter","news","blogs")
kable(rawDataSummary)

	grossWordCount	grossLineCount
twitter	28013847	2360148
news	2566713	77259
blogs	36435162	899288

Preprocess the raw data and count words frequencies with ngram package

data <- preprocess(concatenate(twitter,news,blogs),remove.punct = TRUE,fix.spacing = TRUE)
words = strsplit(data, " ", fixed = T)[[1]]
freqs = as.data.frame(table(words))
freqs <- freqs[with(freqs, order(-Freq)),]

ggplot(head(freqs,25), aes(x=reorder(words, Freq), y=Freq)) +
  geom_bar(stat="identity", fill="#00308f") +
  coord_flip() + 
  theme_light(base_size=18) +
  xlab("") +
  ylab("The 25 most frequent words ") + 
  theme(plot.title=element_text(size=18))

The model

At the first sight, the model are going to use the chain rule of probability.
The probalility of a given sentence is : \(P(word_{1}word_{2} ... word_{n-1})=P(word_{1}) \times P(word_{2}|word_{1}) \times ... \times P(word_n|word_{1}word_{2} ... word_{n-1})\)

The problem is to estimate the probability of a word given the (n-1) previous words
\(P(word_n|word_{1}word_{2} ... word_{n-1})\)

At this step, the problem can be simplified with the Markow assumption :

The probality of a world depends only on a the k previous world.

We can estimate the probability of a word given the (n-1) previous word, by the probability of a word given the (k = 1, 2 ,3 < n-1 ) previous word

\(P(word_n|word_{1}word_{2} ... word_{n-1}) is estimate by P(word_n|word_{1}word_{2} ... word_{k})\)

We can introduce n-grams :

A n-gram is a contiguous sequence of n items from a given sequence of text or speech.

With 2-grams \(P(word_n|word_{1}word_{2} ... word_{n-1}) \equiv P(word_n|word_{n-1})\) and \(P(word_n|word_{n-1}) = \frac{NumberOf2Grams(word_{n-1}word_{n})}{NumberOf2Grams(word_{n-1}word_{any})}\)

With 3-grams \(P(word_n|word_{1}word_{2} ... word_{n-1}) \equiv P(word_n|word_{n-2}word_{n-1})\) and \(P(word_n|word_{n-2}word_{n-1}) = \frac{NumberOf3Grams(word_{n-2}word_{n-1}word_{n})}{NumberOf3Grams(word_{n-2}word_{n-1}word_{any})}\)

Current issues

At this point, questions remain :

How to process so huge file ! My first attempts to process 2grams didn’t finish with “ngram” or “tm” packages, I had to kill the process.
In the model, what about the words that are unknown in the corpus ?
Is the corpus enough heterogeneous ? Is it representative of the natural language ? How to evaluate this bias ?
How to take in account in the algorithm the context, the sense of the phrase ? If the sentence is “I like swimming in the …” ,propose “garage” because “in the garage” is the most frequent 3-grams is not very helpful.