The purpose is to implement an algorithm that predict an upcoming word given the first words of a sentence.
The algorithm will predict a list of potential words based on a corpus.
This corpus is built with 3 inputs files.
The contents come from twitter, blogs or news.
The files are huge :
A first step is loading the raw data and count the words and the lines.
This count is made with raw data.
fileTwitter <- "D:/Rprog/11 Capstone Project/final/en_US/en_US.twitter.txt"
fileNews <- "D:/Rprog/11 Capstone Project/final/en_US/en_US.news.txt"
fileBlogs <- "D:/Rprog/11 Capstone Project/final/en_US/en_US.blogs.txt"
conn <- file(fileTwitter,open="r")
twitter <- readLines(conn)
close(conn)
conn <- file(fileNews,open="r")
news <- readLines(conn)
close(conn)
conn <- file(fileBlogs,open="r")
blogs <- readLines(conn)
close(conn)
grossWordCount <- c(sum(str_count(twitter,"\\s+")),sum(str_count(news,"\\s+")),sum(str_count(blogs,"\\s+")))
grossLineCount <- c(length(twitter),length(news),length(blogs))
rawDataSummary <- data.frame(grossWordCount,grossLineCount)
row.names(rawDataSummary) <- c("twitter","news","blogs")
kable(rawDataSummary)
| grossWordCount | grossLineCount | |
|---|---|---|
| 28013847 | 2360148 | |
| news | 2566713 | 77259 |
| blogs | 36435162 | 899288 |
data <- preprocess(concatenate(twitter,news,blogs),remove.punct = TRUE,fix.spacing = TRUE)
words = strsplit(data, " ", fixed = T)[[1]]
freqs = as.data.frame(table(words))
freqs <- freqs[with(freqs, order(-Freq)),]
ggplot(head(freqs,25), aes(x=reorder(words, Freq), y=Freq)) +
geom_bar(stat="identity", fill="#00308f") +
coord_flip() +
theme_light(base_size=18) +
xlab("") +
ylab("The 25 most frequent words ") +
theme(plot.title=element_text(size=18))
At the first sight, the model are going to use the chain rule of probability.
The probalility of a given sentence is : \(P(word_{1}word_{2} ... word_{n-1})=P(word_{1}) \times P(word_{2}|word_{1}) \times ... \times P(word_n|word_{1}word_{2} ... word_{n-1})\)
The problem is to estimate the probability of a word given the (n-1) previous words
\(P(word_n|word_{1}word_{2} ... word_{n-1})\)
At this step, the problem can be simplified with the Markow assumption :
The probality of a world depends only on a the k previous world.
We can estimate the probability of a word given the (n-1) previous word, by the probability of a word given the (k = 1, 2 ,3 < n-1 ) previous word
\(P(word_n|word_{1}word_{2} ... word_{n-1}) is estimate by P(word_n|word_{1}word_{2} ... word_{k})\)
We can introduce n-grams :
A n-gram is a contiguous sequence of n items from a given sequence of text or speech.
With 2-grams \(P(word_n|word_{1}word_{2} ... word_{n-1}) \equiv P(word_n|word_{n-1})\) and \(P(word_n|word_{n-1}) = \frac{NumberOf2Grams(word_{n-1}word_{n})}{NumberOf2Grams(word_{n-1}word_{any})}\)
With 3-grams \(P(word_n|word_{1}word_{2} ... word_{n-1}) \equiv P(word_n|word_{n-2}word_{n-1})\) and \(P(word_n|word_{n-2}word_{n-1}) = \frac{NumberOf3Grams(word_{n-2}word_{n-1}word_{n})}{NumberOf3Grams(word_{n-2}word_{n-1}word_{any})}\)
At this point, questions remain :