The goal of this analysis is to explore and understand the data from three different text sources, Twitter, blogs and news, in order to get an idea of an algorithm for predicting the next word of a text.
We will process the data using some techniques common in Natural Language Processing to build a better model and summary of the data.
The steps we will use are, in short: clean and tokenize the text, then count word frequencies. Since our prediction analysis will be based on a Markov chain, we will also compute frequencies of combined words using n-grams.
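As a rough sketch of the cleaning part of that pipeline, using the tm package from the references (the directory path and the exact choice of cleaning steps are assumptions for illustration):

library(tm)

# Build a corpus from the three raw text files (the path is an assumption)
docs <- VCorpus(DirSource("data/final/en_US/"))

# Typical tm cleaning steps: lower-case, strip punctuation and numbers,
# drop English stop words, and squeeze extra whitespace
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)  # stemming, as in the references (needs SnowballC)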
After loading the raw data, we can see that the three sources have different sizes:
Twitter: 2360148 lines
Blog: 899288 lines
News: 1010242 lines
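Those counts are just the number of lines per file; a minimal way to reproduce them in R, assuming the standard file names from the dataset:

# Read each source file; skipNul avoids embedded-NUL warnings in this dataset
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)

length(twitter)  # 2360148
length(blogs)    # 899288
length(news)     # 1010242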
Since our goal here is just to explore the data, and the full dataset is quite large (about 4 million lines) and takes a very long time to process, we will work with just a 5% subset of the data. This way we can get a good feel for the data and develop ideas for our algorithm much faster.
Our subset of the data looks like this:
Twitter: 118007 lines
Blog: 44964 lines
News: 50512 lines
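One simple way to draw such a subset (the seed and the helper function are assumptions for illustration; note that floor(0.05 * 2360148) gives exactly the 118007 lines above):

set.seed(42)  # assumed seed, only to make the sample reproducible

# Keep a random 5% of the lines from each source
take_sample <- function(lines, p = 0.05) sample(lines, floor(p * length(lines)))

twitter_s <- take_sample(twitter)
blogs_s   <- take_sample(blogs)
news_s    <- take_sample(news)

length(twitter_s)  # 118007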
After applying the processing described above, let's take a look at the most common single words from each source:
Number of distinct sparse words: 410
Most common words:
Number of distinct sparse words: 1188
Most common words:
Number of distinct sparse words: 1177
Most common words:
And now, let’s see them combined:
Number of distinct sparse words: 713
Most common words:
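The counts above can be read off a term-document matrix built from the cleaned corpus, and the most common words drawn with the wordcloud package from the references; a sketch, where docs is the corpus from before and the 0.99 sparsity threshold is an assumption:

library(tm)
library(wordcloud)

# Term frequencies over the cleaned corpus
tdm   <- TermDocumentMatrix(docs)
tdm   <- removeSparseTerms(tdm, 0.99)  # drop very rare terms; threshold assumed
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

length(freqs)    # number of distinct words kept
head(freqs, 10)  # most common words
wordcloud(names(freqs), freqs, max.words = 50)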
And since this will be our working dataset, let's plot the top 25 frequencies of two- and three-word combinations (bi-grams and tri-grams).
Number of distinct sparse bi-grams: 351
Most common Bi-grams:
Number of distinct sparse tri-grams: 129
Most common Tri-grams:
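A sketch of how those n-gram counts and plots can be produced, pairing tm with an RWeka n-gram tokenizer and ggplot2 (the use of RWeka and the plot details are assumptions; set min and max to 3 for tri-grams):

library(tm)
library(RWeka)    # provides NGramTokenizer (an assumption; other tokenizers work too)
library(ggplot2)

# Tokenize the cleaned corpus into bi-grams
bigrams <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2    <- TermDocumentMatrix(docs, control = list(tokenize = bigrams))
freq2   <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)

# Plot the 25 most frequent bi-grams
top25 <- data.frame(ngram = names(freq2)[1:25], freq = freq2[1:25])
ggplot(top25, aes(x = reorder(ngram, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "bi-gram", y = "frequency")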
For predictions, the algorithm will be based on the Markov chain assumption.
Our goal is to predict the probability of an upcoming word given the previous words, as in P(w5|w1,w2,w3,w4).
Since we will never have enough data to estimate probabilities over whole sentences, the Markov assumption simplifies this to condition only on the last few words of the sentence, as P(w5|w4), or perhaps P(w5|w3,w4), P(w5|w2,w3,w4) and so on. These conditional probabilities can be estimated directly from the n-gram counts, for example P(w5|w4) ≈ count(w4 w5) / count(w4).
So I will construct an algorithm that, based on the last word (unigram) or the last n words (n-grams), will try to predict the next one.
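As a minimal sketch of that idea, here is a frequency look-up with a simple back-off from tri-grams to bi-grams (the function, its inputs and the back-off rule are illustrative assumptions, not the final algorithm):

# freq2 and freq3 are named count vectors of bi-grams and tri-grams,
# built as in the previous step, e.g. freq3["one of the"] is a count
predict_next <- function(phrase, freq3, freq2) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  n <- length(words)

  # First try tri-grams whose first two words match the end of the phrase
  if (n >= 2) {
    prefix <- paste0(words[n - 1], " ", words[n], " ")
    hits <- freq3[startsWith(names(freq3), prefix)]
    if (length(hits) > 0)
      return(tail(strsplit(names(hits)[which.max(hits)], " ")[[1]], 1))
  }

  # Back off to bi-grams whose first word matches the last word of the phrase
  hits <- freq2[startsWith(names(freq2), paste0(words[n], " "))]
  if (length(hits) > 0)
    return(tail(strsplit(names(hits)[which.max(hits)], " ")[[1]], 1))

  NA_character_  # nothing matched
}

Called as, say, predict_next("thanks for the", freq3, freq2), it returns the word that most often followed "for the" in the sample, falling back to the last word alone when the tri-gram table has no match.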
For a detailed explanation with all the math included, you can refer to the Stanford University Natural Language Processing course (see the language modeling slides in the references below).
tm package: http://CRAN.R-project.org/package=tm
ggplot2: http://ggplot2.org/
wordcloud package: https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf
Stemming: https://en.wikipedia.org/wiki/Stemming
N-gram: https://en.wikipedia.org/wiki/N-gram
Markov chain: https://en.wikipedia.org/wiki/Markov_chain
Stanford NLP language modeling slides: http://spark-public.s3.amazonaws.com/nlp/slides/languagemodeling.pdf
Togaware, Text Mining in R (one-pager): http://onepager.togaware.com/TextMiningO.pdf