The goal of this analysis is to explore and understand the data from three different text sources, Twitter, blogs and news, in order to get an idea of an algorithm for predicting the next word of a text.
We will process the data using some techniques common in Natural Language Processing to build a better model and summary of the data.
The steps we will use are, in short: clean and tokenize the text, then count word frequencies. Since our prediction analysis will be based on a Markov chain, we will also compute frequencies of combined words using n-grams.
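As a rough sketch of the cleaning part of that pipeline, using the tm package from the references (the directory path and the exact choice of cleaning steps are assumptions for illustration):

library(tm)

# Build a corpus from the three raw text files (the path is an assumption)
docs <- VCorpus(DirSource("data/final/en_US/"))

# Typical tm cleaning steps: lower-case, strip punctuation and numbers,
# drop English stop words, and squeeze extra whitespace
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)  # stemming, as in the references (needs SnowballC)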
After loading the raw data, we can see that the three sources have different sizes:
Twitter: 2360148 lines
Blog: 899288 lines
News: 1010242 lines
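Those counts are just the number of lines per file; a minimal way to reproduce them in R, assuming the standard file names from the dataset:

# Read each source file; skipNul avoids embedded-NUL warnings in this dataset
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)

length(twitter)  # 2360148
length(blogs)    # 899288
length(news)     # 1010242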
Since our goal here is just to explore the data, and the full dataset is quite large (about 4 million lines) and takes a very long time to process, we will work with just a 5% subset of the data. This way we can get a good feel for the data and develop ideas for our algorithm much faster.
Our subset of the data looks like this:
Twitter: 118007 lines
Blog: 44964 lines
News: 50512 lines
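One simple way to draw such a subset (the seed and the helper function are assumptions for illustration; note that floor(0.05 * 2360148) gives exactly the 118007 lines above):

set.seed(42)  # assumed seed, only to make the sample reproducible

# Keep a random 5% of the lines from each source
take_sample <- function(lines, p = 0.05) sample(lines, floor(p * length(lines)))

twitter_s <- take_sample(twitter)
blogs_s   <- take_sample(blogs)
news_s    <- take_sample(news)

length(twitter_s)  # 118007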
After applying the processing described above, let's take a look at the most common single words from each source:
Number of distinct sparse words: 410
Most common words:
Number of distinct sparse words: 1188
Most common words:
Number of distinct sparse words: 1177
Most common words:
And now, let’s see them combined:
Number of distinct sparse words: 713
Most common words:
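The counts above can be read off a term-document matrix built from the cleaned corpus, and the most common words drawn with the wordcloud package from the references; a sketch, where docs is the corpus from before and the 0.99 sparsity threshold is an assumption:

library(tm)
library(wordcloud)

# Term frequencies over the cleaned corpus
tdm   <- TermDocumentMatrix(docs)
tdm   <- removeSparseTerms(tdm, 0.99)  # drop very rare terms; threshold assumed
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

length(freqs)    # number of distinct words kept
head(freqs, 10)  # most common words
wordcloud(names(freqs), freqs, max.words = 50)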
And since this will be our working dataset, let's plot the top 25 frequencies of two- and three-word combinations (bi-grams and tri-grams).
Number of distinct sparse bi-grams: 351
Most common Bi-grams:
Number of distinct sparse tri-grams: 129
Most common Tri-grams:
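A sketch of how those n-gram counts and plots can be produced, pairing tm with an RWeka n-gram tokenizer and ggplot2 (the use of RWeka and the plot details are assumptions; set min and max to 3 for tri-grams):

library(tm)
library(RWeka)    # provides NGramTokenizer (an assumption; other tokenizers work too)
library(ggplot2)

# Tokenize the cleaned corpus into bi-grams
bigrams <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2    <- TermDocumentMatrix(docs, control = list(tokenize = bigrams))
freq2   <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)

# Plot the 25 most frequent bi-grams
top25 <- data.frame(ngram = names(freq2)[1:25], freq = freq2[1:25])
ggplot(top25, aes(x = reorder(ngram, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "bi-gram", y = "frequency")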
For predictions, the algorithm will be based on the Markov chain assumption.
Our goal is to predict the probability of an upcoming word given the previous words, as in P(w5|w1,w2,w3,w4).
Since we will never have enough data to estimate probabilities over whole sentences, the Markov assumption simplifies this to condition only on the last few words of the sentence, as P(w5|w4), or perhaps P(w5|w3,w4), P(w5|w2,w3,w4) and so on. These conditional probabilities can be estimated directly from the n-gram counts, for example P(w5|w4) ≈ count(w4 w5) / count(w4).
So I will construct an algorithm that, based on the last word (unigram) or the last n words (n-grams), will try to predict the next one.
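As a minimal sketch of that idea, here is a frequency look-up with a simple back-off from tri-grams to bi-grams (the function, its inputs and the back-off rule are illustrative assumptions, not the final algorithm):

# freq2 and freq3 are named count vectors of bi-grams and tri-grams,
# built as in the previous step, e.g. freq3["one of the"] is a count
predict_next <- function(phrase, freq3, freq2) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  n <- length(words)

  # First try tri-grams whose first two words match the end of the phrase
  if (n >= 2) {
    prefix <- paste0(words[n - 1], " ", words[n], " ")
    hits <- freq3[startsWith(names(freq3), prefix)]
    if (length(hits) > 0)
      return(tail(strsplit(names(hits)[which.max(hits)], " ")[[1]], 1))
  }

  # Back off to bi-grams whose first word matches the last word of the phrase
  hits <- freq2[startsWith(names(freq2), paste0(words[n], " "))]
  if (length(hits) > 0)
    return(tail(strsplit(names(hits)[which.max(hits)], " ")[[1]], 1))

  NA_character_  # nothing matched
}

Called as, say, predict_next("thanks for the", freq3, freq2), it returns the word that most often followed "for the" in the sample, falling back to the last word alone when the tri-gram table has no match.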
For a detailed explanation with all the math included, you can refer to the Stanford University Natural Language Processing course (see the language modeling slides in the references below).
tm package: http://CRAN.R-project.org/package=tm
ggplot2: http://ggplot2.org/
wordcloud package: https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf
Stemming: https://en.wikipedia.org/wiki/Stemming
N-gram: https://en.wikipedia.org/wiki/N-gram
Markov chain: https://en.wikipedia.org/wiki/Markov_chain
Stanford NLP language modeling slides: http://spark-public.s3.amazonaws.com/nlp/slides/languagemodeling.pdf
Togaware, Text Mining in R (one-pager): http://onepager.togaware.com/TextMiningO.pdf