This work summarizes the data for the final project of the Data Science Specialization. The report presents the content of the files, basic statistics describing the text data, and its graphical representation. The final section also discusses the predictive algorithm I am going to deploy in the final application.
The data provided for the project contains 4 folders, one per language (English, German, Finnish, Russian), each holding 3 text files. These files contain text collected from Twitter, blogs, and news articles.
Folders:
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
Content of the en_US folder (file sizes in bytes):
## size isdir
## ./final/en_US/en_US.blogs.txt 210160014 FALSE
## ./final/en_US/en_US.twitter.txt 167105323 FALSE
## ./final/en_US/en_US.news.txt 205811888 FALSE
Total line count:
length(twitter.vc)
## [1] 2360148
length(news.vc)
## [1] 1010242
length(blogs.vc)
## [1] 899288
To split the data into words and n-grams I used the NGramTokenizer() function from the RWeka package. The overall word count for each file is as follows:
length(twitter.words)
## [1] 18255256
length(news.words)
## [1] 20254884
length(blogs.words)
## [1] 20135119
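The word vectors were obtained roughly like this (a sketch; the exact Weka_control settings are an assumption):

library(RWeka)
# min = max = 1 yields single words; larger values would yield n-grams instead.
twitter.words <- NGramTokenizer(twitter.vc, Weka_control(min = 1, max = 1))
news.words    <- NGramTokenizer(news.vc,    Weka_control(min = 1, max = 1))
blogs.words   <- NGramTokenizer(blogs.vc,   Weka_control(min = 1, max = 1))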
The top five most frequent words in the English files are as follows:
## news.txt twitter.txt blogs.txt
## 1 said just one
## 2 will like will
## 3 one get can
## 4 new love just
## 5 can good like
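These tables can be produced with a plain frequency count of the lower-cased tokens (a sketch; the cleaning steps, such as stop-word removal, are assumptions based on the words that appear):

library(tm)
# Count word occurrences after lower-casing and dropping common stop words,
# then keep the five most frequent terms.
top5 <- function(words) {
  words <- tolower(words)
  words <- words[!(words %in% stopwords("en"))]
  names(sort(table(words), decreasing = TRUE))[1:5]
}
top5(news.words)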
For further analysis I will use samples of the data due to the limits of my physical memory. Each sample is then stored in a corpus, a structure that contains the text together with its associated metadata.
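A sketch of the sampling and corpus construction with the tm package (the 1% sample size and the cleaning steps are assumptions):

library(tm)
set.seed(1234)                                    # reproducible sampling
sample.text <- c(sample(twitter.vc, round(length(twitter.vc) * 0.01)),
                 sample(news.vc,    round(length(news.vc)    * 0.01)),
                 sample(blogs.vc,   round(length(blogs.vc)   * 0.01)))
corpus <- VCorpus(VectorSource(sample.text))      # text plus per-document metadata
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)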
Another interesting thing to explore is the association and dependence between words. For this we need to build a Document-Term Matrix (DTM), a table whose rows (Documents) are the available texts and whose columns (Terms) are the words or expressions (n-grams) encountered in those documents. Good functionality for working with DTMs can be found in the tm package.
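A sketch of building the DTM from the sampled corpus; passing an RWeka tokenizer through the control list is an assumption about how the multi-word terms shown below were obtained:

# Tokenizer that produces 1- to 3-grams, so the matrix contains
# single words as well as short phrases.
ngram.tok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
dtm <- DocumentTermMatrix(corpus, control = list(tokenize = ngram.tok))
dim(dtm)   # rows = sampled documents, columns = terms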
For instance, if we take one of the most frequent words in the twitter.txt data, we can find the words and phrases most associated with it (based on the correlation of their frequencies):
## love
## i lov 0.52
## love you 0.31
## love it 0.28
## love to 0.27
## love and 0.25
## a lov 0.24
## love ar 0.24
## love th 0.23
## love lov 0.22
## i love it 0.20
## i love you 0.20
## in lov 0.20
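The list above can be reproduced with tm's findAssocs(), which returns the terms whose per-document frequencies correlate with the given term above a threshold (the 0.2 cut-off is an assumption inferred from the output):

findAssocs(dtm, "love", 0.2)   # terms correlated with "love" at 0.2 or more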
We can also look at associations more generally using a graph representation of the links between words:
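One way to draw such a graph is to connect terms whose frequency correlation exceeds a threshold (a sketch using igraph; the original figure may have used a different package, and the frequency and correlation thresholds are assumptions):

library(igraph)
m <- as.matrix(dtm)[, findFreqTerms(dtm, lowfreq = 50)]   # keep only frequent terms
ctab <- cor(m)                                            # term-term correlations
ctab[is.na(ctab)] <- 0
diag(ctab) <- 0                                           # drop self-correlations
adj <- 1 * (ctab > 0.2)                                    # keep only strong links
g <- graph_from_adjacency_matrix(adj, mode = "undirected")
plot(g, vertex.size = 5, vertex.label.cex = 0.8)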
Moving away from correlation and linear frequency relationships, we may also want to look at the distribution of the frequencies:
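A sketch of how such frequency distributions can be compared (the choice of words and of a density plot is an assumption about the original figure):

love.freq <- as.matrix(dtm)[, "love"]                # per-document counts of "love"
get.freq  <- as.matrix(dtm)[, "get"]                 # per-document counts of "get"
plot(density(get.freq), main = "Per-document word frequency", xlab = "count in document")
lines(density(love.freq), lty = 2)
legend("topright", legend = c("get", "love"), lty = c(1, 2))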
Some of the words clearly have heavier tails than others - get is generally much more frequent than love, even though the two have comparable overall counts. We can also visually check the relationship between words with scatterplots.
Word pairs whose scatterplots show points concentrated in the top-right and bottom-left corners tend to appear together in documents a lot, while pairs whose points concentrate in the center appear together only moderately often.
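A sketch of such scatterplots for a handful of frequent words (the particular word set is an assumption):

words <- c("love", "get", "just", "like")
pairs(as.matrix(dtm)[, words], main = "Per-document counts")   # pairwise scatterplots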
In the final project I will attempt to create a predictive algorithm based on Katz's back-off model with Good-Turing smoothing. The intuition of this method is as follows: the best prediction for the next word is the one with the highest estimated likelihood of being observed together with the n previous words. This likelihood is then smoothed with the Good-Turing algorithm. If the n-gram has not been observed, we back off to the likelihood based on the (n-1) previous words.
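A minimal sketch of the back-off lookup itself, leaving out the Good-Turing discounting; the frequency tables ngram3, ngram2 and ngram1 are assumed to be named count vectors built from the corpus (with names like "i love you"), and all names here are hypothetical:

# Given the two preceding words, try the trigram table first,
# then back off to bigrams, and finally to plain unigram frequencies.
predict.next <- function(w1, w2, ngram3, ngram2, ngram1) {
  cand <- ngram3[grep(paste0("^", w1, " ", w2, " "), names(ngram3))]
  if (length(cand) == 0)
    cand <- ngram2[grep(paste0("^", w2, " "), names(ngram2))]
  if (length(cand) == 0)
    cand <- ngram1
  best <- names(cand)[which.max(cand)]      # highest-count matching n-gram
  tail(strsplit(best, " ")[[1]], 1)         # return its last word
}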