This work summarizes the data for the final project of the Data Science Specialization. The report presents the content of the files, basic statistics describing the text data, and its graphical representation. The final section also discusses the predictive algorithm I am going to deploy in the final application.
The data provided for the project contains 4 folders, one per language (English, German, Finnish, Russian), each holding 3 text files. These files contain text collected from Twitter, blogs, and news articles.
Folders:
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
Content of the en_US folder (file sizes in bytes):
## size isdir
## ./final/en_US/en_US.blogs.txt 210160014 FALSE
## ./final/en_US/en_US.twitter.txt 167105323 FALSE
## ./final/en_US/en_US.news.txt 205811888 FALSE
Total line count:
length(twitter.vc)
## [1] 2360148
length(news.vc)
## [1] 1010242
length(blogs.vc)
## [1] 899288
To split the data into words and n-grams I used the NGramTokenizer() function from the RWeka package. The overall word count for each file is as follows:
length(twitter.words)
## [1] 18255256
length(news.words)
## [1] 20254884
length(blogs.words)
## [1] 20135119
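The word vectors were obtained roughly like this (a sketch; the exact Weka_control settings are an assumption):

library(RWeka)
# min = max = 1 yields single words; larger values would yield n-grams instead.
twitter.words <- NGramTokenizer(twitter.vc, Weka_control(min = 1, max = 1))
news.words    <- NGramTokenizer(news.vc,    Weka_control(min = 1, max = 1))
blogs.words   <- NGramTokenizer(blogs.vc,   Weka_control(min = 1, max = 1))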
The top five most frequent words in the English files are as follows:
## news.txt twitter.txt blogs.txt
## 1 said just one
## 2 will like will
## 3 one get can
## 4 new love just
## 5 can good like
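These tables can be produced with a plain frequency count of the lower-cased tokens (a sketch; the cleaning steps, such as stop-word removal, are assumptions based on the words that appear):

library(tm)
# Count word occurrences after lower-casing and dropping common stop words,
# then keep the five most frequent terms.
top5 <- function(words) {
  words <- tolower(words)
  words <- words[!(words %in% stopwords("en"))]
  names(sort(table(words), decreasing = TRUE))[1:5]
}
top5(news.words)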
For further analysis I will use samples of the data due to the limits of my physical memory. Each sample is then stored in a corpus, a structure that contains the text together with its associated metadata.
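A sketch of the sampling and corpus construction with the tm package (the 1% sample size and the cleaning steps are assumptions):

library(tm)
set.seed(1234)                                    # reproducible sampling
sample.text <- c(sample(twitter.vc, round(length(twitter.vc) * 0.01)),
                 sample(news.vc,    round(length(news.vc)    * 0.01)),
                 sample(blogs.vc,   round(length(blogs.vc)   * 0.01)))
corpus <- VCorpus(VectorSource(sample.text))      # text plus per-document metadata
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)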
Another interesting thing to explore is the association and dependence between words. For this we need to build a Document-Term Matrix (DTM), a table whose rows (Documents) are the available texts and whose columns (Terms) are the words or expressions (n-grams) encountered in those documents. Good functionality for working with DTMs can be found in the tm package.
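A sketch of building the DTM from the sampled corpus; passing an RWeka tokenizer through the control list is an assumption about how the multi-word terms shown below were obtained:

# Tokenizer that produces 1- to 3-grams, so the matrix contains
# single words as well as short phrases.
ngram.tok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
dtm <- DocumentTermMatrix(corpus, control = list(tokenize = ngram.tok))
dim(dtm)   # rows = sampled documents, columns = terms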
For instance, if we take one of the most frequent words in the twitter.txt data, we can find the words and phrases most associated with it (based on the correlation of their frequencies):
## love
## i lov 0.52
## love you 0.31
## love it 0.28
## love to 0.27
## love and 0.25
## a lov 0.24
## love ar 0.24
## love th 0.23
## love lov 0.22
## i love it 0.20
## i love you 0.20
## in lov 0.20
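The list above can be reproduced with tm's findAssocs(), which returns the terms whose per-document frequencies correlate with the given term above a threshold (the 0.2 cut-off is an assumption inferred from the output):

findAssocs(dtm, "love", 0.2)   # terms correlated with "love" at 0.2 or more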
We can also look at associations more generally using a graph representation of the links between words:
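One way to draw such a graph is to connect terms whose frequency correlation exceeds a threshold (a sketch using igraph; the original figure may have used a different package, and the frequency and correlation thresholds are assumptions):

library(igraph)
m <- as.matrix(dtm)[, findFreqTerms(dtm, lowfreq = 50)]   # keep only frequent terms
ctab <- cor(m)                                            # term-term correlations
ctab[is.na(ctab)] <- 0
diag(ctab) <- 0                                           # drop self-correlations
adj <- 1 * (ctab > 0.2)                                    # keep only strong links
g <- graph_from_adjacency_matrix(adj, mode = "undirected")
plot(g, vertex.size = 5, vertex.label.cex = 0.8)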
Moving away from correlation and linear frequency relationships, we may also want to look at the distribution of the frequencies:
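A sketch of how such frequency distributions can be compared (the choice of words and of a density plot is an assumption about the original figure):

love.freq <- as.matrix(dtm)[, "love"]                # per-document counts of "love"
get.freq  <- as.matrix(dtm)[, "get"]                 # per-document counts of "get"
plot(density(get.freq), main = "Per-document word frequency", xlab = "count in document")
lines(density(love.freq), lty = 2)
legend("topright", legend = c("get", "love"), lty = c(1, 2))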
Some of the words clearly have heavier tails than others - get is generally much more frequent than love, even though the two have comparable overall counts. We can also visually check the relationship between words with scatterplots.
Word pairs whose scatterplots show points concentrated in the top-right and bottom-left corners tend to appear together in documents a lot, while pairs whose points concentrate in the center appear together only moderately often.
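A sketch of such scatterplots for a handful of frequent words (the particular word set is an assumption):

words <- c("love", "get", "just", "like")
pairs(as.matrix(dtm)[, words], main = "Per-document counts")   # pairwise scatterplots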
In the final project I will attempt to create a predictive algorithm based on Katz's back-off model with Good-Turing smoothing. The intuition of this method is as follows: the best prediction for the next word is the one with the highest estimated likelihood of being observed together with the n previous words. This likelihood is then smoothed with the Good-Turing algorithm. If the n-gram has not been observed, we back off to the likelihood based on the (n-1) previous words.
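A minimal sketch of the back-off lookup itself, leaving out the Good-Turing discounting; the frequency tables ngram3, ngram2 and ngram1 are assumed to be named count vectors built from the corpus (with names like "i love you"), and all names here are hypothetical:

# Given the two preceding words, try the trigram table first,
# then back off to bigrams, and finally to plain unigram frequencies.
predict.next <- function(w1, w2, ngram3, ngram2, ngram1) {
  cand <- ngram3[grep(paste0("^", w1, " ", w2, " "), names(ngram3))]
  if (length(cand) == 0)
    cand <- ngram2[grep(paste0("^", w2, " "), names(ngram2))]
  if (length(cand) == 0)
    cand <- ngram1
  best <- names(cand)[which.max(cand)]      # highest-count matching n-gram
  tail(strsplit(best, " ")[[1]], 1)         # return its last word
}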