This report is part of the Coursera Data Science Specialization Capstone Project. The goal of the project is to develop a predictive data product which, given a text phrase, predicts the words that follow. This report explains my exploratory analysis and my goals for the application and the algorithm.
The project uses training text from a corpus called HC Corpora. I use the Russian text in the files ru_RU.blogs.txt, ru_RU.news.txt and ru_RU.twitter.txt. This text originates from blogs, news feeds and Twitter messages respectively.
I downloaded the corpus manually from the Capstone Dataset. The counts of lines, words and bytes in every file, obtained at the operating system level, follow (plain wc reports bytes, not characters, in its third column):
$ wc *.txt
337100 9691167 116855835 ru_RU.blogs.txt
196360 9416099 118996424 ru_RU.news.txt
881414 9542485 105182346 ru_RU.twitter.txt
1414874 28649751 341034605 total
The corpus contains about 28.6 million whitespace-separated words. I then read the data into three variables.
Every variable is a character vector, each element being a text line from the corresponding file. The first lines of the blog file follow.
## [1] "Настало время и мне поделиться чем-нибудь сладким!!! Уже совсем скоро наступит Новый год, и поэтому моя конфека посвящается этому чудесному празднику!"
## [2] "сама элегантность и выдержанность...."
To work with words, I divide every text line in the datasets into tokens: words, numbers and punctuation symbols. Tokens are separated by white space. I also treat date and time strings such as “16/12/1961”, “14:28” and “2015-03-23” as numbers. I will not predict numbers and punctuation, but I will recognize them in the input, so I convert them into the tokens _num_ and _punct_. I also place _punct_ at the beginning of every line to mark the start of a phrase.
The text is also converted to lowercase so that the lowercase and uppercase versions of a word count as the same word. I also convert every “ё” to “е” because these two letters are often used interchangeably in writing.
The first lines of the tokenized blog text follow.
## [1] "_punct_ настало время и мне поделиться чем-нибудь сладким _punct_ уже совсем скоро наступит новый год _punct_ и поэтому моя конфека посвящается этому чудесному празднику _punct_ "
## [2] "_punct_ сама элегантность и выдержанность _punct_ "
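The normalization described above can be sketched roughly as follows. This is a minimal Python illustration, not the code used to produce the report: the regular expression, the function name `normalize_line` and the exact rules for what counts as a number or a punctuation run are all assumptions, chosen so that the two sample lines give the tokens shown.

```python
import re

# Assumed token pattern (not taken from the report):
#   1. digits, optionally joined by . / : , - (covers 16/12/1961, 14:28, 2015-03-23)
#   2. letters, optionally hyphenated (keeps "чем-нибудь" as one word)
#   3. a run of punctuation characters ("!!!" or "...." becomes one token)
TOKEN_RE = re.compile(
    r"\d+(?:[./:,-]\d+)*"
    r"|[^\W\d_]+(?:-[^\W\d_]+)*"
    r"|[^\w\s]+"
)

def normalize_line(line):
    """Return the list of normalized tokens for one line of raw text."""
    text = line.lower().replace("ё", "е")
    tokens = ["_punct_"]  # marks the beginning of the phrase
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        if tok[0].isdigit():
            tokens.append("_num_")
        elif tok[0].isalpha():
            tokens.append(tok)
        else:
            tokens.append("_punct_")
    return tokens

if __name__ == "__main__":
    print(" ".join(normalize_line("сама элегантность и выдержанность....")))
```

Classifying each match by its first character keeps the sketch short; a production tokenizer would need explicit decisions for mixed tokens such as "5-й" or e-mail addresses, which the report does not specify.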
I have compiled a dictionary of Russian profanity (vulgar words) from the Russian vulgarities category page of Wiktionary. The dictionary contains 94 words. I will not predict these words, but I will recognize them in the input. For this reason I convert every occurrence of a profane word into the token _bad_.
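As a small illustration of this replacement step, the following Python sketch masks dictionary words in a token list. The function name `mask_profanity` and the placeholder dictionary entry are hypothetical; the real 94-word list is not reproduced here.

```python
def mask_profanity(tokens, bad_words):
    """Replace every token found in the profanity dictionary with _bad_."""
    bad = set(bad_words)  # set lookup keeps the pass linear in the token count
    return ["_bad_" if tok in bad else tok for tok in tokens]

if __name__ == "__main__":
    # "грубость" stands in for a dictionary entry; it is only a placeholder.
    sample = ["_punct_", "какая", "грубость", "_punct_"]
    print(mask_profanity(sample, ["грубость"]))
```

Exact matching on lowercased tokens only catches the listed word forms; inflected forms would need to be listed separately or matched after stemming.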
I then split the text lines into tokens. Every variable is now a vector of tokens in the original order.
See first six tokens of the blog text:
## [1] "_punct_" "настало" "время" "и" "мне"
## [6] "поделиться"
The goal of the exploratory analysis is to understand the principal relations in the data and prepare to build predictive linguistic models. I perform the analysis on the blogs, news and twitter data grouped together.
For every token (excluding _punct_) I calculate its relative frequency in the text, and I order the distinct tokens by decreasing frequency.
The most frequent tokens and the cumulative frequency plot follow. The plot also shows the frequency of stemmed words, with horizontal lines at cumulative frequencies of 0.5 and 0.9.
##
## в и _num_ на не с
## 0.03334818 0.02946505 0.01927822 0.01760698 0.01755792 0.01130256
About 1000 unique words are needed to cover 50% of all word instances, and about 30000 to cover 90%. With stemmed words, about one third as many are needed to reach the same coverage.
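The coverage figures above come from reading the cumulative frequency curve. A minimal Python sketch of that calculation, assuming the tokens are already in one list, could look like this (the function name `words_needed_for_coverage` and its parameters are illustrative, not part of the original analysis):

```python
from collections import Counter

def words_needed_for_coverage(tokens, targets=(0.5, 0.9), exclude=("_punct_",)):
    """For each target share, return how many of the most frequent
    distinct tokens are needed to cover that share of all instances."""
    counts = Counter(tok for tok in tokens if tok not in exclude)
    total = sum(counts.values())
    remaining = sorted(targets)
    needed = {}
    cumulative = 0
    # Walk the distinct tokens from most to least frequent.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        cumulative += n
        while remaining and cumulative / total >= remaining[0]:
            needed[remaining.pop(0)] = rank
    return needed

if __name__ == "__main__":
    toy = ["в"] * 5 + ["и"] * 3 + ["на"] * 2 + ["_punct_"] * 4
    print(words_needed_for_coverage(toy))
```

On this toy input the single most frequent word already covers half of the ten counted instances, and all three distinct words are needed to reach 90%.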
The frequency distribution of pairs of consecutive tokens is a first step towards predicting the next words of a given phrase. I build the pairs of consecutive tokens from the training text, excluding the pair _punct_ _punct_. The most frequent pairs and the cumulative frequency plot follow.
## a1 a2 N
## 1: _punct_ что 233552
## 2: _punct_ а 231386
## 3: _punct_ в 200348
## 4: _num_ _punct_ 182022
## 5: _punct_ и 169213
## 6: _punct_ _num_ 143028
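Counting such pairs amounts to sliding a window of width two over the token vector. The sketch below is a Python illustration under that assumption; `count_bigrams` is a hypothetical name and the toy input is not taken from the corpus.

```python
from collections import Counter

def count_bigrams(tokens):
    """Count pairs of consecutive tokens, skipping _punct_ _punct_."""
    pairs = zip(tokens, tokens[1:])
    return Counter(p for p in pairs if p != ("_punct_", "_punct_"))

if __name__ == "__main__":
    toy = ["_punct_", "сама", "элегантность", "_punct_",
           "_punct_", "сама", "элегантность"]
    for (a1, a2), n in count_bigrams(toy).most_common():
        print(a1, a2, n)
```

The excluded pair arises naturally where one line ends with punctuation and the next line starts with its phrase-beginning marker, so dropping it removes an artifact of the line layout rather than real language data.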
The frequency distribution of triplets of consecutive tokens indicates how many of the most frequent trigrams should be included in the model. I build the triplets of consecutive tokens from the training text. The most frequent triplets and the cumulative frequency plot follow.
## a1 a2 a3 N
## 1: в _num_ год 15040
## 2: _punct_ у мен 14545
## 3: том _punct_ что 14004
## 4: то _punct_ что 13097
## 5: _punct_ что в 11172
## 6: _punct_ что он 10222
I can develop a predictive model from the distribution of trigrams. The idea of the algorithm is as follows. The input text is transformed into tokens. I then search for all trigrams that begin with the last two input tokens. The third tokens of the most frequent of these trigrams are the most probable words to propose to the user.
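A minimal Python sketch of this lookup is given below. It assumes the training text is already tokenized as described earlier; the names `build_trigram_index` and `predict_next`, the parameter `k`, and the choice to return nothing when fewer than two tokens are available are illustrative assumptions, not the final design of the application.

```python
from collections import Counter, defaultdict

# Tokens that are recognized in the input but never proposed to the user.
NOT_PREDICTED = {"_num_", "_punct_", "_bad_"}

def build_trigram_index(tokens):
    """Map each pair of consecutive tokens to a Counter of the tokens that follow it."""
    index = defaultdict(Counter)
    for a1, a2, a3 in zip(tokens, tokens[1:], tokens[2:]):
        index[(a1, a2)][a3] += 1
    return index

def predict_next(index, context, k=3):
    """Return up to k most frequent third tokens for the last two context tokens."""
    if len(context) < 2:
        return []
    followers = index.get(tuple(context[-2:]), Counter())
    ranked = [w for w, _ in followers.most_common() if w not in NOT_PREDICTED]
    return ranked[:k]

if __name__ == "__main__":
    toy = "_punct_ в _num_ году _punct_ в _num_ году _punct_ в _num_ г _punct_".split()
    index = build_trigram_index(toy)
    print(predict_next(index, ["_punct_", "в", "_num_"]))
```

When the last two tokens were never seen together in training, this sketch returns an empty list; falling back to the bigram or single-token frequencies computed earlier would be one natural way to handle that case.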