Abstract

This report is part of the Coursera Data Science Specialization Capstone Project. The goal of the project is to develop a predictive data product that, given a text phrase, predicts the words most likely to follow. This report explains my exploratory analysis and my goals for the application and the algorithm.

Data acquisition and cleaning

The project uses the training text from a corpus called HC Corpora. I use the Russian text in the files ru_RU.blogs.txt, ru_RU.news.txt and ru_RU.twitter.txt, which originates from blogs, news feeds and Twitter messages, respectively.

Data loading

I downloaded the corpus manually from the Capstone Dataset. The line, word and byte counts of each file, obtained at the operating system level, follow:

$ wc *.txt
  337100 9691167 116855835 ru_RU.blogs.txt
  196360 9416099 118996424 ru_RU.news.txt
  881414 9542485 105182346 ru_RU.twitter.txt
 1414874 28649751 341034605 total

The corpus contains about 29 million words. I then read each file into its own variable.

# Passing the path lets readLines() open and close the file itself.
# The corpus files are assumed to be UTF-8 encoded.
blogs_fn <- "~/Coursera/Capstone/final/ru_RU/ru_RU.blogs.txt"
blogs_lines <- readLines(blogs_fn, encoding = "UTF-8")
news_fn <- "~/Coursera/Capstone/final/ru_RU/ru_RU.news.txt"
news_lines <- readLines(news_fn, encoding = "UTF-8")
twitter_fn <- "~/Coursera/Capstone/final/ru_RU/ru_RU.twitter.txt"
twitter_lines <- readLines(twitter_fn, encoding = "UTF-8")

Each variable is a character vector whose elements are the text lines of the corresponding file. The first six lines of the blog file follow.

head(blogs_lines)
## [1] "Настало время и мне поделиться чем-нибудь сладким!!! Уже совсем скоро наступит Новый год, и поэтому моя конфека посвящается этому чудесному празднику!"                                                                                                                                                                                                                                                                                
## [2] "сама элегантность и выдержанность...."                                                                                                                                                                                                                                                                                                                                                                                                 
## [3] "Знаменитые дизайнеры, популярные магазины дамской одежды и известные модницы - все сейчас увлечены женственным и романтичным стилем 70-х, свободным и легким, цветочными паттернами, оборками и кружевами. И умеющей вязать даме нет необходимости тратить баснословные суммы, чтобы выглядеть стильно, женственно и современно - достаточно взять в руки крючок и самой сотворить небольшое вязаное платьице а-ля Диор или Феррагамо."
## [4] "Проверяем точно ли продавец высылает в Россию? Нажимаем на вкладку и читаем:"                                                                                                                                                                                                                                                                                                                                                          
## [5] "Я ухожу, но не прощаюсь,"                                                                                                                                                                                                                                                                                                                                                                                                              
## [6] "Ну как? Жду ваших замечаний..."

Tokenization

To work with words, I divide every text line in the datasets into tokens: words, numbers and punctuation symbols. Tokens are separated by white space. I also treat date and time strings such as “16/12/1961”, “14:28” and “2015-03-23” as numbers. I will not predict numbers or punctuation, but I will recognize them in the input, so I convert them into the tokens _num_ and _punct_. I also place _punct_ at the beginning of every line to mark the start of a phrase.

The text is also converted to lowercase so that the lowercase and uppercase versions of a word count as the same word. I also replace every “ё” with “е”, because the two letters are often used interchangeably.

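tokenize <- function(b) {
  b <- tolower(b)
  b <- gsub("ё", "е", b)
  # Collapse digit groups joined by - / : , . (decimals, times, dates) to the placeholder "0"
  b <- gsub("(\\-|\\+)*[[:digit:]]+(\\-|\\/|\\:|\\,|\\.)*[[:digit:]]+", "0", b)
  # Punctuation runs that do not start with a hyphen become _punct_
  b <- gsub("(?!\\-)[[:punct:]]+", " _punct_ ", b, perl = TRUE)
  # Protect hyphens inside words (e.g. "чем-нибудь"); other hyphens and dashes become _punct_
  b <- gsub("([[:alpha:]])\\-([[:alpha:]])", "\\1_defis_\\2", b)
  b <- gsub("\\-+", " _punct_ ", b)
  b <- gsub("\\—+", " _punct_ ", b)
  b <- gsub("_defis_", "-", b)
  # Remaining digit runs, including exponent notation, become _num_
  b <- gsub("[[:digit:]]+e[[:digit:]]+", " _num_ ", b)
  b <- gsub("[[:digit:]]+", " _num_ ", b)
  b <- gsub("[[:space:]]+", " ", b)
  # Mark the start of every line with _punct_ and return the result explicitly
  paste0("_punct_ ", b)
}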
blogs_lines <- tokenize(blogs_lines)
news_lines <- tokenize(news_lines)
twitter_lines <- tokenize(twitter_lines)

The first lines of the tokenized blog text follow.

head(blogs_lines)
## [1] "_punct_ настало время и мне поделиться чем-нибудь сладким _punct_ уже совсем скоро наступит новый год _punct_ и поэтому моя конфека посвящается этому чудесному празднику _punct_ "                                                                                                                                                                                                                                                                                                                                            
## [2] "_punct_ сама элегантность и выдержанность _punct_ "                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [3] "_punct_ знаменитые дизайнеры _punct_ популярные магазины дамской одежды и известные модницы _punct_ все сейчас увлечены женственным и романтичным стилем _num_ _punct_ х _punct_ свободным и легким _punct_ цветочными паттернами _punct_ оборками и кружевами _punct_ и умеющей вязать даме нет необходимости тратить баснословные суммы _punct_ чтобы выглядеть стильно _punct_ женственно и современно _punct_ достаточно взять в руки крючок и самой сотворить небольшое вязаное платьице а-ля диор или феррагамо _punct_ "
## [4] "_punct_ проверяем точно ли продавец высылает в россию _punct_ нажимаем на вкладку и читаем _punct_ "                                                                                                                                                                                                                                                                                                                                                                                                                           
## [5] "_punct_ я ухожу _punct_ но не прощаюсь _punct_ "                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## [6] "_punct_ ну как _punct_ жду ваших замечаний _punct_ "

Profanity filtering

I compiled a dictionary of Russian profanity (vulgar words) from the Russian vulgarities category page of Wiktionary. The dictionary contains 94 words. I will not predict those words, but I will recognize them in the input. For this reason I replace every occurrence of a profane word with the token _bad_.

fn <- "~/Coursera/Capstone/bad_ru.txt" # dictionary file
badwords <- readLines(fn, encoding = "UTF-8") # vector of bad words (file assumed to be UTF-8)
badwords <- tolower(gsub("ё", "е", badwords)) # same normalization as the corpus text
bw <- paste0("\\b(", paste(badwords, collapse="|"), ")\\b") # whole-word alternatives joined by "|"
blogs_lines <- gsub(bw, " _bad_ ", blogs_lines) # bad words substituted
news_lines <- gsub(bw, " _bad_ ", news_lines) # bad words substituted
twitter_lines <- gsub(bw, " _bad_ ", twitter_lines) # bad words substituted
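As an alternative to the regular expression above, the dictionary could also be applied after the text has been split into tokens, which avoids relying on word-boundary matching for Cyrillic text. The following is only a minimal sketch under that assumption: the helper name mask_bad_tokens and the toy placeholder words are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: replace dictionary words in an already-split token vector.
mask_bad_tokens <- function(tokens, badwords, replacement = "_bad_") {
  # Normalize the dictionary the same way the corpus text was normalized.
  badwords <- tolower(gsub("ё", "е", badwords))
  # Exact token match, so no regular expression is needed.
  tokens[tokens %in% badwords] <- replacement
  tokens
}

# Toy data: placeholder words stand in for real dictionary entries.
toy_tokens <- c("_punct_", "foo", "bar", "baz", "_punct_")
toy_bad <- c("Bar")
masked <- mask_bad_tokens(toy_tokens, toy_bad)
print(masked)
```

Exact matching on tokens only catches whole tokens, so it would not catch dictionary words embedded in hyphenated compounds.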

Splitting into tokens

I split the text lines into tokens. Each variable is a vector of tokens in their original order.

blogs_tokens <- unlist(strsplit(blogs_lines, " +"))
news_tokens <- unlist(strsplit(news_lines, " +"))
twitter_tokens <- unlist(strsplit(twitter_lines, " +"))

The first six tokens of the blog text:

head(blogs_tokens)
## [1] "_punct_"    "настало"    "время"      "и"          "мне"       
## [6] "поделиться"

Exploratory analysis

The goal of the exploratory analysis is to understand the main relationships in the data and to prepare for building predictive language models. I perform the analysis on the blogs, news and Twitter data combined.

Frequency distribution of tokens

For every token (excluding _punct_) I calculate its relative frequency in the text and sort the distinct tokens by decreasing frequency.

The most frequent tokens and the cumulative frequency plot follow. The plot also shows the frequency of stemmed words, with horizontal lines at cumulative frequencies of 0.5 and 0.9.

## 
##           в           и       _num_          на          не           с 
## 0.033348183 0.029465053 0.019278218 0.017606984 0.017557922 0.011302557 
##         что           я           а          по         это         как 
## 0.011221159 0.008789420 0.008788987 0.007541758 0.006198246 0.006134046 
##           –           у         все          но          за         для 
## 0.005625537 0.004987001 0.004857661 0.004804913 0.004385208 0.004241308 
##           к          из 
## 0.004196292 0.003998850

About 1000 unique words are needed to cover 50% of all word instances, and about 30000 to cover 90%. With stemmed words, about one third as many are needed to reach the same coverage.
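The code that produced the frequency table is not shown in this report. The sketch below shows one way such relative frequencies and coverage figures could be computed with base R. The toy token vector and the helper name tokens_for_coverage are assumptions made for illustration; the real analysis would use the combined token vectors.

```r
# Toy token vector; the report would use the combined blogs, news and twitter tokens.
tokens <- c("_punct_", "a", "b", "a", "_punct_", "c", "a", "b", "_punct_", "d")

# Drop the phrase-boundary marker, then count each distinct token.
words <- tokens[tokens != "_punct_"]
freq <- sort(table(words), decreasing = TRUE)

# Relative and cumulative relative frequency, most frequent token first.
rel_freq <- freq / sum(freq)
cum_freq <- cumsum(as.numeric(rel_freq))

# Hypothetical helper: smallest number of top tokens whose cumulative
# frequency reaches the requested coverage.
tokens_for_coverage <- function(cum_freq, coverage) {
  which(cum_freq >= coverage)[1]
}

n50 <- tokens_for_coverage(cum_freq, 0.5)
n90 <- tokens_for_coverage(cum_freq, 0.9)
print(c(n50 = n50, n90 = n90))
```

Plotting cum_freq against the token rank would give a cumulative frequency curve of the kind described above.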

Frequency distribution of 2-grams

The frequency distribution of pairs of consecutive tokens is the first step towards predicting the next word of a given phrase. I build the pairs of consecutive tokens from the training text. The most frequent pairs and the cumulative frequency plot follow.

##         a1      a2       N
## 1: _punct_ _punct_ 1382595
## 2: _punct_     что  233552
## 3: _punct_       а  231386
## 4: _punct_       в  200348
## 5:   _num_ _punct_  182022
## 6: _punct_       и  169213
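The counting step itself is not shown in this report. As a rough illustration, consecutive-token n-grams could be counted with base R as sketched below. The helper name count_ngrams and the toy data are hypothetical, and the column names a1, a2 and N simply mirror the table above. For a corpus of this size a faster grouping tool would probably be needed; this is a sketch, not the method actually used.

```r
# Hypothetical helper: count n-grams of consecutive tokens with base R.
count_ngrams <- function(tokens, n) {
  stopifnot(n >= 1, length(tokens) >= n)
  m <- length(tokens) - n + 1  # number of n-grams in the token vector
  # Column i holds the i-th token of every n-gram.
  cols <- lapply(seq_len(n), function(i) tokens[i:(i + m - 1)])
  names(cols) <- paste0("a", seq_len(n))
  grams <- as.data.frame(cols, stringsAsFactors = FALSE)
  grams$N <- 1L
  # Sum the unit counts within each distinct n-gram.
  counts <- aggregate(grams["N"], by = grams[names(cols)], FUN = sum)
  # Most frequent n-grams first.
  counts <- counts[order(-counts$N), ]
  rownames(counts) <- NULL
  counts
}

# Toy token vector with three short "lines", each starting with _punct_.
tokens <- c("_punct_", "a", "b", "_punct_", "a", "b", "_punct_", "a", "c")
bigrams <- count_ngrams(tokens, 2)
print(head(bigrams))
```

Because every line starts with _punct_, pairs that span a line break always end in _punct_, so no word is paired with a word from the next line.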

Frequency distribution of 3-grams

The frequency distribution of three consecutive tokens indicates how many of the most frequent trigrams should be kept for model development. I build the triplets of consecutive tokens from the training text. The most frequent triplets and the cumulative frequency plot follow.

##         a1      a2      a3      N
## 1: _punct_ _punct_ _punct_ 103458
## 2: _punct_   _num_ _punct_  77396
## 3: _punct_ _punct_       в  50235
## 4: _punct_ _punct_       а  42820
## 5: _punct_ _punct_   _num_  40290
## 6: _punct_ _punct_       я  34567

Conclusion

I can develop a predictive model from the distribution of trigrams. The idea of the algorithm is as follows. The input text is transformed into tokens. I then search for all trigrams that begin with the last two input tokens. The third tokens of the most frequent of these trigrams are the most probable words to propose to the user.
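The lookup described above could look roughly like the following sketch. It assumes a trigram count table with columns a1, a2, a3 and N, laid out like the 3-gram table in the previous section. The function name predict_next, its parameters and the toy counts are hypothetical. The special tokens are excluded from the suggestions because they are recognized in the input but not predicted.

```r
# Hypothetical helper: propose next words from a trigram count table with
# columns a1, a2, a3 (tokens) and N (count).
predict_next <- function(trigrams, phrase_tokens, k = 3,
                         skip = c("_punct_", "_num_", "_bad_")) {
  stopifnot(length(phrase_tokens) >= 2)
  last_two <- tail(phrase_tokens, 2)
  # Trigrams that start with the last two input tokens.
  hits <- trigrams[trigrams$a1 == last_two[1] & trigrams$a2 == last_two[2], ]
  # Special tokens are recognized in the input but never proposed.
  hits <- hits[!(hits$a3 %in% skip), ]
  # Most frequent continuations first.
  hits <- hits[order(-hits$N), ]
  head(as.character(hits$a3), k)
}

# Toy trigram counts; real counts would come from the training corpus.
toy <- data.frame(
  a1 = c("i", "i", "i", "i", "we"),
  a2 = c("am", "am", "am", "am", "are"),
  a3 = c("here", "_punct_", "not", "fine", "here"),
  N  = c(5L, 9L, 7L, 2L, 4L),
  stringsAsFactors = FALSE
)
suggestions <- predict_next(toy, c("_punct_", "i", "am"), k = 2)
print(suggestions)
```

When the last two tokens never occur together in the table, the function returns an empty vector. A fuller model would need a fallback for that case, for example the bigram or single-token frequencies.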