(Hidden) Reading strings from all sources of data and writing them to the variable “strings”
Count of lines in “strings”:
## [1] 3336695
Overall count of words in all “strings” (lines):
## [1] 71099622
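A minimal sketch of how these counts might be obtained (the reading code itself is hidden above; the file names and the use of the stringi package are assumptions):
library(stringi)
# Assumed source files; adjust to the actual data location
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
strings <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))
length(strings)                 # count of lines
sum(stri_count_words(strings))  # overall count of words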
Word frequency graph (1-gram graph):
The most frequent 10% of words cover 98.3% of all text in the database; each of these words occurs 25 or more times in the corpus. Processing the full corpus is highly time- and memory-demanding.
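These coverage figures can be checked with a cumulative sum over a frequency-sorted word table; a sketch assuming a data frame word_freq_df (hypothetical name) with columns words and count, sorted by decreasing count:
top_10_percent <- ceiling(0.10 * nrow(word_freq_df))                   # number of words in the top 10%
sum(word_freq_df$count[1:top_10_percent]) / sum(word_freq_df$count)    # share of text covered (~0.983)
min(word_freq_df$count[1:top_10_percent])                              # least frequent word in the top 10% (~25)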
Let’s see what happens if we shorten our database 10 times (keeping only 10% of all phrases from the corpus). The graph below shows which portion of the words from the basic data set can also be found in the shortened data set. For example, of the first 10,000 words of the basic data set, more than 96% (i.e. about 9,600 words) can be found in the shortened data set.
We can conclude from the graph that if we reduce the initial corpus 10 times (randomly taking 10% of phrases from the initial corpus), we will lose less than 13% of the 45,000 most frequently used words. According to the article https://www.economist.com/johnson/2013/05/29/lexical-facts, most adult native speakers have a vocabulary in the range of 20,000–35,000 words. Hence, we can suppose that 39,150 words (= (100% - 13%) * 45,000) could be a good starting point for developing the 4-gram self-improving algorithm. We will proceed with 10% of the phrases from the initial corpus.
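A sketch of how such a 10% sample might be drawn (the seed and the exact sampling approach are assumptions):
set.seed(12345)                                   # for reproducibility
keep <- rbinom(length(strings), size = 1, prob = 0.10) == 1
strings_shorten <- strings[keep]                  # ~10% of all phrases
length(strings_shorten) / length(strings)         # should be close to 0.10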
To increase the performance of our algorithm, we will use codes instead of words. For that purpose we build a frequency-sorted dictionary and introduce a unique code for each word. Using codes, we decrease the computation time of our algorithm. Initially, for the shortened data corpus (10% of the initial corpus) we can create a frequency-sorted dictionary containing all words from the corpus. The length of that dictionary is 132,097 words. However, the 5% most frequent words cover more than 90% of the words in our shortened corpus. We will therefore use only the most frequent 5% of words from the frequency-sorted dictionary (i.e. 6,625 words). Our 4-gram algorithm will have an option to add new words to the dictionary, so it won’t get stuck dealing with words that aren’t included in the dictionary.
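A sketch of how such a frequency-sorted dictionary with word codes might be built (assuming the sampled lines are in strings_shorten; the actual construction may differ):
library(dplyr)
library(stringi)
# Split the shortened corpus into lower-case words
words_shorten <- unlist(stri_extract_all_words(stri_trans_tolower(strings_shorten)))
# Frequency-sorted dictionary: one row per word, code = rank by frequency
mapping_words_codes_full <- as.data.frame(table(words_shorten), stringsAsFactors = FALSE) %>%
  rename(words = words_shorten, count = Freq) %>%
  arrange(desc(count)) %>%
  mutate(code = row_number())
# Keep only the most frequent 5% of words (they cover more than 90% of the corpus)
n_keep <- ceiling(0.05 * nrow(mapping_words_codes_full))
mapping_words_codes_shorten <- mapping_words_codes_full[1:n_keep, c("words", "code")]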
head(mapping_words_codes_shorten,10)
## words code
## 1 the 1
## 2 to 2
## 3 i 3
## 4 and 4
## 5 a 5
## 6 of 6
## 7 in 7
## 8 you 8
## 9 it 9
## 10 is 10
To prepare data for our 4-gram algorithm, we create a table with the following columns:
- First word
- Second word
- Third word
- Fourth word
- 4-gram count in the data corpus
- 4-gram frequency in the data corpus

We won’t include 4-grams with NAs in the 4-gram database. Such 4-grams are expected to occur because we decided to use only 5% of the frequency-sorted dictionary, so not every word in every 4-gram will have a code in the dictionary.
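A sketch of how this 4-gram table might be assembled (assuming each line has already been converted to a vector of word codes, stored in a list line_codes; the name is illustrative):
library(dplyr)
# Build all 4-grams per line as rows of codes
four_grams_per_line <- lapply(line_codes, function(codes) {
  n <- length(codes)
  if (n < 4) return(NULL)
  data.frame(one = codes[1:(n - 3)], two = codes[2:(n - 2)],
             three = codes[3:(n - 1)], four = codes[4:n])
})
# Count identical 4-grams, drop those containing NAs, compute frequencies
global_four_grams_count_df <- bind_rows(four_grams_per_line) %>%
  filter(!is.na(one), !is.na(two), !is.na(three), !is.na(four)) %>%
  count(one, two, three, four, name = "count") %>%
  arrange(desc(count)) %>%
  mutate(freq = count / sum(count))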
head(global_four_grams_count_df,10)
## # A tibble: 10 × 6
## one two three four count freq
## <int> <int> <int> <int> <int> <dbl>
## 1 3 71 18 69 864 0.000201
## 2 32 18 220 2 852 0.000198
## 3 3 44 88 2 776 0.000180
## 4 93 12 1 158 636 0.000148
## 5 1 237 6 1 538 0.000125
## 6 3 71 18 84 528 0.000123
## 7 71 18 103 2 525 0.000122
## 8 1 502 6 1 452 0.000105
## 9 23 1 237 6 418 0.0000970
## 10 3 71 18 21 397 0.0000922
We can also see on the graph how frequently different 4-grams occur in the corpus:
plot(1:3000, global_four_grams_count_df$count[1:3000],
xlab = '4-gram index in database',
ylab = 'Count of 4-grams in corpora')
The 4-gram database is the basis of our prediction model. The following code demonstrates how it predicts the 4th word based on the previous three words.
first_word='time' # code: 53
second_word='to' # code: 2
third_word ='say' # code: 140
fourth_word_prediction_options_g <- fourth_word_prediction(first_word=first_word, # First word
second_word=second_word, # Second word
third_word =third_word, # Third word
f_mapping_words_codes_shorten = mapping_words_codes_shorten, # Frequency sorted dictionary
f_global_four_grams_count_df = global_four_grams_count_df) # 4-gram data base
fourth_word_prediction_options_g[[2]]
## # A tibble: 3 × 7
## one two three four count freq words
## <int> <int> <int> <int> <int> <dbl> <chr>
## 1 53 2 140 3133 4 0.000000928 goodbye
## 2 53 2 140 2308 1 0.000000232 bye
## 3 53 2 140 559 1 0.000000232 fuck
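A minimal sketch of what such a lookup might do internally (illustrative only; the actual fourth_word_prediction implementation may differ): translate the three input words to codes, filter the 4-gram table, and return up to three of the most frequent continuations with their words attached:
library(dplyr)
predict_fourth_word_sketch <- function(w1, w2, w3, dict, four_grams) {
  word_to_code <- function(w) dict$code[match(w, dict$words)]
  c1 <- word_to_code(w1); c2 <- word_to_code(w2); c3 <- word_to_code(w3)
  four_grams %>%
    filter(one == c1, two == c2, three == c3) %>%
    arrange(desc(count)) %>%
    slice_head(n = 3) %>%                      # up to 3 "quick options"
    left_join(dict, by = c("four" = "code"))   # attach the predicted words
}
predict_fourth_word_sketch("time", "to", "say",
                           mapping_words_codes_shorten, global_four_grams_count_df)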
If there is no relevant 4-gram in the 4-gram database, the algorithm won’t throw an error. Instead, the model will learn new words from the provided 4-gram and adjust the underlying data.
The result of the prediction is up to 3 word options for the user (so-called “quick options”). The user is expected to choose one of them to complete their 4-gram. These options are provided based on the frequency of 4-grams in the 4-gram database. Possible user actions in response to the suggested quick options are:
- choosing one of the suggested options;
- typing a new word which is not included in the list of options.

In case of choosing one of the “quick options”:
1. The count of the corresponding existing 4-gram in the 4-gram database is updated. Then the frequency of 4-grams is re-calculated for the whole n-gram database.

In case of typing a new word:
1. The word is checked against the frequency-sorted dictionary. If the word is not in the dictionary, the dictionary is updated.
2. Either a new 4-gram is added to the 4-gram database with count 1, or the count of the matching existing 4-gram is updated (count = count + 1). Then the frequency of 4-grams is re-calculated for the whole n-gram database.

Updating the dictionary and the 4-gram database will gradually adapt them to the user’s language habits. For example, if the user often types “time to say stop” (with the initial count for the corresponding 4-gram equal to zero), that 4-gram will gradually be treated as more and more frequently used: its count will accumulate in the model. After the count for that 4-gram reaches a particular threshold, the word “stop” may be offered to the user as a quick option.
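A sketch of the 4-gram update step described above (illustrative only; the actual fourth_word_selection implementation may differ, and the dictionary update is omitted here):
library(dplyr)
update_four_gram_sketch <- function(c1, c2, c3, c4, four_grams) {
  hit <- which(four_grams$one == c1 & four_grams$two == c2 &
               four_grams$three == c3 & four_grams$four == c4)
  if (length(hit) == 1) {
    four_grams$count[hit] <- four_grams$count[hit] + 1   # existing 4-gram: count = count + 1
  } else {
    four_grams <- bind_rows(four_grams,                  # new 4-gram added with count 1
                            data.frame(one = c1, two = c2, three = c3,
                                       four = c4, count = 1, freq = 0))
  }
  four_grams %>% mutate(freq = count / sum(count))       # re-calculate frequencies
}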
User interface prototype:
Count of user options inside the program: 1, 2, 3 - suggested “quick options”; 4 - the user types something else.
choice_amount_g <- nrow(fourth_word_prediction_options_g[[2]]) + 1
choice_amount_g
## [1] 4
Options vector:
user_choice_options_g <- 1:choice_amount_g
user_choice_options_g
## [1] 1 2 3 4
user_choice_example = 3 # the user selects quick option 3
input_word_example = fourth_word_prediction_options_g[[2]]$words[3] # the word shown as option 3
User selects an option:
fourth_word_selection(fourth_word_prediction_options = fourth_word_prediction_options_g,
user_choice_options = user_choice_options_g,
user_choice = user_choice_example,
input_word = input_word_example)
## # A tibble: 1 × 7
## one two three four count freq words
## <int> <int> <int> <int> <int> <dbl> <chr>
## 1 53 2 140 559 1 0.000000232 fuck
Now, if we check the 4-gram database, we will find the 4-gram “53, 2, 140, 559” updated.
Compare:
fourth_word_prediction_options_g[[2]]
## # A tibble: 3 × 7
## one two three four count freq words
## <int> <int> <int> <int> <int> <dbl> <chr>
## 1 53 2 140 3133 4 0.000000928 goodbye
## 2 53 2 140 2308 1 0.000000232 bye
## 3 53 2 140 559 1 0.000000232 fuck
fourth_word_prediction(first_word=first_word, # First word
second_word=second_word, # Second word
third_word =third_word, # Third word
f_mapping_words_codes_shorten = mapping_words_codes_shorten, # Frequency sorted dictionary
f_global_four_grams_count_df = global_four_grams_count_df)[[2]] # 4-gram data base
## # A tibble: 3 × 7
## one two three four count freq words
## <int> <int> <int> <int> <dbl> <dbl> <chr>
## 1 53 2 140 3133 4 0.000000928 goodbye
## 2 53 2 140 559 2 0.000000464 fuck
## 3 53 2 140 9 1 0.000000232 it
Count and frequency (freq) for the 4-gram “53, 2, 140, 559” have been updated. Everything works as expected.
In the following steps of the project, this algorithm will be implemented as a Shiny app.