(Hidden) Reading strings from all sources of data and writing them to the variable “strings”
Count of lines in “strings”:
## [1] 3336695
Overall count of words in all “strings” (lines):
## [1] 71099622
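A minimal sketch of how these counts might be obtained (the reading code itself is hidden above; the file names and the use of the stringi package are assumptions):
library(stringi)
# Assumed source files; adjust to the actual data location
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
strings <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))
length(strings)                 # count of lines
sum(stri_count_words(strings))  # overall count of words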
Word frequency graph (1-gram graph):
The most frequent 10% of words cover 98.3% of all text in the database; each of these words occurs 25 or more times in the corpus. Processing the full corpus is highly time- and memory-demanding.
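These coverage figures can be checked with a cumulative sum over a frequency-sorted word table; a sketch assuming a data frame word_freq_df (hypothetical name) with columns words and count, sorted by decreasing count:
top_10_percent <- ceiling(0.10 * nrow(word_freq_df))                   # number of words in the top 10%
sum(word_freq_df$count[1:top_10_percent]) / sum(word_freq_df$count)    # share of text covered (~0.983)
min(word_freq_df$count[1:top_10_percent])                              # least frequent word in the top 10% (~25)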
Let’s see what happens if we shorten our database 10 times (keeping only 10% of all phrases from the corpus). The graph below shows which portion of the words from the basic data set can also be found in the shortened data set. For example, of the first 10,000 words of the basic data set, more than 96% (i.e. about 9,600 words) can be found in the shortened data set.
We can conclude from the graph that if we reduce the initial corpus 10 times (randomly taking 10% of phrases from the initial corpus), we will lose less than 13% of the 45,000 most frequently used words. According to the article https://www.economist.com/johnson/2013/05/29/lexical-facts, most adult native speakers have a vocabulary in the range of 20,000–35,000 words. Hence, we can suppose that 39,150 words (= (100% - 13%) * 45,000) could be a good starting point for developing the 4-gram self-improving algorithm. We will proceed with 10% of the phrases from the initial corpus.
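A sketch of how such a 10% sample might be drawn (the seed and the exact sampling approach are assumptions):
set.seed(12345)                                   # for reproducibility
keep <- rbinom(length(strings), size = 1, prob = 0.10) == 1
strings_shorten <- strings[keep]                  # ~10% of all phrases
length(strings_shorten) / length(strings)         # should be close to 0.10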
To increase the performance of our algorithm, we will use codes instead of words. For that purpose we build a frequency-sorted dictionary and introduce a unique code for each word. Using codes, we decrease the computation time of our algorithm. Initially, for the shortened data corpus (10% of the initial corpus) we can create a frequency-sorted dictionary containing all words from the corpus. The length of that dictionary is 132,097 words. However, the 5% most frequent words cover more than 90% of the words in our shortened corpus. We will therefore use only the most frequent 5% of words from the frequency-sorted dictionary (i.e. 6,625 words). Our 4-gram algorithm will have an option to add new words to the dictionary, so it won’t get stuck dealing with words that aren’t included in the dictionary.
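A sketch of how such a frequency-sorted dictionary with word codes might be built (assuming the sampled lines are in strings_shorten; the actual construction may differ):
library(dplyr)
library(stringi)
# Split the shortened corpus into lower-case words
words_shorten <- unlist(stri_extract_all_words(stri_trans_tolower(strings_shorten)))
# Frequency-sorted dictionary: one row per word, code = rank by frequency
mapping_words_codes_full <- as.data.frame(table(words_shorten), stringsAsFactors = FALSE) %>%
  rename(words = words_shorten, count = Freq) %>%
  arrange(desc(count)) %>%
  mutate(code = row_number())
# Keep only the most frequent 5% of words (they cover more than 90% of the corpus)
n_keep <- ceiling(0.05 * nrow(mapping_words_codes_full))
mapping_words_codes_shorten <- mapping_words_codes_full[1:n_keep, c("words", "code")]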
head(mapping_words_codes_shorten,10)
## words code
## 1 the 1
## 2 to 2
## 3 i 3
## 4 and 4
## 5 a 5
## 6 of 6
## 7 in 7
## 8 you 8
## 9 it 9
## 10 is 10
To prepare data for our 4-gram algorithm, we create a table with the following columns:
- First word
- Second word
- Third word
- Fourth word
- 4-gram count in the data corpus
- 4-gram frequency in the data corpus

We won’t include 4-grams with NAs in the 4-gram database. Such 4-grams are expected to occur because we decided to use only 5% of the frequency-sorted dictionary, so not every word in every 4-gram will have a code in the dictionary.
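A sketch of how this 4-gram table might be assembled (assuming each line has already been converted to a vector of word codes, stored in a list line_codes; the name is illustrative):
library(dplyr)
# Build all 4-grams per line as rows of codes
four_grams_per_line <- lapply(line_codes, function(codes) {
  n <- length(codes)
  if (n < 4) return(NULL)
  data.frame(one = codes[1:(n - 3)], two = codes[2:(n - 2)],
             three = codes[3:(n - 1)], four = codes[4:n])
})
# Count identical 4-grams, drop those containing NAs, compute frequencies
global_four_grams_count_df <- bind_rows(four_grams_per_line) %>%
  filter(!is.na(one), !is.na(two), !is.na(three), !is.na(four)) %>%
  count(one, two, three, four, name = "count") %>%
  arrange(desc(count)) %>%
  mutate(freq = count / sum(count))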
head(global_four_grams_count_df,10)
## # A tibble: 10 × 6
## one two three four count freq
## <int> <int> <int> <int> <int> <dbl>
## 1 3 71 18 69 864 0.000201
## 2 32 18 220 2 852 0.000198
## 3 3 44 88 2 776 0.000180
## 4 93 12 1 158 636 0.000148
## 5 1 237 6 1 538 0.000125
## 6 3 71 18 84 528 0.000123
## 7 71 18 103 2 525 0.000122
## 8 1 502 6 1 452 0.000105
## 9 23 1 237 6 418 0.0000970
## 10 3 71 18 21 397 0.0000922
We can also see on the graph how frequently different 4-grams occur in the corpus:
plot(1:3000, global_four_grams_count_df$count[1:3000],
xlab = '4-gram index in database',
ylab = 'Count of 4-grams in corpora')
The 4-gram database is the basis of our prediction model. The following code demonstrates how it predicts the 4th word based on the previous three words.
first_word='time' # code: 53
second_word='to' # code: 2
third_word ='say' # code: 140
fourth_word_prediction_options_g <- fourth_word_prediction(first_word=first_word, # First word
second_word=second_word, # Second word
third_word =third_word, # Third word
f_mapping_words_codes_shorten = mapping_words_codes_shorten, # Frequency sorted dictionary
f_global_four_grams_count_df = global_four_grams_count_df) # 4-gram data base
fourth_word_prediction_options_g[[2]]
## # A tibble: 3 × 7
## one two three four count freq words
## <int> <int> <int> <int> <int> <dbl> <chr>
## 1 53 2 140 3133 4 0.000000928 goodbye
## 2 53 2 140 2308 1 0.000000232 bye
## 3 53 2 140 559 1 0.000000232 fuck
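A minimal sketch of what such a lookup might do internally (illustrative only; the actual fourth_word_prediction implementation may differ): translate the three input words to codes, filter the 4-gram table, and return up to three of the most frequent continuations with their words attached:
library(dplyr)
predict_fourth_word_sketch <- function(w1, w2, w3, dict, four_grams) {
  word_to_code <- function(w) dict$code[match(w, dict$words)]
  c1 <- word_to_code(w1); c2 <- word_to_code(w2); c3 <- word_to_code(w3)
  four_grams %>%
    filter(one == c1, two == c2, three == c3) %>%
    arrange(desc(count)) %>%
    slice_head(n = 3) %>%                      # up to 3 "quick options"
    left_join(dict, by = c("four" = "code"))   # attach the predicted words
}
predict_fourth_word_sketch("time", "to", "say",
                           mapping_words_codes_shorten, global_four_grams_count_df)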
If there is no relevant 4-gram in the 4-gram database, the algorithm won’t throw an error. Instead, the model will learn new words from the provided 4-gram and adjust the underlying data.
The result of the prediction is up to 3 word options for the user (so-called “quick options”). The user is expected to choose one of them to complete their 4-gram. These options are provided based on the frequency of 4-grams in the 4-gram database. Possible user actions in response to the suggested quick options are:
- choosing one of the suggested options;
- typing a new word which is not included in the list of options.

In case of choosing one of the “quick options”:
1. The count of the corresponding existing 4-gram in the 4-gram database is updated. Then the frequency of 4-grams is re-calculated for the whole n-gram database.

In case of typing a new word:
1. The word is checked against the frequency-sorted dictionary. If the word is not in the dictionary, the dictionary is updated.
2. Either a new 4-gram is added to the 4-gram database with count 1, or the count of the matching existing 4-gram is updated (count = count + 1). Then the frequency of 4-grams is re-calculated for the whole n-gram database.

Updating the dictionary and the 4-gram database will gradually adapt them to the user’s language habits. For example, if the user often types “time to say stop” (with the initial count for the corresponding 4-gram equal to zero), that 4-gram will gradually be treated as more and more frequently used: its count will accumulate in the model. After the count for that 4-gram reaches a particular threshold, the word “stop” may be offered to the user as a quick option.
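A sketch of the 4-gram update step described above (illustrative only; the actual fourth_word_selection implementation may differ, and the dictionary update is omitted here):
library(dplyr)
update_four_gram_sketch <- function(c1, c2, c3, c4, four_grams) {
  hit <- which(four_grams$one == c1 & four_grams$two == c2 &
               four_grams$three == c3 & four_grams$four == c4)
  if (length(hit) == 1) {
    four_grams$count[hit] <- four_grams$count[hit] + 1   # existing 4-gram: count = count + 1
  } else {
    four_grams <- bind_rows(four_grams,                  # new 4-gram added with count 1
                            data.frame(one = c1, two = c2, three = c3,
                                       four = c4, count = 1, freq = 0))
  }
  four_grams %>% mutate(freq = count / sum(count))       # re-calculate frequencies
}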
User interface prototype:
Count of user options inside the program: 1, 2, 3 - suggested “quick options”; 4 - the user types something else.
choice_amount_g <- nrow(fourth_word_prediction_options_g[[2]]) + 1
choice_amount_g
## [1] 4
Options vector:
user_choice_options_g <- 1:choice_amount_g
user_choice_options_g
## [1] 1 2 3 4
user_choice_example = 3 # the user selects quick option 3
input_word_example = fourth_word_prediction_options_g[[2]]$words[3] # the word shown as option 3
User selects an option:
fourth_word_selection(fourth_word_prediction_options = fourth_word_prediction_options_g,
user_choice_options = user_choice_options_g,
user_choice = user_choice_example,
input_word = input_word_example)
## # A tibble: 1 × 7
## one two three four count freq words
## <int> <int> <int> <int> <int> <dbl> <chr>
## 1 53 2 140 559 1 0.000000232 fuck
Now, if we check the 4-gram database, we will find the 4-gram “53, 2, 140, 559” updated.
Compare:
fourth_word_prediction_options_g[[2]]
## # A tibble: 3 × 7
## one two three four count freq words
## <int> <int> <int> <int> <int> <dbl> <chr>
## 1 53 2 140 3133 4 0.000000928 goodbye
## 2 53 2 140 2308 1 0.000000232 bye
## 3 53 2 140 559 1 0.000000232 fuck
fourth_word_prediction(first_word=first_word, # First word
second_word=second_word, # Second word
third_word =third_word, # Third word
f_mapping_words_codes_shorten = mapping_words_codes_shorten, # Frequency sorted dictionary
f_global_four_grams_count_df = global_four_grams_count_df)[[2]] # 4-gram data base
## # A tibble: 3 × 7
## one two three four count freq words
## <int> <int> <int> <int> <dbl> <dbl> <chr>
## 1 53 2 140 3133 4 0.000000928 goodbye
## 2 53 2 140 559 2 0.000000464 fuck
## 3 53 2 140 9 1 0.000000232 it
Count and frequency (freq) for the 4-gram “53, 2, 140, 559” have been updated. Everything works as expected.
In the following steps of the project, this algorithm will be implemented as a Shiny app.