Executive summary

The goal of the Data Science Capstone is to build statistical models for text prediction based on the SwiftKey US dataset. The models will then be deployed via a publicly available Shiny app. This report details some basic features of the dataset and the strategy for model building. The analyses described here are the basis for the development of the Shiny app Pr3d1ct0, already available at filippo-ciceri.shinyapps.io/Pr3d1ct0/. A quick introduction to Pr3d1ct0 is also available here.

Exploratory data analysis

The SwiftKey US dataset provided for the analysis is divided into three sections (blogs, news and Twitter). This section summarizes some basic features of each one.

Blogs

  • total number of lines = 899,288
  • maximum line length = 40,835 characters
  • mean line length = 231.696 characters
  • total number of words = 37,334,131
    • characters-only = 36,778,716
    • digits-only = 157,502
    • characters and punctuation marks = 5,402,180
    • digits and punctuation marks = 253,862

News

  • total number of lines = 77,259
  • maximum line length = 5,760 characters
  • mean line length = 203.0024 characters
  • total number of words = 2,643,969
    • characters-only = 2,569,108
    • digits-only = 23,974
    • characters and punctuation marks = 415,828
    • digits and punctuation marks = 40,207

Twitter

  • total number of lines = 2,360,148
  • maximum line length = 213 characters
  • mean line length = 68.80281 characters
  • total number of words = 30,373,545
    • characters-only = 29,284,611
    • digits-only = 191,790
    • characters and punctuation marks = 5,351,139
    • digits and punctuation marks = 313,916

Subsetting and filtering the dataset

Generating the training and testing dataset

40% of each SwiftKey dataset (blogs, news, Twitter) was selected for building the statistical models (training dataset), while the rest of the data (testing dataset) was reserved for performance evaluation.
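
A minimal sketch of the split for a single source is shown below (the complete code, including the random seed, is in the appendix); training_blogs and testing_blogs are illustrative names.

# sample 40% of the lines for training and keep the remainder for testing
idx <- sample(length(data_blogs), size = floor(0.4*length(data_blogs)))
training_blogs <- data_blogs[idx]
testing_blogs <- data_blogs[-idx]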

Filtering the training dataset

First, the text was converted to lower case and any sequence of characters containing digits or punctuation marks was removed.

Next, profanities were removed, based on a public list (prophanities_list.txt) commonly used for spam filtering and parental control.

After these filtering steps, the total number of words in the training dataset is 13,743,413.

The frequency of each word in the training dataset (Freq) was obtained using R's table function. The probability of each word (Maximum Likelihood Estimation, P) and its cumulative probability (cum_P) were also calculated.
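
In code, this step can be sketched as follows (the full version appears in the appendix):

word_frequency_df <- data.frame(sort(table(all_words), decreasing=TRUE))
word_frequency_df['P'] <- word_frequency_df$Freq / sum(word_frequency_df$Freq)
word_frequency_df['cum_P'] <- cumsum(word_frequency_df$P)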

The words in the training dataset are distributed as follows:

  • 132 words provide 50% dataset coverage
  • 6,663 words provide 90% dataset coverage
  • 18,557 words provide 95% dataset coverage

The 10 most common words are listed below.

   all_words   Freq
1        the 494955
2         to 370589
3          a 296345
4          i 295004
5        and 228001
6        you 223462
7         in 191727
8         of 190306
9        for 175610
10        is 160774

Filtering non-English words (OOV words)

Non-English words were removed from the dataset to speed up the analyses. This was done by comparing each word against a dictionary of the 10,000 most common words in the English language.

The percentage of out-of-vocabulary (OOV) words in the training dataset is 16.09%. After removing OOV words, the word distribution is:

  • 74 words for 50% dataset coverage
  • 1,779 words for 90% dataset coverage
  • 3,085 words for 95% dataset coverage

Stemming the dataset

Another way to compress the original dataset is to stem words, removing suffixes such as “ing”, “ed”, etc. For example, “working”, “worked” and “works” can be stemmed down to the shorter sequence of characters “work”. This can be easily done via the SnowballC library.
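
A minimal example with the SnowballC library:

library(SnowballC)

# "working", "worked" and "works" are all reduced to the stem "work"
wordStem(c("working", "worked", "works"), language = "en")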

After stemming:

  • 68 words provide 50% dataset coverage
  • 1,132 words provide 90% dataset coverage
  • 1,876 words provide 95% dataset coverage

However, I believe that stemming oversimplifies the dataset in a way that may be detrimental to the accuracy of the models. For this reason, the n-gram models will be built without stemming the training dataset (see the next section for more details).

Building the n-gram models

A bit of theory

Applying the chain rule of probability, the probability of a sequence of words \(P(w_1...w_n)\) can be computed as

\(P(w_1^n)=P(w_1)P(w_2|w_1)P(w_3|w_1^2)...P(w_n|w_1^{n-1})=\prod_{k = 1}^{n}P(w_k|w_1^{k-1})\)

With the Markov assumption that the probability of each word depends only on the previous one, the formula simplifies to

\(P(w_1^n)\approx\prod_{k = 1}^{n}P(w_k|w_{k-1})\)

where \(n\) is the number of words in the sequence.

This is the basis for building a prediction model based on pairs of words (2-grams).
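
For example, under the 2-gram model the probability of the sequence “i will be” is approximated as \(P(\text{i will be}) \approx P(\text{i})\,P(\text{will}|\text{i})\,P(\text{be}|\text{will})\).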

More complex models (3-grams, 4-grams, 5-grams) are based on similar assumptions.

\(P(w_1^n)\approx\prod_{k = 1}^{n}P(w_k|w_{k-1}w_{k-2})\)

\(P(w_1^n)\approx\prod_{k = 1}^{n}P(w_k|w_{k-1}w_{k-2}w_{k-3})\)

\(P(w_1^n)\approx\prod_{k = 1}^{n}P(w_k|w_{k-1}w_{k-2}w_{k-3}w_{k-4})\)

1-gram model

The 1-gram model can be defined as the frequency of single words across the dataset. The probability of each single word is calculated according to the Maximum Likelihood Estimation using the table function.

The number of unique words in the 1-gram model is 8,155.

2-gram model

The 2-gram model is calculated from the frequency of each pair of words (2-grams) in the dataset. The list of 2-grams is extracted using the gregexpr("([a-z']+ [a-z']+)", text) command, which selects pairs of character sequences separated by a space.
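
A toy example of this extraction step (note that gregexpr returns non-overlapping matches):

text <- "i will be there"
regmatches(text, gregexpr("([a-z']+ [a-z']+)", text))[[1]]
# [1] "i will"   "be there"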

The total number of 2-grams in the training dataset is 6,785,482 (2,047,706 unique 2-grams).

After removing 2-grams containing profanities and OOV words, the total number of 2-grams is 4,793,982 (842,593 unique 2-grams). The frequency of each 2-gram is stored in a dataframe for further manipulation.

The 2-grams in the training dataset are distributed as follows:

  • 9,895 2-grams for 50% dataset coverage
  • 363,194 2-grams for 90% dataset coverage
  • 602,893 2-grams for 95% dataset coverage

The 10 most common 2-grams are listed below.

           bigram  freq
356477     in the 20957
500664     of the 17100
268981    for the 16728
509722     on the 12143
745652     to the 11124
742024      to be 10744
709342 thanks for  9582
58619      at the  9340
347207     i love  7835
296148   going to  7690

The plot above suggests that 2-gram frequencies follow a Zipf distribution, where the vast majority of the 2-grams occur only rarely within the training dataset. In order to reduce the size of the model, such 2-grams will be removed because they provide little additional predictive power (their probability is close to \(1/V\), the probability of extracting a random word from the dictionary, where \(V\) is the length of the 1-gram dictionary). The process of removing low-probability n-grams is generally referred to as “n-gram pruning”.

   freq 2-grams removed       N reduction
1                   < 1       0 0.0000000
2                   < 2  525651 0.1096481
3                   < 3  769655 0.1605461
4                   < 4  924377 0.1928203
5                   < 5 1042109 0.2173786
6                   < 6 1137554 0.2372879
7                   < 7 1218716 0.2542179
8                   < 8 1289087 0.2688969
9                   < 9 1350007 0.2816045
10                 < 10 1405447 0.2931690

The table above shows that removing 2-grams appearing fewer than 4 times in the dataset achieves a compression of approximately 19.28%. These 2-grams were therefore removed from the dataset.
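
The pruning itself is a single filter on the frequency dataframe; a condensed version of the appendix code:

# fraction of the total 2-gram count removed when pruning 2-grams with freq < 4
sum(ngrams_2_freq[ngrams_2_freq['freq'] < 4, 'freq']) / length(ngrams_2)

# apply the pruning: keep only 2-grams observed at least 4 times
ngrams_2_freq <- ngrams_2_freq[ngrams_2_freq['freq'] > 3, ]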

The bigram model is based on the conditional probability of a word \(w_{n}\) given the previous word \(w_{n-1}\). Via Maximum Likelihood Estimation, the probability of each pair of words \(w_{n-1}w_n\) is calculated from the observed frequencies in the training dataset and stored in a dataframe. The formula used for calculating bigram probabilities from observed frequencies is:

\(P(w_{n}|w_{n-1}) = \frac{C(w_{n-1}w_n)}{\sum_{w}C(w_{n-1}w)}=\frac{C(w_{n-1}w_n)}{C(w_{n-1})}\)

where \(P(w_{n}|w_{n-1})\) represents the conditional probability of the word \(w_n\) given \(w_{n-1}\) and \(C(w_{n-1}w_n)\) the frequency of the 2-gram \(w_{n-1}w_n\) in the training dataset.

The dataframe containing all the observed \(w_{n-1}w_n\) 2-grams and their probabilities occupies 15.9 Mb. The dataframe can be “cast” into a matrix of 8,155 x 8,155 cells using the acast function from the reshape2 package. This matrix covers every possible combination of two words, most of them with an assigned probability of 0 because they were not observed in the training dataset. As expected, this matrix is considerably larger (334 Mb) than the dataframe of observed bigrams. The next sections discuss in detail how to handle the probability of n-grams not observed in the training dataset.
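
Condensed, this step looks as follows (the full version, with intermediate checks, is in the appendix); note that acast requires the reshape2 package:

library(reshape2)

# MLE: divide each 2-gram count by the total count of 2-grams starting with the same word
aggregate_1_gram <- aggregate(ngrams_2_freq$freq, by=list(Category=ngrams_2_freq$word1), FUN=sum)
ngrams_2_freq <- merge(ngrams_2_freq, aggregate_1_gram, by.x='word1', by.y='Category')
names(ngrams_2_freq)[ncol(ngrams_2_freq)] <- 'word1_freq'
ngrams_2_freq['P'] <- ngrams_2_freq['freq'] / ngrams_2_freq['word1_freq']

# cast the dataframe of observed 2-grams into the full word1 x word2 probability matrix
bigram_matrix <- acast(ngrams_2_freq, word1~word2, value.var="P")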

3-gram model

The 3-gram model is calculated from the frequency of three-word sequences (3-grams) in the dataset. The sequence of steps is the same as described in the previous section.

The total number of 3-grams in the training dataset is 4,450,527 (3,201,840 unique 3-grams).

3-grams containing profanities and OOV words were removed (2,664,145 unique 3-grams after filtering) before calculating each frequency and the related probability (Maximum Likelihood Estimation).

The 3-grams in our dataset are distributed as follows:

  • 300,594 3-grams for 50% dataset coverage
  • 1,366,252 3-grams for 90% dataset coverage
  • 1,499,459 3-grams for 95% dataset coverage

The 10 most common 3-grams are listed below.

                   trigram freq
1264508     thanks for the 3931
1263405      thank you for 1417
626690          i love you 1359
779107  looking forward to 1357
633140           i want to 1103
17047             a lot of 1084
515264         going to be 1073
975177          one of the 1018
462364      for the follow  934
627880           i need to  930

As in the case of the 2-grams, the 3-gram model was also pruned to remove unnecessary information.

   freq 3-grams removed       N reduction
1                   < 1       0 0.0000000
2                   < 2 1361096 0.5108941
3                   < 3 1661186 0.6235344
4                   < 4 1796588 0.6743582
5                   < 5 1884808 0.7074720
6                   < 6 1947718 0.7310856
7                   < 7 1997548 0.7497895
8                   < 8 2037686 0.7648555
9                   < 9 2071246 0.7774524
10                 < 10 2100244 0.7883370

The table above shows that removing 3-grams appearing fewer than 4 times in the dataset achieves a compression of approximately 67.44%. The dataset was pruned accordingly.

The 3-gram model is based on the probability of a word \(w_n\) given the two preceding words \(w_{n-2}w_{n-1}\). The dataframe of observed 3-gram frequencies can be converted into a matrix containing 2-grams as rows and tail words as columns, as described above for the 2-gram model. The formula used for computing probabilities is:

\(P(w_{n}|w_{n-1}w_{n-2}) = \frac{C(w_{n-2}w_{n-1}w_n)}{\sum_{w}C(w_{n-2}w_{n-1}w)}=\frac{C(w_{n-2}w_{n-1}w_n)}{C(w_{n-2}w_{n-1})}\)

where \(P(w_{n}|w_{n-1}w_{n-2})\) represents the conditional probability of the word \(w_n\) given \(w_{n-2}w_{n-1}\) and \(C(w_{n-2}w_{n-1}w_n)\) the frequency of the 3-gram \(w_{n-2}w_{n-1}w_n\) in the training dataset.

The size of the dataframe containing all the observed 3-grams is 11.4 Mb, while the full 2-grams-vs-tail-words matrix occupies 611.1 Mb.

4-gram model

The 4-gram model is calculated from the frequency of four-word sequences (4-grams) in the dataset. The sequence of steps is very similar to the one described in the previous sections.

The total number of 4-grams in the training dataset is 1,670,158 (1,462,539 unique 4-grams), distributed as follows:

  • 627,460 4-grams for 50% dataset coverage
  • 1,295,523 4-grams for 90% dataset coverage
  • 1,379,031 4-grams for 95% dataset coverage

The 10 most common 4-grams are listed below.

                     tetragram freq
1099342  thanks for the follow  887
1097347      thank you for the  441
1097815      thank you so much  310
605384          is going to be  278
385858      for the first time  267
511629         hope to see you  195
432248           going to be a  186
1099407 thanks for the mention  181
557041          i wish i could  173
565026          if you want to  173

As in the case of the lower-order models, the 4-gram model was pruned; the table below shows the effect of different pruning thresholds.

   freq 4-grams removed       N reduction
1                   < 1       0 0.0000000
2                   < 2 1362667 0.8158911
3                   < 3 1503975 0.9004986
4                   < 4 1544211 0.9245898
5                   < 5 1566779 0.9381023
6                   < 6 1581594 0.9469727
7                   < 7 1592034 0.9532236
8                   < 8 1600042 0.9580183
9                   < 9 1606738 0.9620275
10                 < 10 1612111 0.9652446

Most of the 4-grams appear only once in the training dataset. Removing those 4-grams achieves a compression of 81.59%.

Probabilities were calculated from the frequencies observed in the dataset according to the formula

\(P(w_{n}|w_{n-1}w_{n-2}w_{n-3}) = \frac{C(w_{n-3}w_{n-2}w_{n-1}w_n)}{\sum_{w}C(w_{n-3}w_{n-2}w_{n-1}w)}=\frac{C(w_{n-3}w_{n-2}w_{n-1}w_n)}{C(w_{n-3}w_{n-2}w_{n-1})}\)

5-gram model

The 5-gram model is calculated from the frequency of five-word sequences (5-grams) in the training dataset. The sequence of steps is very similar to the one described in the previous sections.

The total number of 5-grams in the training dataset is 1,118,268 (1,069,577 unique 5-grams), distributed as follows:

  • 510,443 5-grams for 50% dataset coverage
  • 957,750 5-grams for 90% dataset coverage
  • 1,013,663 5-grams for 95% dataset coverage

The 10 most common 5-grams are listed below.

                      pentagram freq
797047    thank you so much for  102
98832         at the end of the   84
796480 thank you for the follow   81
394095       i love you so much   69
800373   thanks so much for the   67
496726       let me know if you   64
366404    hope you have a great   57
543982     mi run with nike gps   56
274322    for the first time in   55
798911  thanks for the follow i   54

Probabilities were calculated from the frequencies observed in the dataset according to the formula

\(P(w_{n}|w_{n-1}w_{n-2}w_{n-3}w_{n-4}) = \frac{C(w_{n-4}w_{n-3}w_{n-2}w_{n-1}w_n)}{\sum_{w}C(w_{n-4}w_{n-3}w_{n-2}w_{n-1}w)}=\frac{C(w_{n-4}w_{n-3}w_{n-2}w_{n-1}w_n)}{C(w_{n-4}w_{n-3}w_{n-2}w_{n-1})}\)

   freq 5-grams removed       N reduction
1                   < 1       0 0.0000000
2                   < 2 1035821 0.9262726
3                   < 3 1092593 0.9770404
4                   < 4 1100999 0.9845574
5                   < 5 1105235 0.9883454
6                   < 6 1107720 0.9905676
7                   < 7 1109454 0.9921182
8                   < 8 1110889 0.9934014
9                   < 9 1111761 0.9941812
10                 < 10 1112499 0.9948411

Most of the 5-grams appear only once in the dataset, so removing them achieves a compression of 92.63%.

Summary of the models

       ngrams N Freq dataframe P dataframe Full P matrix
1-gram     8155         0.6 Mb      0.7 Mb          <NA>
2-gram   143366        14.3 Mb     15.9 Mb        334 Mb
3-gram    76392        10.6 Mb     11.4 Mb      611.1 Mb
4-gram    99872        18.5 Mb     19.6 Mb     2620.6 Mb
5-gram    33756           8 Mb      8.4 Mb      885.3 Mb

As previously noted, the probability dataframe includes only the probabilities for n-grams observed in the training dataset, whereas the full probability matrix covers all the possible combinations (the majority of which are assigned a probability of 0 because they are unobserved in the training dataset). Comparing the sizes of the probability dataframe and matrix for each n-gram model makes it clear that the vast majority of the probability matrix is occupied by 0s (it is a sparse matrix). For this reason, instead of keeping the whole matrix in memory, it is more convenient (in terms of system resources) to store a dataframe of only the non-zero probabilities, assuming that n-grams excluded from the dataframe have a probability of 0.
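
A minimal sketch of such a lookup for the 2-gram model (the function name is illustrative):

# probability lookup on the dataframe of observed 2-grams;
# unobserved combinations are assumed to have probability 0
get_bigram_p <- function(w1, w2) {
  p <- ngrams_2_freq[ngrams_2_freq$word1 == w1 & ngrams_2_freq$word2 == w2, 'P']
  if (length(p) == 0) 0 else p
}

get_bigram_p("will", "be")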

Testing the models

The testing dataset

In order to save time, only a small fraction of the testing dataset (6,006 words, 231 lines) was used to evaluate the accuracy of the different n-gram models. The accuracy of each model was evaluated by predicting each word based on the 1/2/3/4 previous ones, according to the model used (2/3/4/5-gram model). The time required to perform each prediction was also measured.

        Accuracy 3w Accuracy 2w Accuracy 1w Coverage Perplexity Time (sec)
2-grams        0.15        0.13        0.09     0.79         NA    0.00376
3-grams        0.10        0.08        0.06     0.40         NA    0.00375
4-grams        0.03        0.03        0.03     0.14         NA    0.00762
5-grams        0.01        0.01        0.00     0.02         NA    0.00209

The table above summarizes the performance of the various models in terms of mean accuracy (considering the top 3, top 2 or top 1 predicted words), testing dataset coverage and prediction time (in seconds). Perplexity was not calculated because of the existence of n-grams not observed in the training dataset (coverage < 1). Perplexity (\(PP\)) is the inverse probability of the test set (normalized by the number of words), therefore unobserved n-grams with a probability of 0 imply an infinite perplexity. The next section illustrates some algorithms that, by combining different n-gram models, are able to account for the probability of n-grams not observed in the training dataset.

\(PP(W)=\sqrt[n]{\prod_{i=1}^{n}{\frac{1}{P(w_i|w_{i-n+1}...w_{i-1})}}}\)
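
Equivalently, perplexity can be computed in log space from the vector of per-word probabilities; a minimal sketch (the probabilities in the example call are arbitrary):

# PP = exp(-mean(log(p))), the n-th root of the inverse product of the probabilities
perplexity <- function(p) exp(-mean(log(p)))

perplexity(c(0.2, 0.05, 0.1))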

Linear interpolation, smoothing and back-off algorithms

Linear interpolation, smoothing and backoff are used to calculate the probability for ngrams not observed in the training dataset.

Linear interpolation of the 1-gram, 2-gram and 3-gram models

The idea behind linear interpolation is to combine the probabilities from the different n-gram models using a set of weights \(w_1\), \(w_2\) and \(w_3\) (where \(w_1\) corresponds to the weight assigned to probabilities from the 1-gram model, \(w_2\) to the 2-gram model, and so on).

\(P_{int}(w_n|w_{n-1}w_{n-2})=w_1P(w_n)+w_2P(w_n|w_{n-1})+w_3P(w_n|w_{n-1}w_{n-2})\)

\(w_{1}+w_{2}+w_{3}=1\)

We generated 500 random combinations of 3 values between 0.0000001 and 1, then tested them against a subset of the training dataset. The combination providing the highest predictive power was then used to interpolate probabilities from the 1-gram, 2-gram and 3-gram models, evaluated against the testing dataset.
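
One simple way to generate such candidate weight combinations (a sketch, not necessarily the exact procedure used for the search):

# 500 random triplets of positive weights, rescaled so that each row sums to 1
set.seed(1234)
raw_weights <- matrix(runif(500*3, min=1e-7, max=1), ncol=3)
weights <- raw_weights / rowSums(raw_weights)
colnames(weights) <- c("w1","w2","w3")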

Out of the 500 weight combinations tested, 269 achieved the highest accuracy of 0.21 on the third predicted word. We use the first of these highest-scoring combinations (max_Accuracy_weights[1,c("w1","w2","w3")]) to evaluate the efficacy of linear interpolation between the 1-, 2- and 3-gram models. The code below shows how linear interpolation works (results presented in table format).

# Inputs

final_weights <- max_Accuracy_weights[1,c("w1","w2","w3")]

text <- tolower("I will be")

words <- unlist(strsplit(tolower(text)," "))

# Probabilities calculated using the 1- 2- and 3-gram models

p_trigrams <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1], words[2], sep=" ") &
                            ngrams_3_freq['word3']==words[3], "P"]
p_bigrams <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] &
                           ngrams_2_freq['word2']==words[3], "P"]
p_unigrams <- 1/length(dictionary)

# Weighted probabilities for linear interpolation 

p_unigrams_w <- p_unigrams * final_weights[1]
p_bigrams_w <- p_bigrams * final_weights[2]
p_trigrams_w <- p_trigrams * final_weights[3]

# Final probability

final_p <- p_unigrams_w + p_bigrams_w + p_trigrams_w

# Output table

values <- unname(c(p_unigrams,p_bigrams,p_trigrams,final_weights,
                   p_unigrams_w,p_bigrams_w,p_trigrams_w,
                   NA,NA,final_p))
matrix <- matrix(values, ncol=3, byrow=TRUE)
colnames(matrix) <- c("1-gram model","2-gram model","3-gram model")
rownames(matrix) <- c("P","weights","weighted P","total P")
round(matrix,3)
           1-gram model 2-gram model 3-gram model
P                 0.000        0.302        0.256
weights           0.131        0.176        0.694
weighted P        0.000        0.053        0.177
total P              NA           NA        0.230

The table below shows the performance of linear interpolation of 1-, 2- and 3-gram models on the testing dataset.

            Accuracy 3w Accuracy 2w Accuracy 1w Coverage Perplexity Time (sec)
Linear int.       0.198       0.163       0.119        1       2414    0.03233

Laplace smoothing

Laplace smoothing assigns a probability to unseen n-grams by adding one (or \(k\)) to the count of every possible n-gram (both observed and unobserved). In this way the probability associated with an unobserved 2-gram increases from 0 to \(1/(C(w_{n-1})+V)\). The formula for calculating the n-gram probabilities according to the Maximum Likelihood Estimation is adapted accordingly and presented below (for the 2-gram model).

\(P(w_{n}|w_{n-1}) = \frac{C(w_{n-1}w_n)+1}{C(w_{n-1})+V}\)

where \(V\) is the length of the dictionary of individual words (1-grams).

Below is the implementation in R code, showing the probability associated with the 2-gram ‘will be’.

aggregate_2_grams <- aggregate(ngrams_2_freq$freq, by=list(Category=ngrams_2_freq$word1), FUN=sum)

get_bigrams_p_laplace_smoothing <- function(x) {
  
  words <- unlist(strsplit(tolower(x)," ")[1])
  
  if (length(words)>1 & (!"" %in% words)) {
    
    head_word <- words[(length(words)-1)]
    tail_word <- words[length(words)]
    all_gram_length <- length(ngrams_1_freq$ngram)
    all_bigrams <- paste(head_word,ngrams_1_freq$ngram, sep=" ")
    
    #Named vector of unobserved 2-grams

    freq_0_bigrams <- all_bigrams[!all_bigrams %in% ngrams_2_freq$bigram]
    freq_0_bigrams_named_vector <- rep(0,length(freq_0_bigrams))
    names(freq_0_bigrams_named_vector) <- freq_0_bigrams

    #Named vector of observed 2-grams
    
    freq_non_0_bigrams <- as.character(ngrams_2_freq[ngrams_2_freq['word1']==head_word, 'bigram'])
    freq_non_0_bigrams_vector <- ngrams_2_freq[ngrams_2_freq['word1']==head_word, 'freq']
    names(freq_non_0_bigrams_vector) <- freq_non_0_bigrams

    #Calculate the named vector of sorted probabilities according to the MLE estimation with
    #add-1 smoothing (Laplace smoothing)
    freq_0_bigrams_named_vector_p <- (freq_0_bigrams_named_vector + 1)/all_gram_length
    freq_non_0_bigrams_vector_p <- (freq_non_0_bigrams_vector+1)/
                                   (aggregate_2_grams[aggregate_2_grams['Category']==head_word,'x'] + 
                                   all_gram_length)

    final_p <- sort(c(freq_0_bigrams_named_vector_p,
                      freq_non_0_bigrams_vector_p),decreasing=TRUE)
    
    if (!is.na(final_p[paste(head_word, tail_word, sep=" ")])) {
      perplexity <- 1/final_p[paste(head_word, tail_word, sep=" ")] 
    } else {
      perplexity <- length(dictionary)
    }
    
    list(final_p, perplexity)

  } else {
    list(NA,NA)
  }
}

example <- "will be"

results <- get_bigrams_p_laplace_smoothing(example)

results[[1]]['will be']
  will be 
0.2138484 

The table below summarizes the performance of Laplace smoothing applied to the 2-, 3-, 4- and 5-gram models.

           Accuracy 3w Accuracy 2w Accuracy 1w Coverage Perplexity Time (sec)
Laplace 5g       0.076       0.055       0.037    0.893         77    0.00679
Laplace 4g       0.094       0.074       0.056    0.907        105    0.01550
Laplace 3g       0.147       0.118       0.081    0.914         70    0.01117
Laplace 2g       0.166       0.136       0.093    0.938        371    0.01368

Stupid backoff algorithm

Back-off algorithms resort to lower-order n-gram models when they encounter a combination of words with an associated probability of 0 (i.e. not present in the training dataset). For example, if the probability of the 3-gram “I will be” is 0, the algorithm will consider the probability of the 2-gram “will be”, and so on. The simplest back-off is called “Stupid back-off”. This approach applies a fixed discount factor of 0.4 to the probability obtained from each lower-level n-gram model. An example is shown below (formula and R implementation).

\(P(w_i|w_{i-n+1}...w_{i-1})=\begin{cases}\frac{C(w_{i-n+1}...w_{i-1}w_i)}{C(w_{i-n+1}...w_{i-1})},& \text{if } C(w_{i-n+1}...w_{i-1}w_i)> 0\\0.4*P_{bo}(w_i|w_{i-n+2}...w_{i-1})& \text{otherwise }\end{cases}\)

where \(P_{bo}\) represents the probability associated with the lower-level n-gram. This process is repeated until the algorithm encounters \(P_{bo}>0\) or the lowest-level model is reached (the 1-gram model, where each word is associated with a probability equal to \(1/V\), \(V\) being the length of the 1-gram dictionary). The code below shows the steps required for calculating the probability associated with the 5-grams “thank you so much for” and “I will be a mechanic” using the SBO algorithm.

# process words

example1 <- "Thank you so much for this"
example2 <- "I will be a mechanic"

stupid_back_off <- function(x) {

  words <- strsplit(tolower(x)," ")[[1]]
  
  #calculates probability based on 5-gram model
  
  p <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(words[1:4], collapse=" ") &
                     ngrams_5_freq['word5'] == words[5], 'P']
  
  #back-off to 4-gram model
  
  if (length(p)==0) {
    p <- 0.4*ngrams_4_freq[ngrams_4_freq['trigram']==paste(words[2:4], collapse=" ") &
                           ngrams_4_freq['word4'] == words[5], 'P']
  }
  
  #back-off to 3-gram model
  
  if (length(p)==0) {
    p <- 0.4*0.4*ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[3:4], collapse=" ") &
                 ngrams_3_freq['word3'] == words[5], 'P']
  }
  
  #back-off to 2-gram model
  
  if (length(p)==0) {
    p <- 0.4*0.4*0.4*ngrams_2_freq[ngrams_2_freq['word1']==words[4] &
                                   ngrams_2_freq['word2'] == words[5], 'P']
  }
  
  #back-off to 1-gram model
  
  if (length(p)==0) {
    p <- 0.4*0.4*0.4*0.4*(1/length(dictionary))
  }

  p
}

stupid_back_off(example1)
[1] 0.7083333
stupid_back_off(example2)
[1] 3.139178e-06

When tested on the testing dataset, the performance of the Stupid back-off algorithm on the 5-, 4-, 3-, 2- and 1-gram models is as follows.

          Accuracy 3w Accuracy 2w Accuracy 1w Coverage Perplexity Time (sec)
SBO 5->1g       0.164       0.135       0.094    0.756      23587    0.01585
SBO 3->1g       0.184       0.151       0.105    0.794       2034    0.00760

Katz back-off algorithm

A more advanced version of back-off is the Katz back-off algorithm. In this case a certain amount of probability is discounted from the higher-level n-gram model and redistributed to the lower level to account for unobserved n-grams not included in the training dataset. This solution is more mathematically sound because the total probability associated with the entire set of n-gram models remains equal to 1. For example, discounting 0.2 from the 3-gram model results in 0.8 total probability remaining and 0.2 redistributed to the 2-gram model. This process is repeated for every lower n-gram model. The amount of probability to be discounted can be defined by the user, estimated via simulations, or derived using the Good-Turing method. Good-Turing frequency estimation (https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation) is commonly used for estimating the frequency of unobserved events and is implemented in the edgeR package via the goodTuring function. An example is presented below (formula and R implementation with the ‘i will be’ 3-gram).

\(P(w_i|w_{i-n+1}...w_{i-1})=\begin{cases}d_{w_{i-n+1}...w_{i}}\frac{C(w_{i-n+1}...w_{i-1}w_i)}{C(w_{i-n+1}...w_{i-1})},& \text{if } C(w_{i-n+1}...w_{i-1}w_i)> k\\\alpha_{w_{i-n+1}...w_{i-1}}P_{bo}(w_i|w_{i-n+2}...w_{i-1})& \text{otherwise }\end{cases}\)

where

\(P_{bo}\) is the probability associated with the lower-level n-gram model
\(d_{w_{i-n+1}...w_{i}}\) is the discount associated with the current n-gram
\(k\) is the count threshold for backing off to the lower n-gram model (generally equal to 0)
\(\alpha_{w_{i-n+1}...w_{i-1}}\) is the back-off weight, computed as follows

\(\alpha_{w_{i-n+1}...w_{i-1}}=\frac{\beta_{w_{i-n+1}...w_{i-1}}}{\sum_{w_i:C(w_{i-n+1}...w_i)\leq k}P_{bo}(w_i|w_{i-n+2}...w_{i-1})}\)

with

\(\beta_{w_{i-n+1}...w_{i-1}} = 1 - \sum_{w_i:C(w_{i-n+1}...w_i)>k}d_{w_{i-n+1}...w_{i}}\frac{C(w_{i-n+1}...w_{i-1}w_i)}{C(w_{i-n+1}...w_{i-1})}\)

# load the edgeR library (provides the goodTuring function)

library(edgeR)

#calculate discounts based on the Good-Turing estimation for unobserved ngrams

discount_5_gram <- sum(goodTuring(ngrams_5_freq$freq)['proportion'][[1]])
discount_4_gram <- sum(goodTuring(ngrams_4_freq$freq)['proportion'][[1]])
discount_3_gram <- sum(goodTuring(ngrams_3_freq$freq)['proportion'][[1]])
discount_2_gram <- sum(goodTuring(ngrams_2_freq$freq)['proportion'][[1]])

example <- "i will be"

words <- unlist(strsplit(example," "))

#discounted 3-gram probabilities

p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1:2],collapse=" "), 'P'] * discount_3_gram
tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1:2],collapse=" "), 'word3']
names(p_3_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_3_gram)
discounted_P_3_gram <- 1 - sum(p_3_gram)

#discounted 2-gram probabilities

p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] &
                          !ngrams_2_freq$word2 %in% tail_word_3_gram, 'P']
p_2_gram <- p_2_gram * discounted_P_3_gram/sum(p_2_gram) * discount_2_gram
tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] & 
                                  !ngrams_2_freq$word2 %in% tail_word_3_gram, 'word2']
names(p_2_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_2_gram, sep=" ")
discounted_P_2_gram <- 1 - sum(p_2_gram) - sum(p_3_gram)

#discounted 1-gram probabilities

tail_word_1_gram <- dictionary[!dictionary %in% tail_word_3_gram &
                               !dictionary %in% tail_word_2_gram]
p_1_gram <- 1/length(dictionary) * discounted_P_2_gram/sum(1/length(dictionary) * length(tail_word_1_gram))
p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))
names(p_1_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_1_gram)

#final probability vector

p_final <- c(p_3_gram, p_2_gram, p_1_gram)
p_final['i will be']
 i will be 
0.02666113 

The performance of the Katz back-off algorithm on the testing dataset is summarized in the table below.

          Accuracy 3w Accuracy 2w Accuracy 1w Coverage Perplexity Time (sec)
KBO 5->1g       0.149       0.125       0.081    0.756        959    0.01677
KBO 3->1g       0.167       0.140       0.094    0.794        611    0.00819

Conclusions

The table below compares the performance of the individual models and of the algorithms based on linear interpolation, Laplace smoothing, Stupid back-off and Katz back-off. Linear interpolation was used to merge the probability distributions of the 3-, 2- and 1-gram models, whereas Laplace smoothing was applied to all the models. The back-off algorithms were used to compute the probabilities backing off from the 5-gram and the 3-gram models down to the 1-gram model (5->1-gram and 3->1-gram).

            Accuracy.3w Accuracy.2w Accuracy.1w Coverage Perplexity Time..sec.
2-grams           0.150       0.130       0.090    0.790         NA    0.00376
3-grams           0.100       0.080       0.060    0.400         NA    0.00375
4-grams           0.030       0.030       0.030    0.140         NA    0.00762
5-grams           0.010       0.010       0.000    0.020         NA    0.00209
Linear int.       0.198       0.163       0.119    1.000       2414    0.03233
Laplace 5g        0.076       0.055       0.037    0.893         77    0.00679
Laplace 4g        0.094       0.074       0.056    0.907        105    0.01550
Laplace 3g        0.147       0.118       0.081    0.914         70    0.01117
Laplace 2g        0.166       0.136       0.093    0.938        371    0.01368
SBO 5->1g         0.164       0.135       0.094    0.756      23587    0.01585
SBO 3->1g         0.184       0.151       0.105    0.794       2034    0.00760
KBO 5->1g         0.149       0.125       0.081    0.756        959    0.01677
KBO 3->1g         0.167       0.140       0.094    0.794        611    0.00819

The results above show that integrating different n-gram models clearly improves prediction accuracy. Linear interpolation of the 3-, 2- and 1-gram models achieves the highest score in terms of accuracy, whereas the Katz back-off algorithm scores best in terms of perplexity (although with considerably lower coverage, only slightly better than the much simpler 2-gram model). As expected, the individual n-gram models are considerably faster than the more complex algorithms, but their lower accuracy does not justify their use in the final Shiny app. Surprisingly, the Stupid back-off algorithm has a higher accuracy than the Katz back-off, but also a higher perplexity. Back-off algorithms including five models (from the 5- down to the 1-gram model) are considerably slower than those including only three models (from the 3- down to the 1-gram model), without a clear improvement in terms of accuracy or perplexity. Laplace smoothing does not shine in terms of either performance or accuracy.

The linear interpolation, Stupid back-off and Katz back-off algorithms were finally selected for developing the Shiny app Pr3d1ct0, already available at filippo-ciceri.shinyapps.io/Pr3d1ct0/. A quick introduction to Pr3d1ct0 is also available here.

Appendix

Exploratory data analysis

data_blogs <- readLines("Coursera-SwiftKey-en_US\\en_US.blogs.txt")
number_lines_blogs <- format(length(data_blogs), nsmall=0, big.mark=',')
line_length_blogs <- unlist(lapply(data_blogs, nchar))
max_line_length_blogs <- format(max(line_length_blogs), nsmall=0, big.mark=',')
min_line_length_blogs <- format(min(line_length_blogs), nsmall=0, big.mark=',')
mean_line_length_blogs <- format(mean(line_length_blogs), nsmall=3, big.mark=',')

text_blogs <- paste(data_blogs, collapse=" ")

all_words_blogs <- strsplit(text_blogs, " ")[[1]]
characters_only_blogs <- all_words_blogs[grep("^[a-zA-Z]+$",all_words_blogs)]
digits_only_blogs <- all_words_blogs[grep("^[0-9]+$",all_words_blogs)]
characters_and_digits_blogs <- all_words_blogs[grep("^[a-zA-Z][0-9]+$", all_words_blogs)]
x <- all_words_blogs[grep("[a-zA-Z][^0-9]",all_words_blogs)]
characters_and_punctuations_blogs <- x[!(x %in% characters_only_blogs)]
y <- all_words_blogs[grep("[0-9]+",all_words_blogs)]
digits_and_punctuations_blogs <- y[!(y %in% digits_only_blogs)]
rm(x)
rm(y)

characters_and_punctuations_blogs <- gsub("â???T","'",characters_and_punctuations_blogs)
characters_and_punctuations_blogs <- gsub("[â???\\W\\W]","",characters_and_punctuations_blogs)
characters_and_punctuations_blogs <- gsub("^\\W|\\W$","",characters_and_punctuations_blogs)
characters_only_blogs <- c(characters_only_blogs, gsub("^\\W|\\W$","",characters_and_punctuations_blogs))

all_words_blogs_n <- format(length(all_words_blogs), nsmall=0, big.mark=',')
characters_only_blogs <- format(length(characters_only_blogs), nsmall=0, big.mark=',')
digits_only_blogs <- format(length(digits_only_blogs), nsmall=0, big.mark=',')
characters_and_punctuations_blogs <- format(length(characters_and_punctuations_blogs), nsmall=0, big.mark=',')
digits_and_punctuations_blogs <- format(length(digits_and_punctuations_blogs), nsmall=0, big.mark=',')

rm(text_blogs)


data_news <- readLines("Coursera-SwiftKey-en_US\\en_US.news.txt")
number_lines_news <- format(length(data_news), nsmall=0, big.mark=',')
line_length_news <- unlist(lapply(data_news, nchar))
max_line_length_news <- format(max(line_length_news), nsmall=0, big.mark=',')
min_line_length_news <- format(min(line_length_news), nsmall=0, big.mark=',')
mean_line_length_news <- format(mean(line_length_news), nsmall=3, big.mark=',')

text_news <- paste(data_news, collapse=" ")

all_words_news <- strsplit(text_news, " ")[[1]]
characters_only_news <- all_words_news[grep("^[a-zA-Z]+$",all_words_news)]
digits_only_news <- all_words_news[grep("^[0-9]+$",all_words_news)]
characters_and_digits_news <- all_words_news[grep("^[a-zA-Z][0-9]+$", all_words_news)]
x <- all_words_news[grep("[a-zA-Z][^0-9]",all_words_news)]
characters_and_punctuations_news <- x[!(x %in% characters_only_news)]
y <- all_words_news[grep("[0-9]+",all_words_news)]
digits_and_punctuations_news <- y[!(y %in% digits_only_news)]
rm(x)
rm(y)

characters_and_punctuations_news <- gsub("â???T","'",characters_and_punctuations_news)
characters_and_punctuations_news <- gsub("[â???\\W\\W]","",characters_and_punctuations_news)
characters_and_punctuations_news <- gsub("^\\W|\\W$","",characters_and_punctuations_news)
characters_only_news <- c(characters_only_news, gsub("^\\W|\\W$","",characters_and_punctuations_news))

all_words_news_n <- format(length(all_words_news), nsmall=0, big.mark=',')
characters_only_news <- format(length(characters_only_news), nsmall=0, big.mark=',')
digits_only_news <- format(length(digits_only_news), nsmall=0, big.mark=',')
characters_and_punctuations_news <- format(length(characters_and_punctuations_news), nsmall=0, big.mark=',')
digits_and_punctuations_news <- format(length(digits_and_punctuations_news), nsmall=0, big.mark=',')

rm(text_news)


data_twitter <- readLines("Coursera-SwiftKey-en_US\\en_US.twitter.txt")
number_lines_twitter <- format(length(data_twitter), nsmall=0, big.mark=',')
line_length_twitter <- unlist(lapply(data_twitter, nchar))
max_line_length_twitter <- format(max(line_length_twitter), nsmall=0, big.mark=',')
min_line_length_twitter <- format(min(line_length_twitter), nsmall=0, big.mark=',')
mean_line_length_twitter <- format(mean(line_length_twitter), nsmall=3, big.mark=',')

text_twitter <- paste(data_twitter, collapse=" ")

all_words_twitter <- strsplit(text_twitter, " ")[[1]]
characters_only_twitter <- all_words_twitter[grep("^[a-zA-Z]+$",all_words_twitter)]
digits_only_twitter <- all_words_twitter[grep("^[0-9]+$",all_words_twitter)]
characters_and_digits_twitter <- all_words_twitter[grep("^[a-zA-Z][0-9]+$", all_words_twitter)]
x <- all_words_twitter[grep("[a-zA-Z][^0-9]",all_words_twitter)]
characters_and_punctuations_twitter <- x[!(x %in% characters_only_twitter)]
y <- all_words_twitter[grep("[0-9]+",all_words_twitter)]
digits_and_punctuations_twitter <- y[!(y %in% digits_only_twitter)]
rm(x)
rm(y)

characters_and_punctuations_twitter <- gsub("â???T","'",characters_and_punctuations_twitter)
characters_and_punctuations_twitter <- gsub("[â???\\W\\W]","",characters_and_punctuations_twitter)
characters_and_punctuations_twitter <- gsub("^\\W|\\W$","",characters_and_punctuations_twitter)
characters_only_twitter <- c(characters_only_twitter, gsub("^\\W|\\W$","",characters_and_punctuations_twitter))

all_words_twitter_n <- format(length(all_words_twitter), nsmall=0, big.mark=',')
characters_only_twitter <- format(length(characters_only_twitter), nsmall=0, big.mark=',')
digits_only_twitter <- format(length(digits_only_twitter), nsmall=0, big.mark=',')
characters_and_punctuations_twitter <- format(length(characters_and_punctuations_twitter), nsmall=0, big.mark=',')
digits_and_punctuations_twitter <- format(length(digits_and_punctuations_twitter), nsmall=0, big.mark=',')

rm(text_twitter)

Subsetting and filtering the dataset

Generating the training and testing dataset

set.seed(1234)
training_dataset <- c(data_blogs[sample(length(data_blogs), 0.4*length(data_blogs), replace=FALSE)],
                      data_news[sample(length(data_news), 0.4*length(data_news), replace=FALSE)],
                      data_twitter[sample(length(data_twitter), 0.4*length(data_twitter), replace=FALSE)])

testing_dataset <- c(data_blogs[!data_blogs %in% training_dataset],
                     data_news[!data_news %in% training_dataset],
                     data_twitter[!data_twitter %in% training_dataset])

training_dataset <- training_dataset[!is.na(training_dataset)]
testing_dataset <- testing_dataset[!is.na(testing_dataset)]

rm(data_news)
rm(data_twitter)
rm(data_blogs)
rm(all_words_blogs)
rm(all_words_news)
rm(all_words_twitter)

Filtering the training dataset

training_dataset <- tolower(gsub("[^a-zA-Z' ]","",training_dataset))
training_dataset <- gsub("( ') | (' ) | ^' | '$","",training_dataset)
text_training <- paste(training_dataset, collapse=" ")
all_words <- regmatches(text_training,gregexpr("[a-z']+",text_training))[[1]]
all_words_n_1 <- format(length(all_words), nsmall=0, big.mark=',')


prophanities <- readLines("prophanities_list.txt")
prophanities <- gsub('.{3}$', '', prophanities)
all_words <- all_words[!all_words %in% prophanities]
all_words_n_2 <- format(length(all_words), nsmall=0, big.mark=',')


word_frequency_df <- data.frame(sort(table(all_words), decreasing =TRUE))
word_frequency_df['P'] <- word_frequency_df$Freq / sum(word_frequency_df$Freq)
word_frequency_df['cum_P'] <- cumsum(word_frequency_df$P)

# plot(word_frequency_df[word_frequency_df['cum_P']<0.9,'Freq'],
#      ylim=c(0,100000),
#      ylab='Freq in all_words',
#      xlab='Index of words providing 90% dataset coverage')

coverage_50 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.5,])[1]
coverage_90 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.9,])[1]
coverage_95 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.95,])[1]

coverage_50 <- format(coverage_50, nsmall=0, big.mark=',')
coverage_90 <- format(coverage_90, nsmall=0, big.mark=',')
coverage_95 <- format(coverage_95, nsmall=0, big.mark=',')


word_frequency_df[1:10,c("all_words","Freq")]

Filtering non-english words (OOV words)

english_dict <- readLines('english_dict_10000.csv')
english_dict <- tolower(english_dict)
english_dict <- unique(english_dict[grep("[a-zA-Z']+",english_dict)])

p_OOV <- mean(!all_words %in% english_dict)*100
p_OOV <- format(round(as.numeric(p_OOV),2), nsmall=2)

all_words <- all_words[all_words %in% english_dict]

word_frequency_df <- data.frame(sort(table(all_words), decreasing =TRUE))
word_frequency_df['P'] <- word_frequency_df$Freq / sum(word_frequency_df$Freq)
word_frequency_df['cum_P'] <- cumsum(word_frequency_df$P)

# plot(word_frequency_df[word_frequency_df['cum_P']<0.9,'Freq'],
#      ylim=c(0,100000),
#      ylab='Freq in all_words',
#      xlab='Index of words providing 90% dataset coverage')

coverage_50 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.5,])[1]
coverage_90 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.9,])[1]
coverage_95 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.95,])[1]

coverage_50 <- format(coverage_50, nsmall=0, big.mark=',')
coverage_90 <- format(coverage_90, nsmall=0, big.mark=',')
coverage_95 <- format(coverage_95, nsmall=0, big.mark=',')

# word_frequency_df[word_frequency_df['all_words']=='agreeing',]

Stemming the dataset

library(SnowballC)
all_words_stemmed <- wordStem(all_words, language = "en")

word_frequency_df_stemmed <- data.frame(sort(table(all_words_stemmed), decreasing =TRUE))
word_frequency_df_stemmed['P'] <- word_frequency_df_stemmed$Freq/sum(word_frequency_df_stemmed$Freq)
word_frequency_df_stemmed['cumP'] <- cumsum(word_frequency_df_stemmed$P)

coverage_50_stemmed <- word_frequency_df_stemmed[word_frequency_df_stemmed['cumP']<=0.50,]
coverage_90_stemmed <- word_frequency_df_stemmed[word_frequency_df_stemmed['cumP']<=0.90,]
coverage_95_stemmed <- word_frequency_df_stemmed[word_frequency_df_stemmed['cumP']<=0.95,]

coverage_50_stemmed <- format(dim(coverage_50_stemmed)[1], nsmall=0, big.mark=',')
coverage_90_stemmed <- format(dim(coverage_90_stemmed)[1], nsmall=0, big.mark=',')
coverage_95_stemmed <- format(dim(coverage_95_stemmed)[1], nsmall=0, big.mark=',')

rm(all_words_stemmed)
rm(word_frequency_df_stemmed)

Building the n-gram models

1-gram model

ngrams_1 <- all_words
ngrams_1_freq <- word_frequency_df
names(ngrams_1_freq) <- c("ngram","Freq","P","cum_P")
ngrams_1_freq['ngram'] <- as.character(ngrams_1_freq$ngram)

dictionary <- ngrams_1_freq$ngram

2-gram model

text_training <- tolower(text_training)
ngrams_2 <- regmatches(text_training, gregexpr("([a-z']+ [a-z']+)", text_training))[[1]]
ngrams_2 <- ngrams_2[lapply(ngrams_2,length)>0]

total_ngrams_2 <- format(length(ngrams_2), big.mark=',')
unique_ngrams_2 <- format(length(unique(ngrams_2)), big.mark=',')


ngrams_2_2 <- matrix(unlist(strsplit(ngrams_2," ")), ncol=2, byrow=TRUE)

ngrams_2 <- ngrams_2[(!ngrams_2_2[,1] %in% prophanities)&
                     (!ngrams_2_2[,2] %in% prophanities)&
                     (ngrams_2_2[,1] %in% english_dict)&
                     (ngrams_2_2[,2] %in% english_dict)]

rm(ngrams_2_2)

total_ngrams_2 <- format(length(ngrams_2), big.mark=',')
unique_ngrams_2 <- format(length(unique(ngrams_2)), big.mark=',')


ngrams_2_freq <- data.frame(table(ngrams_2))
names(ngrams_2_freq) <- c("bigram","freq")

ngrams_2_freq['bigram'] <- as.character(ngrams_2_freq$bigram)
ngrams_2_freq['word1'] <- sapply(strsplit(ngrams_2_freq$bigram," "), `[`, 1)
ngrams_2_freq['word2'] <- sapply(strsplit(ngrams_2_freq$bigram," "), `[`, 2)

ngrams_2_freq <- ngrams_2_freq[,c('bigram','word1','word2','freq')]
ngrams_2_freq <- ngrams_2_freq[order(ngrams_2_freq$freq,decreasing = TRUE),]
ngrams_2_freq['cum_freq'] <- cumsum(ngrams_2_freq$freq)/sum(ngrams_2_freq$freq)

plot(ngrams_2_freq[ngrams_2_freq['cum_freq']<0.9,'freq'],
     ylim=c(0,1000),
     ylab='Freq in training dataset',
     xlab='Index of 2-grams providing 90% dataset coverage',
     type='l',
     main='2-grams sorted based on frequency')

coverage_50 <- ngrams_2_freq[ngrams_2_freq['cum_freq']<=0.5,]
coverage_90 <- ngrams_2_freq[ngrams_2_freq['cum_freq']<=0.9,]
coverage_95 <- ngrams_2_freq[ngrams_2_freq['cum_freq']<=0.95,]

coverage_50 <- format(dim(coverage_50)[1], nsmall=0, big.mark=',')
coverage_90 <- format(dim(coverage_90)[1], nsmall=0, big.mark=',')
coverage_95 <- format(dim(coverage_95)[1], nsmall=0, big.mark=',')


ngrams_2_freq[1:10,c("bigram","freq")]


pruning <- 1:10
ngrams_2_number <- sapply(pruning, FUN=function(x) {
    sum(ngrams_2_freq[ngrams_2_freq['freq']<x,'freq'])
})
compression <- sapply(ngrams_2_number, FUN=function(x) {
    x/length(ngrams_2)
})
df_p <- data.frame(pruning,ngrams_2_number,compression)

df_p[,'pruning'] <- sapply(df_p[,'pruning'], FUN=function(x){
  paste("<",as.character(x))
})
names(df_p) <- c("freq 2-grams removed","N","reduction")

df_p


ngrams_2_freq <- ngrams_2_freq[ngrams_2_freq['freq']>3,]


# below is my first approach to calculating 2-gram probabilities
# it is not incorrect per se, but it generates problems when the pruning process removes different
# ngrams from the 2-, 3-, 4- and 5-gram models.
# 
# bigram_matrix <- merge(ngrams_2_freq[,c('word1','word2','freq')], 
#                        ngrams_1_freq[,c('ngram','Freq')], 
#                        by.x="word1", 
#                        by.y="ngram", 
#                        all=TRUE)
# 
# the new approach deals better with the consequences of the pruning process

aggregate_1_gram <- aggregate(ngrams_2_freq$freq, by=list(Category=ngrams_2_freq$word1), FUN=sum)
ngrams_2_freq <- merge(ngrams_2_freq,
                       aggregate_1_gram,
                       by.x='word1',
                       by.y='Category')
names(ngrams_2_freq)[6] <- c('word1_freq')
ngrams_2_freq['P'] <- ngrams_2_freq['freq']/ngrams_2_freq['word1_freq']

# dim(ngrams_2_freq[is.na(ngrams_2_freq['word2']),])
# dim(ngrams_2_freq[is.na(ngrams_2_freq['word1']),])
# ngrams_2_freq[is.na(ngrams_2_freq['P']),'P']

size_bigram_freq <- format(object.size(ngrams_2_freq[,c(1:5)]), units = "Mb", standard = "auto", digits = 1L)
size_bigram_P <- format(object.size(ngrams_2_freq), units = "Mb", standard = "auto", digits = 1L)

library(reshape2)

bigram_matrix <- acast(ngrams_2_freq, word1~word2, value.var="P")
# sum(bigram_matrix_2[is.na(bigram_matrix_2)])

size_bigram_matrix <- format(object.size(bigram_matrix), units = "Mb", standard = "auto", digits = 1L)

# table(ngrams_2)['i will']/table(ngrams_1)['i']
# ngrams_2_freq[ngrams_2_freq['word1']=='i'& ngrams_2_freq['word2']=='will','P']
# sum(ngrams_2_freq[ngrams_2_freq['word1']=='i','P'])
# bigram_matrix['i','will']

3-gram model

text_training <- tolower(text_training)
ngrams_3 <- regmatches(text_training,gregexpr("([a-z']+ [a-z']+ [a-z']+)",text_training))[[1]]
ngrams_3 <- ngrams_3[!is.na(ngrams_3)]

total_ngrams_3 <- format(length(ngrams_3), big.mark=',')
unique_ngrams_3 <- format(length(unique(ngrams_3)), big.mark=',')


ngrams_3_2 <- matrix(unlist(strsplit(ngrams_3," ")), ncol=3, byrow=TRUE)

ngrams_3 <- ngrams_3[(!ngrams_3_2[,1] %in% prophanities)&
                     (!ngrams_3_2[,2] %in% prophanities)&
                     (!ngrams_3_2[,3] %in% prophanities)&
                     (ngrams_3_2[,1] %in% english_dict)&
                     (ngrams_3_2[,2] %in% english_dict)&
                     (ngrams_3_2[,3] %in% english_dict)]

rm(ngrams_3_2)

total_ngrams_3 <- format(length(ngrams_3), big.mark=',')
unique_ngrams_3 <- format(length(unique(ngrams_3)), big.mark=',')


ngrams_3_freq <- data.frame(table(sapply(ngrams_3, paste, collapse = " ")))

names(ngrams_3_freq) <- c("trigram","freq")

ngrams_3_freq['trigram'] <- as.character(ngrams_3_freq$trigram)
ngrams_3_freq['word1'] <- sapply(strsplit(as.character(ngrams_3_freq$trigram)," "), `[`, 1)
ngrams_3_freq['word2'] <- sapply(strsplit(as.character(ngrams_3_freq$trigram)," "), `[`, 2)
ngrams_3_freq['word3'] <- sapply(strsplit(as.character(ngrams_3_freq$trigram)," "), `[`, 3)

ngrams_3_freq <- ngrams_3_freq[,c('trigram','word1','word2','word3','freq')]
ngrams_3_freq <- ngrams_3_freq[order(ngrams_3_freq$freq,decreasing = TRUE),]

ngrams_3_freq['cum_freq'] <- cumsum(ngrams_3_freq$freq)/sum(ngrams_3_freq$freq)

plot(ngrams_3_freq[ngrams_3_freq['cum_freq']<0.9,'freq'],
     ylim=c(0,200),
     ylab='Freq in training dataset',
     xlab='Index of 3-grams providing 90% dataset coverage',
     type='l',
     main='3-grams sorted based on frequency')

coverage_50 <- ngrams_3_freq[ngrams_3_freq['cum_freq']<=0.5,]
coverage_90 <- ngrams_3_freq[ngrams_3_freq['cum_freq']<=0.9,]
coverage_95 <- ngrams_3_freq[ngrams_3_freq['cum_freq']<=0.95,]

coverage_50 <- format(dim(coverage_50)[1], nsmall=0, big.mark=',')
coverage_90 <- format(dim(coverage_90)[1], nsmall=0, big.mark=',')
coverage_95 <- format(dim(coverage_95)[1], nsmall=0, big.mark=',')


ngrams_3_freq[1:10,c("trigram","freq")]


pruning <- 1:10
ngrams_3_number <- sapply(pruning, FUN=function(x) {
    sum(ngrams_3_freq[ngrams_3_freq['freq']<x,'freq'])
})
compression <- sapply(ngrams_3_number, FUN=function(x) {
    x/length(ngrams_3)
})
df_p <- data.frame(pruning,ngrams_3_number,compression)

df_p[,'pruning'] <- sapply(df_p[,'pruning'], FUN=function(x){
  paste("<",as.character(x))
})
names(df_p) <- c("freq 3-grams removed","N","reduction")

df_p


ngrams_3_freq <- ngrams_3_freq[ngrams_3_freq['freq']>3,]


ngrams_3_freq['bigram'] <- paste(ngrams_3_freq$word1, ngrams_3_freq$word2, sep=" ")

#see comment in the 2-gram section regarding this
# trigram_matrix <- merge(ngrams_3_freq[,c('bigram','word3','freq')],
#                         ngrams_2_freq[,c('bigram','freq')],
#                         by.x="bigram",
#                         by.y="bigram",
#                         all=TRUE)

aggregate_2_gram <- aggregate(ngrams_3_freq$freq, by=list(Category=ngrams_3_freq$bigram), FUN=sum)
ngrams_3_freq <- merge(ngrams_3_freq,
                       aggregate_2_gram,
                       by.x='bigram',
                       by.y='Category')
names(ngrams_3_freq)[8] <- c('bigram_freq')
ngrams_3_freq['P'] <- ngrams_3_freq['freq']/ngrams_3_freq['bigram_freq']

# dim(ngrams_3_freq[is.na(ngrams_3_freq['word3']),])
# dim(ngrams_3_freq[is.na(ngrams_3_freq['word2']),])
# dim(ngrams_3_freq[is.na(ngrams_3_freq['word1']),])
# ngrams_3_freq[is.na(ngrams_3_freq['P']),'P']

size_trigram_freq <- format(object.size(ngrams_3_freq[,c(1:7)]), units = "Mb", standard = "auto", digits = 1L)
size_trigram_P <- format(object.size(ngrams_3_freq), units = "Mb", standard = "auto", digits = 1L)

trigram_matrix <- acast(ngrams_3_freq, bigram~word3, value.var="P")
# sum(trigram_matrix_2[is.na(trigram_matrix_2)])

size_trigram_matrix <- format(object.size(trigram_matrix), units = "Mb", standard = "auto", digits = 1L)

# table(ngrams_3)['thanks for the']/table(ngrams_2)['thanks for']
# ngrams_3_freq[ngrams_3_freq['bigram']=='thanks for' & ngrams_3_freq['word3']=='the','P']
# sum(ngrams_3_freq[ngrams_3_freq['bigram']=='thanks for','P'])
# trigram_matrix['thanks for','the']

4-gram model

text_training <- tolower(text_training)
ngrams_4 <- regmatches(text_training,gregexpr("([a-z']+ [a-z']+ [a-z']+ [a-z']+)",text_training))[[1]]
ngrams_4 <- ngrams_4[!is.na(ngrams_4)]

ngrams_4_2 <- matrix(unlist(strsplit(ngrams_4," ")), ncol=4, byrow=TRUE)

ngrams_4 <- ngrams_4[(!ngrams_4_2[,1] %in% prophanities)&
                     (!ngrams_4_2[,2] %in% prophanities)&
                     (!ngrams_4_2[,3] %in% prophanities)&
                     (!ngrams_4_2[,4] %in% prophanities)&
                     (ngrams_4_2[,1] %in% english_dict)&
                     (ngrams_4_2[,2] %in% english_dict)&
                     (ngrams_4_2[,3] %in% english_dict)&
                     (ngrams_4_2[,4] %in% english_dict)]

rm(ngrams_4_2)
ngrams_4 <- ngrams_4[!is.na(ngrams_4)]

total_ngrams_4 <- format(length(ngrams_4), big.mark=',')
unique_ngrams_4 <- format(length(unique(ngrams_4)), big.mark=',')

ngrams_4_freq <- data.frame(table(sapply(ngrams_4, paste, collapse = " ")))

names(ngrams_4_freq) <- c("tetragram","freq")

ngrams_4_freq['tetragram'] <- as.character(ngrams_4_freq$tetragram)
ngrams_4_freq['word1'] <- sapply(strsplit(as.character(ngrams_4_freq$tetragram)," "), `[`, 1)
ngrams_4_freq['word2'] <- sapply(strsplit(as.character(ngrams_4_freq$tetragram)," "), `[`, 2)
ngrams_4_freq['word3'] <- sapply(strsplit(as.character(ngrams_4_freq$tetragram)," "), `[`, 3)
ngrams_4_freq['word4'] <- sapply(strsplit(as.character(ngrams_4_freq$tetragram)," "), `[`, 4)

ngrams_4_freq <- ngrams_4_freq[,c('tetragram','word1','word2','word3','word4','freq')]
ngrams_4_freq <- ngrams_4_freq[order(ngrams_4_freq$freq,decreasing = TRUE),]

ngrams_4_freq['cum_freq'] <- cumsum(ngrams_4_freq$freq)/sum(ngrams_4_freq$freq)

coverage_50 <- ngrams_4_freq[ngrams_4_freq['cum_freq']<=0.5,]
coverage_90 <- ngrams_4_freq[ngrams_4_freq['cum_freq']<=0.9,]
coverage_95 <- ngrams_4_freq[ngrams_4_freq['cum_freq']<=0.95,]

coverage_50 <- format(dim(coverage_50)[1], nsmall=0, big.mark=',')
coverage_90 <- format(dim(coverage_90)[1], nsmall=0, big.mark=',')
coverage_95 <- format(dim(coverage_95)[1], nsmall=0, big.mark=',')

ngrams_4_freq[1:10,c("tetragram","freq")]
pruning <- 1:10
ngrams_4_number <- sapply(pruning, FUN=function(x) {
    sum(ngrams_4_freq[ngrams_4_freq['freq']<x,'freq'])
})
compression <- sapply(ngrams_4_number, FUN=function(x) {
    x/length(ngrams_4)
})
df_p <- data.frame(pruning,ngrams_4_number,compression)

df_p[,'pruning'] <- sapply(df_p[,'pruning'], FUN=function(x){
  paste("<",as.character(x))
})
names(df_p) <- c("freq 4-grams removed","N","reduction")

df_p

ngrams_4_freq <- ngrams_4_freq[ngrams_4_freq['freq']>1,]


ngrams_4_freq['trigram'] <- paste(ngrams_4_freq$word1, ngrams_4_freq$word2, ngrams_4_freq$word3, sep=" ")

#see comment in the 2-gram section regarding this
# tetragram_matrix <- merge(ngrams_4_freq[,c('trigram','word4','freq')],
#                           ngrams_3_freq[,c('trigram','freq')],
#                           by.x="trigram",
#                           by.y="trigram",
#                           all=TRUE)

aggregate_3_gram <- aggregate(ngrams_4_freq$freq, by=list(Category=ngrams_4_freq$trigram), FUN=sum)
ngrams_4_freq <- merge(ngrams_4_freq,
                       aggregate_3_gram,
                       by.x='trigram',
                       by.y='Category')
names(ngrams_4_freq)[9] <- c('trigram_freq')
ngrams_4_freq['P'] <- ngrams_4_freq['freq']/ngrams_4_freq['trigram_freq']

# dim(ngrams_4_freq[is.na(ngrams_4_freq['word4']),])
# dim(ngrams_4_freq[is.na(ngrams_4_freq['word3']),])
# dim(ngrams_4_freq[is.na(ngrams_4_freq['word2']),])
# dim(ngrams_4_freq[is.na(ngrams_4_freq['word1']),])
# ngrams_4_freq[is.na(ngrams_4_freq['P']),'P']

size_tetragram_freq <- format(object.size(ngrams_4_freq[,c(1:8)]), units = "Mb", standard = "auto", digits = 1L)
size_tetragram_P <- format(object.size(ngrams_4_freq), units = "Mb", standard = "auto", digits = 1L)

tetragram_matrix <- acast(ngrams_4_freq, trigram~word4, value.var="P")
# sum(tetragram_matrix[is.na(tetragram_matrix)])

size_tetragram_matrix <- format(object.size(tetragram_matrix),
                                units = "Mb", standard = "auto", digits = 1L)

# table(ngrams_4)['thanks for the follow']/table(ngrams_3)['thanks for the']
# ngrams_4_freq[ngrams_4_freq['trigram']=='thanks for the' & ngrams_4_freq['word4']=='follow','P']
# sum(ngrams_4_freq[ngrams_4_freq['trigram']=='thanks for the','P'])
# tetragram_matrix['thanks for the','follow']

5-gram model

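# the 5-gram model is built with the same steps used for the lower-order models:
# extract 5-word sequences, drop those containing profanities or out-of-vocabulary words, then tabulate frequencies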
text_training <- tolower(text_training)
ngrams_5 <- regmatches(text_training,gregexpr("([a-z']+ [a-z']+ [a-z']+ [a-z']+ [a-z']+)",text_training))[[1]]
ngrams_5 <- ngrams_5[!is.na(ngrams_5)]

ngrams_5_2 <- matrix(unlist(strsplit(ngrams_5," ")), ncol=5, byrow=TRUE)

ngrams_5 <- ngrams_5[(!ngrams_5_2[,1] %in% prophanities)&
                     (!ngrams_5_2[,2] %in% prophanities)&
                     (!ngrams_5_2[,3] %in% prophanities)&
                     (!ngrams_5_2[,4] %in% prophanities)&
                     (!ngrams_5_2[,5] %in% prophanities)&
                     (ngrams_5_2[,1] %in% english_dict)&
                     (ngrams_5_2[,2] %in% english_dict)&
                     (ngrams_5_2[,3] %in% english_dict)&
                     (ngrams_5_2[,4] %in% english_dict)&
                     (ngrams_5_2[,5] %in% english_dict)]

ngrams_5 <- ngrams_5[!is.na(ngrams_5)]
rm(ngrams_5_2)

total_ngrams_5 <- format(length(ngrams_5), big.mark=',')
unique_ngrams_5 <- format(length(unique(ngrams_5)), big.mark=',')

ngrams_5_freq <- data.frame(table(sapply(ngrams_5, paste, collapse = " ")))

names(ngrams_5_freq) <- c("pentagram","freq")

ngrams_5_freq['pentagram'] <- as.character(ngrams_5_freq$pentagram)
ngrams_5_freq['word1'] <- sapply(strsplit(as.character(ngrams_5_freq$pentagram)," "), `[`, 1)
ngrams_5_freq['word2'] <- sapply(strsplit(as.character(ngrams_5_freq$pentagram)," "), `[`, 2)
ngrams_5_freq['word3'] <- sapply(strsplit(as.character(ngrams_5_freq$pentagram)," "), `[`, 3)
ngrams_5_freq['word4'] <- sapply(strsplit(as.character(ngrams_5_freq$pentagram)," "), `[`, 4)
ngrams_5_freq['word5'] <- sapply(strsplit(as.character(ngrams_5_freq$pentagram)," "), `[`, 5)

ngrams_5_freq <- ngrams_5_freq[,c('pentagram','word1','word2','word3','word4','word5','freq')]
ngrams_5_freq <- ngrams_5_freq[order(ngrams_5_freq$freq,decreasing = TRUE),]

ngrams_5_freq['cum_freq'] <- cumsum(ngrams_5_freq$freq)/sum(ngrams_5_freq$freq)

coverage_50 <- ngrams_5_freq[ngrams_5_freq['cum_freq']<=0.5,]
coverage_90 <- ngrams_5_freq[ngrams_5_freq['cum_freq']<=0.9,]
coverage_95 <- ngrams_5_freq[ngrams_5_freq['cum_freq']<=0.95,]

coverage_50 <- format(dim(coverage_50)[1], nsmall=0, big.mark=',')
coverage_90 <- format(dim(coverage_90)[1], nsmall=0, big.mark=',')
coverage_95 <- format(dim(coverage_95)[1], nsmall=0, big.mark=',')

ngrams_5_freq[1:10,c("pentagram","freq")]
pruning <- 1:10
ngrams_5_number <- sapply(pruning, FUN=function(x) {
    sum(ngrams_5_freq[ngrams_5_freq['freq']<x,'freq'])
})
compression <- sapply(ngrams_5_number, FUN=function(x) {
    x/length(ngrams_5)
})
df_p <- data.frame(pruning,ngrams_5_number,compression)

df_p[,'pruning'] <- sapply(df_p[,'pruning'], FUN=function(x){
  paste("<",as.character(x))
})
names(df_p) <- c("freq 5-grams removed","N","reduction")

df_p
ngrams_5_freq <- ngrams_5_freq[ngrams_5_freq['freq']>1,]

ngrams_5_freq['tetragram'] <- paste(ngrams_5_freq$word1, ngrams_5_freq$word2, 
                                    ngrams_5_freq$word3, ngrams_5_freq$word4, sep=" ")

#see comment in the 2-gram section regarding this
# pentagram_matrix <- merge(ngrams_5_freq[,c('tetragram','word5','freq')],
#                           ngrams_4_freq[,c('tetragram','freq')],
#                           by.x="tetragram",
#                           by.y="tetragram",
#                           all=TRUE)

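# MLE conditional probability of word5 given the leading tetragram, computed from the aggregated 5-gram counts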
aggregate_4_gram <- aggregate(ngrams_5_freq$freq, by=list(Category=ngrams_5_freq$tetragram), FUN=sum)
ngrams_5_freq <- merge(ngrams_5_freq,
                       aggregate_4_gram,
                       by.x='tetragram',
                       by.y='Category')
names(ngrams_5_freq)[10] <- c('tetragram_freq')
ngrams_5_freq['P'] <- ngrams_5_freq['freq']/ngrams_5_freq['tetragram_freq']

# dim(ngrams_5_freq[is.na(ngrams_5_freq['word5']),])
# dim(ngrams_5_freq[is.na(ngrams_5_freq['word4']),])
# dim(ngrams_5_freq[is.na(ngrams_5_freq['word3']),])
# dim(ngrams_5_freq[is.na(ngrams_5_freq['word2']),])
# dim(ngrams_5_freq[is.na(ngrams_5_freq['word1']),])
# ngrams_5_freq[is.na(ngrams_5_freq['P']),'P']

size_pentagram_freq <- format(object.size(ngrams_5_freq[,c(1:9)]), units = "Mb", standard = "auto", digits = 1L)
size_pentagram_P <- format(object.size(ngrams_5_freq), units = "Mb", standard = "auto", digits = 1L)

pentagram_matrix <- acast(ngrams_5_freq, tetragram~word5, value.var="P")
# sum(pentagram_matrix[is.na(pentagram_matrix)])

size_pentagram_matrix <- format(object.size(pentagram_matrix), 
                                units = "Mb", standard = "auto", digits = 1L)

# table(ngrams_5)['thank you so much for']/table(ngrams_4)['thank you so much']
# ngrams_5_freq[ngrams_5_freq['tetragram']=='thank you so much' & ngrams_5_freq['word5']=='for','P']
# sum(ngrams_5_freq[ngrams_5_freq['tetragram']=='thank you so much','P'])
# pentagram_matrix['thank you so much','for']

Summary of the models

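# number of unique n-grams and memory footprint of the frequency data frames, the probability data frames and the full P matrices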
df <- data.frame(n=c(length(ngrams_1_freq$ngram), 
                    length(ngrams_2_freq$bigram),
                    length(ngrams_3_freq$trigram),
                    length(ngrams_4_freq$tetragram),
                    length(ngrams_5_freq$pentagram)),
                s1=c(format(object.size(ngrams_1_freq[,1:2]), units = "Mb", standard = "auto", digits = 1L),
                     format(object.size(ngrams_2_freq[,1:5]), units = "Mb", standard = "auto", digits = 1L),
                     format(object.size(ngrams_3_freq[,1:7]), units = "Mb", standard = "auto", digits = 1L),
                     format(object.size(ngrams_4_freq[,1:8]), units = "Mb", standard = "auto", digits = 1L),
                     format(object.size(ngrams_5_freq[,1:9]), units = "Mb", standard = "auto", digits = 1L)),
                s2=c(format(object.size(ngrams_1_freq), units = "Mb", standard = "auto", digits = 1L),
                     format(object.size(ngrams_2_freq), units = "Mb", standard = "auto", digits = 1L),
                     format(object.size(ngrams_3_freq), units = "Mb", standard = "auto", digits = 1L),
                     format(object.size(ngrams_4_freq), units = "Mb", standard = "auto", digits = 1L),
                     format(object.size(ngrams_5_freq), units = "Mb", standard = "auto", digits = 1L)),
                s3=c(NA,
                     format(object.size(bigram_matrix), units = "Mb", standard = "auto", digits = 1L),
                     format(object.size(trigram_matrix), units = "Mb", standard = "auto", digits = 1L),
                     size_tetragram_matrix,
                     size_pentagram_matrix))
rownames(df) <- c("1-gram","2-gram","3-gram","4-gram","5-gram")
colnames(df) <- c("ngrams N",
                  "Freq dataframe",
                  "P dataframe",
                  "Full P matrix")
df                


rm(bigram_matrix)               
rm(trigram_matrix)               
rm(tetragram_matrix)               
rm(pentagram_matrix) 

Testing the models

The testing dataset

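# random 0.01% sample of the testing dataset, used for all the evaluations below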
set.seed(1234)
testing_dataset_small <- sample(testing_dataset,length(testing_dataset)*0.0001)
testing_lines <- format(length(testing_dataset_small), big.mark = ',')
testing_words <- sapply(testing_dataset_small, FUN=function(x){
  length(strsplit(x," ")[[1]])
})
testing_words <- format(sum(testing_words), big.mark = ',')


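# 2-gram model evaluation: for each consecutive word pair, predict the second word from the bigrams starting
# with the first one and record top-3/2/1 accuracy, a no-prediction flag, a placeholder for perplexity and the lookup time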
predictions <- sapply(testing_dataset_small, FUN=function(x) {
    x <- tolower(x)
    words <- unlist(strsplit(x, " "))
    if (length(words)>1) {
        results <- c()
        for (i in 1:(length(words)-1)) {
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            start <- Sys.time()
            prediction_df <- ngrams_2_freq[ngrams_2_freq['word1']==word1,]
            prediction_df <- prediction_df[order(-prediction_df$P),'word2']
            prediction_3words <- prediction_df[1:3]
            prediction_2words <- prediction_df[1:2]
            prediction_1words <- prediction_df[1]
            end <- Sys.time()
            t <- as.numeric(end - start)
            if (!is.na(word2)) {
                results <- c(results,
                             word2 %in% prediction_3words,
                             word2 %in% prediction_2words,
                             word2 %in% prediction_1words,
                             is.na(prediction_3words[1])&is.na(prediction_3words[2])&is.na(prediction_3words[3]),
                             NA,
                             t)
            }
            results
        }
        results2 <- matrix(results, ncol=6, byrow=TRUE)
        results2_means <- unlist(lapply(data.frame(results2), FUN=mean, MARGIN=2))
        results2_means
    }
})

predictions <- matrix(unlist(predictions), ncol=6, byrow=TRUE)
predictions_accuracy_means_2_gram <- sapply(data.frame(predictions), FUN=mean)


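# same evaluation with the 3-gram model: predict word3 from the preceding bigram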
predictions <- sapply(testing_dataset_small, FUN=function(x) {
    x <- tolower(x)
    words <- unlist(strsplit(x, " "))
    if (length(words)>2) {
        results <- c()
        for (i in 1:(length(words)-2)) {
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            word3 <- gsub("[^a-z']","",words[i+2])
            start <- Sys.time()
            prediction_df <- ngrams_3_freq[ngrams_3_freq['word1']==word1 & 
                                           ngrams_3_freq['word2']==word2,]
            prediction_df <- prediction_df[order(-prediction_df$P),'word3']
            prediction_3words <- prediction_df[1:3]
            prediction_2words <- prediction_df[1:2]
            prediction_1words <- prediction_df[1]
            end <- Sys.time()
            t <- as.numeric(end - start)
            if (!is.na(word3)) {
                results <- c(results,
                             word3 %in% prediction_3words,
                             word3 %in% prediction_2words,
                             word3 %in% prediction_1words,
                             is.na(prediction_3words[1])&is.na(prediction_3words[2])&is.na(prediction_3words[3]),
                             NA,
                             t)
            }
            results
        }
        results2 <- matrix(results, ncol=6, byrow=TRUE)
        results2_means <- unlist(lapply(data.frame(results2), FUN=mean, MARGIN=2))
        results2_means
    }
})

predictions <- matrix(unlist(predictions), ncol=6, byrow=TRUE)
predictions_accuracy_means_3_gram <- sapply(data.frame(predictions), FUN=mean)

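# same evaluation with the 4-gram model: predict word4 from the preceding trigram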
predictions <- sapply(testing_dataset_small, FUN=function(x) {
    x <- tolower(x)
    words <- unlist(strsplit(x, " "))
    if (length(words)>3) {
        results <- c()
        for (i in 1:(length(words)-3)) {
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            word3 <- gsub("[^a-z']","",words[i+2])
            word4 <- gsub("[^a-z']","",words[i+3])
            start <- Sys.time()
            prediction_df <- ngrams_4_freq[ngrams_4_freq['word1']==word1 & 
                                           ngrams_4_freq['word2']==word2 &
                                           ngrams_4_freq['word3']==word3,]
            prediction_df <- prediction_df[order(-prediction_df$P),'word4']
            prediction_3words <- prediction_df[1:3]
            prediction_2words <- prediction_df[1:2]
            prediction_1words <- prediction_df[1]
            end <- Sys.time()
            t <- as.numeric(end - start)
            if (!is.na(word4)) {
                results <- c(results,
                             word4 %in% prediction_3words,
                             word4 %in% prediction_2words,
                             word4 %in% prediction_1words,
                             is.na(prediction_3words[1])&is.na(prediction_3words[2])&is.na(prediction_3words[3]),
                             NA,
                             t)
            }
            results
        }
        results2 <- matrix(results, ncol=6, byrow=TRUE)
        results2_means <- unlist(lapply(data.frame(results2), FUN=mean, MARGIN=2))
        results2_means
    }
})

predictions <- matrix(unlist(predictions), ncol=6, byrow=TRUE)
predictions_accuracy_means_4_gram <- sapply(data.frame(predictions), FUN=mean)

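# same evaluation with the 5-gram model: predict word5 from the preceding tetragram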
predictions <- sapply(testing_dataset_small, FUN=function(x) {
    x <- tolower(x)
    words <- unlist(strsplit(x, " "))
    if (length(words)>4) {
        results <- c()
        for (i in 1:(length(words)-4)) {
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            word3 <- gsub("[^a-z']","",words[i+2])
            word4 <- gsub("[^a-z']","",words[i+3])
            word5 <- gsub("[^a-z']","",words[i+4])
            start <- Sys.time()
            prediction_df <- ngrams_5_freq[ngrams_5_freq['word1']==word1 & 
                                           ngrams_5_freq['word2']==word2 &
                                           ngrams_5_freq['word3']==word3 &
                                           ngrams_5_freq['word4']==word4,]
            prediction_df <- prediction_df[order(-prediction_df$P),'word5']
            prediction_3words <- prediction_df[1:3]
            prediction_2words <- prediction_df[1:2]
            prediction_1words <- prediction_df[1]
            end <- Sys.time()
            t <- as.numeric(end - start)
            if (!is.na(word5)) {
                results <- c(results,
                             word5 %in% prediction_3words,
                             word5 %in% prediction_2words,
                             word5 %in% prediction_1words,
                             is.na(prediction_3words[1])&is.na(prediction_3words[2])&is.na(prediction_3words[3]),
                             NA,
                             t)
            }
            results
        }
        results2 <- matrix(results, ncol=6, byrow=TRUE)
        results2_means <- unlist(lapply(data.frame(results2), FUN=mean, MARGIN=2))
        results2_means
    }
})

predictions <- matrix(unlist(predictions), ncol=6, byrow=TRUE)
predictions_accuracy_means_5_gram <- sapply(data.frame(predictions), FUN=mean)

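# summary table for the 2- to 5-gram models; the no-prediction flag in column 4 is converted to coverage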
results_final <- data.frame(matrix(c(predictions_accuracy_means_2_gram,
                                     predictions_accuracy_means_3_gram, 
                                     predictions_accuracy_means_4_gram,
                                     predictions_accuracy_means_5_gram), ncol=6, byrow=TRUE))

results_final[,4] <- 1 - results_final[,4]
results_final[,6] <- round(results_final[,6],5)
results_final[,1:5] <- round(results_final[,1:5],2)

colnames(results_final) <- c("Accuracy 3w","Accuracy 2w","Accuracy 1w","Coverage","Perplexity","Time (sec)")
rownames(results_final) <- c("2-grams","3-grams","4-grams","5-grams")
results_final

Linear interpolation, smoothing and back-off algorithms

Linear interpolation of the 1-gram, 2-gram and 3-gram models

#this chunk was run in a separate window; the results were then loaded and processed in the next chunk

set.seed(1234)

training_dataset_2 <- ngrams_3[sample(length(ngrams_3),length(ngrams_3)*0.0001)]
results <- data.frame(trigrams=training_dataset_2)

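# random search over interpolation weights: sample (w1, w2), set w3 = 1 - w1 - w2 and keep only combinations
# with w3 > 0; each combination is scored by its top-3 accuracy on a small sample of training trigrams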
w1 <- sample(seq(0.0000001,1,0.0000001),1000,replace=TRUE)
w2 <- sample(seq(0.0000001,1,0.0000001),1000,replace=TRUE)
w3 <- 1 - w1 - w2
weights <- data.frame(w1,w2,w3)
weights <- weights[weights['w3']>0,]

for (w in rownames(weights)) {
    col_results <- c()
    for (i in training_dataset_2) {
        # i <- "of parenting looks"
        # w<-1
        words <- unlist(strsplit(tolower(i)," "))
        head_bigram <- paste(words[1],words[2],sep=" ")
        
        if (head_bigram %in% ngrams_3_freq$bigram) {
            p_3 <- ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram,"P"]
            names(p_3) <- ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram,"word3"]
            p_3 <- sort(p_3)
            names(p_3) <- paste(head_bigram,names(p_3),sep=" ")
        } else {
            p_3 <- rep(0, dim(ngrams_3_freq)[1])
            names(p_3) <- paste(head_bigram,ngrams_3_freq$word3,sep=" ")
        }
        
        if (words[2] %in% ngrams_2_freq$word1) {
            p_2 <- ngrams_2_freq[ngrams_2_freq['word1']==words[2],"P"]
            names(p_2) <- ngrams_2_freq[ngrams_2_freq['word1']==words[2],"word2"]
            p_2 <- sort(p_2)
            names(p_2) <- paste(words[1],words[2],names(p_2),sep=" ")
        } else {
            p_2 <- rep(0, dim(ngrams_2_freq)[1])
            names(p_2) <- paste(words[1],words[2],ngrams_2_freq$word2,sep=" ")
        }
        
        p_1 <- rep(1/dim(ngrams_1_freq)[1], dim(ngrams_1_freq)[1])
        names(p_1) <- paste(words[1],words[2],ngrams_1_freq$ngram)
        
        p_1 <- p_1 * weights[w,'w1']
        p_2 <- p_2 * weights[w,'w2']
        p_3 <- p_3 * weights[w,'w3']
        
        final_p <- p_1
        m <- match(names(p_2), names(p_1))
        final_p[m] <- final_p[m] + p_2
        m <- match(names(p_3), names(final_p))
        final_p[m] <- final_p[m] + p_3
        
        final_p <- sort(final_p, decreasing=TRUE)[1:10]
        names(final_p) <- sapply(names(final_p), FUN=function(x){
          
          strsplit(x," ")[[1]][length(strsplit(x," ")[[1]])]
        
        })
        predictions <- data.frame(prediction=names(final_p), p=final_p)
        predictions['prediction'] <- as.character(predictions$prediction)
        predictions <- predictions[!duplicated(predictions$prediction),]
        predictions <- predictions[1:3,]
        if (words[3] %in% predictions$prediction) {
            col_results <- c(col_results,1)
        } else {
            col_results <- c(col_results,0)
        }
        # if (length(col_results)%%10==0) {print(length(col_results))}
    }
    results[,w] <- col_results
    print(w)
}

results['means',2:dim(results)[2]] <- lapply(results[,2:dim(results)[2]], FUN=mean)
weights <- rbind(c(NA,NA,NA),weights)
weights2 <- t(weights)
colnames(weights2) <- names(results)
results <- rbind(results,weights2)
write.csv(results,"interpolation_results_final.csv", row.names = TRUE)


results <- read.csv("interpolation_results_final.csv")
rownames(results) <- results$X
results <- results[,2:dim(results)[2]]

par(mfrow=c(1,2))
plot(as.numeric(results['means',]), 
     xlab='Index unsorted weight combinations',
     ylab='Accuracy on 3rd word')
plot(sort(as.numeric(results['means',])),
     xlab='Index sorted weight combinations',
     ylab='Accuracy on 3rd word')

means <- unlist(results['means',])
max_accuracy <- max(means[-1])
columns <- c(FALSE, means==max_accuracy)[-2]
max_Accuracy_weights <- t(results[c("means","w1","w2","w3"),columns])
highest_weights <- dim(max_Accuracy_weights)[1]


# Inputs

final_weights <- max_Accuracy_weights[1,c("w1","w2","w3")]

text <- tolower("I will be")

words <- unlist(strsplit(tolower(text)," "))

# Probabilities calculated using the 1- 2- and 3-gram models

p_trigrams <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1], words[2], sep=" ") &
                            ngrams_3_freq['word3']==words[3], "P"]
p_bigrams <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] &
                           ngrams_2_freq['word2']==words[3], "P"]
p_unigrams <- 1/length(dictionary)

# Weighted probabilities for linear interpolation 

p_unigrams_w <- p_unigrams * final_weights[1]
p_bigrams_w <- p_bigrams * final_weights[2]
p_trigrams_w <- p_trigrams * final_weights[3]

# Final probability

final_p <- p_unigrams_w + p_bigrams_w + p_trigrams_w

# Output table

values <- unname(c(p_unigrams,p_bigrams,p_trigrams,final_weights,
                   p_unigrams_w,p_bigrams_w,p_trigrams_w,
                   NA,NA,final_p))
interp_table <- matrix(values, ncol=3, byrow=TRUE)
colnames(interp_table) <- c("1-gram model","2-gram model","3-gram model")
rownames(interp_table) <- c("P","weights","weighted P","total P")
round(interp_table,3)


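# evaluation of the interpolated 1/2/3-gram model on the testing sample; perplexity is computed as the
# geometric mean of the inverse probability assigned to each observed trigram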
predictions <- sapply(testing_dataset_small, FUN=function(x) {
    x <- tolower(x)
    words <- unlist(strsplit(x, " "))
    if (length(words)>3) {
        results <- c()
        for (i in 1:(length(words)-3)) {
          
            # words <- c('i','will','be')
            # i<-1
          
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            word3 <- gsub("[^a-z']","",words[i+2])
            head_bigram <- paste(word1, word2, sep=" ")
            
            start <- Sys.time()
            
            if (head_bigram %in% ngrams_3_freq$bigram) {
                p_3 <- ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram,"P"]
                names(p_3) <- ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram,"word3"]
                p_3 <- sort(p_3)
                names(p_3) <- paste(head_bigram,names(p_3),sep=" ")
            } else {
                p_3 <- rep(0, dim(ngrams_3_freq)[1])
                names(p_3) <- paste(head_bigram,ngrams_3_freq$word3,sep=" ")
            }
            
            if (word2 %in% ngrams_2_freq$word1) {
                p_2 <- ngrams_2_freq[ngrams_2_freq['word1']==word2,"P"]
                names(p_2) <- ngrams_2_freq[ngrams_2_freq['word1']==word2,"word2"]
                p_2 <- sort(p_2)
                names(p_2) <- paste(head_bigram,names(p_2),sep=" ")
            } else {
                p_2 <- rep(0, dim(ngrams_2_freq)[1])
                names(p_2) <- paste(head_bigram,ngrams_2_freq$word2,sep=" ")
            }
            
            p_1 <- rep(1/length(dictionary), length(dictionary))
            names(p_1) <- paste(head_bigram,dictionary)
            
            p_1 <- p_1 * final_weights['w1']
            p_2 <- p_2 * final_weights['w2']
            p_3 <- p_3 * final_weights['w3']
            
            final_p <- p_1
            m <- match(names(p_2), names(p_1))
            final_p[m] <- final_p[m] + p_2
            m <- match(names(p_3), names(final_p))
            final_p[m] <- final_p[m] + p_3
            
            trigram_probability <- final_p[paste(word1,word2,word3,sep=" ")]
            
            if (is.na(trigram_probability)) {
              trigram_probability <- 1/length(dictionary) * final_weights['w1']
            }
            
            final_p <- sort(final_p, decreasing=TRUE)[1:10]
            names(final_p) <- sapply(names(final_p), FUN=function(x){
              strsplit(x," ")[[1]][length(strsplit(x," ")[[1]])]
            })
            
            end <- Sys.time()
            t <- as.numeric(end - start)
            
            if (!is.na(word3)) {
                results <- c(results,
                             word3 %in% names(final_p[1:3]),
                             word3 %in% names(final_p[1:2]),
                             word3 %in% names(final_p[1]),
                             unique(!is.na(unique(final_p))),
                             as.numeric(1/trigram_probability),
                             t)
            }
            results
        }
        results2 <- matrix(results, ncol=6, byrow=TRUE)
        results2_means <- unlist(apply(data.frame(results2), FUN=mean, MARGIN=2))
        results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
        results2_means
    }
})

predictions <- matrix(unlist(predictions), ncol=6, byrow=TRUE)
predictions_accuracy_means_interpolated <- sapply(data.frame(predictions), FUN=mean)
predictions_accuracy_means_interpolated[5] <- exp(sum(log(predictions[,5]))/length(predictions[,5]))

results_final_liniter <- data.frame(t(predictions_accuracy_means_interpolated))
results_final_liniter[,6] <- round(results_final_liniter[,6],5)
results_final_liniter[,1:5] <- round(results_final_liniter[,1:5],3)
results_final_liniter[,5] <- round(results_final_liniter[,5],0)


colnames(results_final_liniter) <- c("Accuracy 3w","Accuracy 2w","Accuracy 1w",
                                     "Coverage","Perplexity","Time (sec)")
rownames(results_final_liniter) <- c("Linear int.")
results_final_liniter
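
# note: the perplexity reported above is the geometric mean of the inverse probabilities assigned to the
# observed words, i.e. the usual exp of the negative mean log-probability; a minimal sketch with toy
# (assumed) probabilities:
p_observed <- c(0.20, 0.05, 0.50)               # toy probabilities, for illustration only
exp(sum(log(1/p_observed))/length(p_observed))  # geometric mean of 1/p, as computed above
exp(-mean(log(p_observed)))                     # standard perplexity formulation, same value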

Laplace smoothing

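# add-1 (Laplace) smoothing: observed n-grams get P = (count + 1) / (count(history) + V), while unobserved
# ones are assigned 1/V, where V is the number of 1-grams in the model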
aggregate_2_grams <- aggregate(ngrams_2_freq$freq, by=list(Category=ngrams_2_freq$word1), FUN=sum)

get_bigrams_p_laplace_smoothing <- function(x) {
  
  words <- unlist(strsplit(tolower(x)," ")[1])
  
  if (length(words)>1 & (!"" %in% words)) {
    
    head_word <- words[(length(words)-1)]
    tail_word <- words[length(words)]
    all_gram_length <- length(ngrams_1_freq$ngram)
    all_bigrams <- paste(head_word,ngrams_1_freq$ngram, sep=" ")
    
    #Named vector of unobserved 2-grams

    freq_0_bigrams <- all_bigrams[!all_bigrams %in% ngrams_2_freq$bigram]
    freq_0_bigrams_named_vector <- rep(0,length(freq_0_bigrams))
    names(freq_0_bigrams_named_vector) <- freq_0_bigrams

    #Named vector of observed 2-grams
    
    freq_non_0_bigrams <- as.character(ngrams_2_freq[ngrams_2_freq['word1']==head_word, 'bigram'])
    freq_non_0_bigrams_vector <- ngrams_2_freq[ngrams_2_freq['word1']==head_word, 'freq']
    names(freq_non_0_bigrams_vector) <- freq_non_0_bigrams

    #Calculate the named vector of sorted probabilities according to the MLE estimation with
    #add-1 smoothing (Laplace smoothing)
    freq_0_bigrams_named_vector_p <- (freq_0_bigrams_named_vector + 1)/all_gram_length
    freq_non_0_bigrams_vector_p <- (freq_non_0_bigrams_vector+1)/
                                   (aggregate_2_grams[aggregate_2_grams['Category']==head_word,'x'] + 
                                   all_gram_length)

    final_p <- sort(c(freq_0_bigrams_named_vector_p,
                      freq_non_0_bigrams_vector_p),decreasing=TRUE)
    
    if (!is.na(final_p[paste(head_word, tail_word, sep=" ")])) {
      perplexity <- 1/final_p[paste(head_word, tail_word, sep=" ")] 
    } else {
      perplexity <- length(dictionary)
    }
    
    list(final_p, perplexity)

  } else {
    list(NA,NA)
  }
}

example <- "will be"

results <- get_bigrams_p_laplace_smoothing(example)

results[[1]]['will be']

get_trigrams_p_laplace_smoothing <- function(x) {
  
  words <- unlist(strsplit(tolower(x)," ")[1])
  
  if (length(words)>2 & (!"" %in% words)) {
    
    head_bigram <- paste(words[(length(words)-2)], words[(length(words)-1)], sep=" ")
    tail_word <- words[length(words)]
    all_gram_length <- length(ngrams_1_freq$ngram)
    all_trigrams <- paste(head_bigram,ngrams_1_freq$ngram, sep=" ")
    
    #Named vector of unobserved 3-grams

    freq_0_trigrams <- all_trigrams[!all_trigrams %in% ngrams_3_freq$trigram]
    freq_0_trigrams_named_vector <- rep(0,length(freq_0_trigrams))
    names(freq_0_trigrams_named_vector) <- freq_0_trigrams
    freq_0_trigrams_named_vector_p <- (freq_0_trigrams_named_vector + 1)/all_gram_length

    #Named vector of observed 3-grams
    
    freq_non_0_trigrams <- as.character(ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram, 'trigram'])
    freq_non_0_trigrams_vector <- ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram, 'freq']
    names(freq_non_0_trigrams_vector) <- freq_non_0_trigrams
    
    freq_non_0_trigrams_vector_p <- (freq_non_0_trigrams_vector+1)/
                                    (aggregate_3_grams[aggregate_3_grams['Category']==head_bigram,'x'] + 
                                     all_gram_length)

    #Calculate the named vector of sorted probabilities according to the MLE estimation with
    #add-1 smoothing (Laplace smoothing)

    final_p <- sort(c(freq_0_trigrams_named_vector_p,
                      freq_non_0_trigrams_vector_p),decreasing=TRUE)
    
    perplexity <- 1/final_p[paste(head_bigram, tail_word, sep=" ")] 
    
    list(final_p, perplexity)

  } else {
    list(NA,NA)
  }
}

get_tetragrams_p_laplace_smoothing <- function(x) {
  
  words <- unlist(strsplit(tolower(x)," ")[1])
  
  if (length(words)>3 & (!"" %in% words)) {
    
    head_trigram <- paste(words[(length(words)-3)], words[(length(words)-2)], words[(length(words)-1)], sep=" ")
    tail_word <- words[length(words)]
    all_gram_length <- length(ngrams_1_freq$ngram)
    all_tetragrams <- paste(head_trigram,ngrams_1_freq$ngram, sep=" ")
    
    #Named vector of unobserved 4-grams

    freq_0_tetragrams <- all_tetragrams[!all_tetragrams %in% ngrams_4_freq$tetragram]
    freq_0_tetragrams_named_vector <- rep(0,length(freq_0_tetragrams))
    names(freq_0_tetragrams_named_vector) <- freq_0_tetragrams
    freq_0_tetragrams_named_vector_p <- (freq_0_tetragrams_named_vector + 1)/all_gram_length

    #Named vector of observed 4-grams
    
    freq_non_0_tetragrams <- as.character(ngrams_4_freq[ngrams_4_freq['trigram']==head_trigram, 'tetragram'])
    freq_non_0_tetragrams_vector <- ngrams_4_freq[ngrams_4_freq['trigram']==head_trigram, 'freq']
    names(freq_non_0_tetragrams_vector) <- freq_non_0_tetragrams

    freq_non_0_tetragrams_vector_p <- (freq_non_0_tetragrams_vector+1)/
                                      (aggregate_4_grams[aggregate_4_grams['Category']==head_trigram,'x'] + 
                                       all_gram_length)

    #Calculate the named vector of sorted probabilities according to the MLE estimation with
    #add-1 smoothing (Laplace smoothing)

    final_p <- sort(c(freq_0_tetragrams_named_vector_p,
                      freq_non_0_tetragrams_vector_p),decreasing=TRUE)
    
    perplexity <- 1/final_p[paste(head_trigram, tail_word, sep=" ")] 
    
    list(final_p, perplexity)

  } else {
    list(NA,NA)
  }
}

get_pentagrams_p_laplace_smoothing <- function(x) {
  
  words <- unlist(strsplit(tolower(x)," ")[1])
  
  if (length(words)>4 & (!"" %in% words)) {
    
    head_tetragram <- paste(words[(length(words)-4)], 
                            words[(length(words)-3)],
                            words[(length(words)-2)],
                            words[(length(words)-1)], sep=" ")
    
    tail_word <- words[length(words)]
    all_gram_length <- length(ngrams_1_freq$ngram)
    all_pentagrams <- paste(head_tetragram,ngrams_1_freq$ngram, sep=" ")
    
    #Named vector of unobserved 5-grams

    freq_0_pentagrams <- all_pentagrams[!all_pentagrams %in% ngrams_5_freq$pentagram]
    freq_0_pentagrams_named_vector <- rep(0,length(freq_0_pentagrams))
    names(freq_0_pentagrams_named_vector) <- freq_0_pentagrams
    freq_0_pentagrams_named_vector_p <- (freq_0_pentagrams_named_vector + 1)/all_gram_length

    #Named vector of observed 5-grams
    
    freq_non_0_pentagrams <- as.character(
      ngrams_5_freq[ngrams_5_freq['tetragram']==head_tetragram, 'pentagram']
    )
    freq_non_0_pentagrams_vector <- ngrams_5_freq[ngrams_5_freq['tetragram']==head_tetragram, 'freq']
    names(freq_non_0_pentagrams_vector) <- freq_non_0_pentagrams
    
    freq_non_0_pentagrams_vector_p <- (freq_non_0_pentagrams_vector+1)/
                                      (aggregate_5_grams[aggregate_5_grams['Category']==head_tetragram,'x']+
                                      all_gram_length)

    #Calculate the named vector of sorted probabilities according to the MLE estimation with
    #add-1 smoothing (Laplace smoothing)

    final_p <- sort(c(freq_0_pentagrams_named_vector_p,
                      freq_non_0_pentagrams_vector_p),decreasing=TRUE)
    
    perplexity <- 1/final_p[paste(head_tetragram, tail_word, sep=" ")] 
    
    list(final_p, perplexity)

  } else {
    list(NA,NA)
  }
}

aggregate_2_grams <- aggregate(ngrams_2_freq$freq, by=list(Category=ngrams_2_freq$word1), FUN=sum)
aggregate_3_grams <- aggregate(ngrams_3_freq$freq, by=list(Category=ngrams_3_freq$bigram), FUN=sum)
aggregate_4_grams <- aggregate(ngrams_4_freq$freq, by=list(Category=ngrams_4_freq$trigram), FUN=sum)
aggregate_5_grams <- aggregate(ngrams_5_freq$freq, by=list(Category=ngrams_5_freq$pentagram), FUN=sum)


# example <- "i will be"
# results <- get_trigrams_p_laplace_smoothing(example)

# example <- "thank you so much"
# results <- get_tetragrams_p_laplace_smoothing(example)

# example <- "thank you so much for"
# results <- get_pentagrams_p_laplace_smoothing(example)

# results

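# evaluation of the Laplace-smoothed 2- to 5-gram models on the testing sample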
predictions_2_grams <- sapply(testing_dataset_small, FUN=function(x) {
    words <- unlist(strsplit(tolower(x), " "))
    if (length(words)>2) {
        results <- c()
        for (i in 1:(length(words)-1)) {
          
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])

            start <- Sys.time()
            
            data <- get_bigrams_p_laplace_smoothing(paste(word1,
                                                          word2,sep=" "))
            end <- Sys.time()
            t <- as.numeric(end - start)
            
            predicted_words <- sapply(names(data[[1]]), FUN=function(x){
              predicted_words <- strsplit(x," ")[[1]]
              predicted_last_word <- predicted_words[length(predicted_words)]
              predicted_last_word
            })

            if (!is.na(word2) & (!"" %in% words)) {
              
                if (is.na(data[[2]])){data[[2]]<-1/length(dictionary)}
              
                results <- c(results,
                            word2 %in% predicted_words[1:3],
                            word2 %in% predicted_words[1:2],
                            word2 %in% predicted_words[1],
                            unique(!is.na(data[[1]])),
                            data[[2]],
                            t)
            }
    
        }
        results2 <- matrix(results, ncol=6, byrow=TRUE)
        results2_means <- unlist(lapply(data.frame(results2), FUN=mean, MARGIN=2, na.rm=TRUE))
        results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
        results2_means
    }
})

predictions_2_grams <- matrix(unlist(predictions_2_grams), ncol=6, byrow=TRUE)
predictions_2_grams_means <- sapply(data.frame(predictions_2_grams), FUN=mean, na.rm=TRUE)
predictions_2_grams_means[5] <- exp(sum(log(predictions_2_grams[,5]))/length(predictions_2_grams[,5]))

predictions_3_grams <- sapply(testing_dataset_small, FUN=function(x) {
    words <- unlist(strsplit(tolower(x), " "))
    if (length(words)>3) {
        results <- c()
        for (i in 1:(length(words)-2)) {
          
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            word3 <- gsub("[^a-z']","",words[i+2])

            start <- Sys.time()
            
            data <- get_trigrams_p_laplace_smoothing(paste(word1,
                                                           word2,
                                                           word3,sep=" "))
            
            end <- Sys.time()
            t <- as.numeric(end - start)
            
            predicted_words <- sapply(names(data[[1]]), FUN=function(x){
              predicted_words <- strsplit(x," ")[[1]]
              predicted_last_word <- predicted_words[length(predicted_words)]
              predicted_last_word
            })

            if (!is.na(word3) & (!"" %in% words)) {
              
                if (is.na(data[[2]])){data[[2]]<-1/length(dictionary)}

                results <- c(results,
                            word3 %in% predicted_words[1:3],
                            word3 %in% predicted_words[1:2],
                            word3 %in% predicted_words[1],
                            unique(!is.na(data[[1]])),
                            data[[2]],
                            t)
            }
            results
        }
        results2 <- matrix(results, ncol=6, byrow=TRUE)
        results2_means <- unlist(lapply(data.frame(results2), FUN=mean, MARGIN=2, na.rm=TRUE))
        results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
        results2_means
    }
})

predictions_3_grams <- matrix(unlist(predictions_3_grams), ncol=6, byrow=TRUE)
predictions_3_grams_means <- sapply(data.frame(predictions_3_grams), FUN=mean, na.rm=TRUE)
predictions_3_grams_means[5] <- exp(sum(log(predictions_3_grams[,5]))/length(predictions_3_grams[,5]))

predictions_4_grams <- sapply(testing_dataset_small, FUN=function(x) {
    words <- unlist(strsplit(tolower(x), " "))
    if (length(words)>4) {
        results <- c()
        for (i in 1:(length(words)-3)) {
          
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            word3 <- gsub("[^a-z']","",words[i+2])
            word4 <- gsub("[^a-z']","",words[i+3])

            start <- Sys.time()
            
            data <- get_tetragrams_p_laplace_smoothing(paste(word1,
                                                             word2,
                                                             word3,
                                                             word4,sep=" "))
            
            end <- Sys.time()
            t <- as.numeric(end - start)
            
            predicted_words <- sapply(names(data[[1]]), FUN=function(x){
              predicted_words <- strsplit(x," ")[[1]]
              predicted_last_word <- predicted_words[length(predicted_words)]
              predicted_last_word
            })

            if (!is.na(word4) & (!"" %in% words)) {
              
                if (is.na(data[[2]])){data[[2]]<-1/length(dictionary)}

                results <- c(results,
                            word4 %in% predicted_words[1:3],
                            word4 %in% predicted_words[1:2],
                            word4 %in% predicted_words[1],
                            unique(!is.na(data[[1]])),
                            data[[2]],
                            t)
            }
            results
        }
        results2 <- matrix(results, ncol=6, byrow=TRUE)
        results2_means <- unlist(lapply(data.frame(results2), FUN=mean, MARGIN=2, na.rm=TRUE))
        results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
        results2_means
    }
})

predictions_4_grams <- matrix(unlist(predictions_4_grams), ncol=6, byrow=TRUE)
predictions_4_grams_means <- sapply(data.frame(predictions_4_grams), FUN=mean, na.rm=TRUE)
predictions_4_grams_means[5] <- exp(sum(log(predictions_4_grams[,5]))/length(predictions_4_grams[,5]))

predictions_5_grams <- sapply(testing_dataset_small, FUN=function(x) {
    words <- unlist(strsplit(tolower(x), " "))
    if (length(words)>5) {
        results <- c()
        for (i in 1:(length(words)-4)) {
          
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            word3 <- gsub("[^a-z']","",words[i+2])
            word4 <- gsub("[^a-z']","",words[i+3])
            word5 <- gsub("[^a-z']","",words[i+4])

            start <- Sys.time()
            
            data <- get_pentagrams_p_laplace_smoothing(paste(word1,
                                                             word2,
                                                             word3,
                                                             word4,
                                                             word5,sep=" "))
            
            end <- Sys.time()
            t <- as.numeric(end - start)
            
            predicted_words <- sapply(names(data[[1]]), FUN=function(x){
              predicted_words <- strsplit(x," ")[[1]]
              predicted_last_word <- predicted_words[length(predicted_words)]
              predicted_last_word
            })

            if (!is.na(word5) & (!"" %in% words)) {
              
                if (is.na(data[[2]])){data[[2]]<-1/length(dictionary)}

                results <- c(results,
                            word5 %in% predicted_words[1:3],
                            word5 %in% predicted_words[1:2],
                            word5 %in% predicted_words[1],
                            unique(!is.na(data[[1]])),
                            data[[2]],
                            t)
            }
            results
        }
        results2 <- matrix(results, ncol=6, byrow=TRUE)
        results2_means <- unlist(lapply(data.frame(results2), FUN=mean, MARGIN=2, na.rm=TRUE))
        results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
        results2_means
    }
})

predictions_5_grams <- matrix(unlist(predictions_5_grams), ncol=6, byrow=TRUE)
predictions_5_grams_means <- sapply(data.frame(predictions_5_grams), FUN=mean, na.rm=TRUE)
predictions_5_grams_means[5] <- exp(sum(log(predictions_5_grams[,5]))/length(predictions_5_grams[,5]))

results_final_laplace <- data.frame(rbind(predictions_5_grams_means,
                                          predictions_4_grams_means,
                                          predictions_3_grams_means,
                                          predictions_2_grams_means))

results_final_laplace[,6] <- round(results_final_laplace[,6],5)
results_final_laplace[,1:5] <- round(results_final_laplace[,1:5],3)
results_final_laplace[,5] <- round(results_final_laplace[,5],0)

colnames(results_final_laplace) <- c("Accuracy 3w",
                                     "Accuracy 2w",
                                     "Accuracy 1w",
                                     "Coverage",
                                     "Perplexity",
                                     "Time (sec)")
rownames(results_final_laplace) <- c("Laplace 5g",
                                     "Laplace 4g",
                                     "Laplace 3g",
                                     "Laplace 2g")
results_final_laplace

Stupid back-off algorithm

# process words

example1 <- "Thank you so much for this"
example2 <- "I will be a mechanic"

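# stupid back-off: use the probability from the highest-order model that contains the n-gram,
# multiplied by a fixed 0.4 penalty for every back-off step; the final fallback is the uniform
# 1-gram probability 1/length(dictionary)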
stupid_back_off <- function(x) {

  words <- strsplit(tolower(x)," ")[[1]]
  
  #calculates probability based on 5-gram model
  
  p <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(words[1:4], collapse=" ") &
                     ngrams_5_freq['word5'] == words[5], 'P']
  
  #back-off to 4-gram model
  
  if (length(p)==0) {
    p <- 0.4*ngrams_4_freq[ngrams_4_freq['trigram']==paste(words[2:4], collapse=" ") &
                           ngrams_4_freq['word4'] == words[5], 'P']
  }
  
  #back-off to 3-gram model
  
  if (length(p)==0) {
    p <- 0.4*0.4*ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[3:4], collapse=" ") &
                 ngrams_3_freq['word3'] == words[5], 'P']
  }
  
  #back-off to 2-gram model
  
  if (length(p)==0) {
    p <- 0.4*0.4*0.4*ngrams_2_freq[ngrams_2_freq['word1']==words[4] &
                                   ngrams_2_freq['word2'] == words[5], 'P']
  }
  
  #back-off to 1-gram model
  
  if (length(p)==0) {
    p <- 0.4*0.4*0.4*0.4*(1/length(dictionary))
  }

  p
}

stupid_back_off(example1)
stupid_back_off(example2)

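# evaluation of the 5-to-1-gram stupid back-off: candidates from lower orders are penalized by 0.4 per
# back-off step and only added for tail words not already proposed by a higher order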
predictions_SBO_5 <- sapply(testing_dataset_small, FUN=function(x) {
  
    x <- tolower(x)
    words <- unlist(strsplit(x, " "))
    word1 <- NA
    word2 <- NA
    word3 <- NA
    word4 <- NA
    word5 <- NA
    results <- c()
    
    if (length(words)>2) {
      
        for (i in 1:(length(words)-2)) {
          
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            word3 <- gsub("[^a-z']","",words[i+2])
            word4 <- gsub("[^a-z']","",words[i+3])
            word5 <- gsub("[^a-z']","",words[i+4])

            cleaned_words <- c(word1,word2,word3,word4,word5)
            cleaned_words <- cleaned_words[!is.na(cleaned_words)]
            
            start <- Sys.time()
            
            p_5_gram <- c()
            tail_word_5_gram <- c()

            p_4_gram <- c()
            tail_word_4_gram <- c()

            p_3_gram <- c()
            tail_word_3_gram <- c()

            p_2_gram <- c()
            tail_word_2_gram <- c()

            L <- length(cleaned_words)

            if (L>4) {
              p_5_gram <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(cleaned_words[(L-4):(L-1)]
                                                                          ,collapse=" "),
                                        'P']
              if (length(p_5_gram)!=0) {
                tail_word_5_gram <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(cleaned_words[(L-4):(L-1)],
                                                                                    collapse=" "),'word5']
                names(p_5_gram) <- paste(paste(cleaned_words[(L-4):L],collapse=" "), tail_word_5_gram, sep=" ")
              }
            }
            if (L>3) {
              p_4_gram <- ngrams_4_freq[ngrams_4_freq['trigram']==paste(cleaned_words[(L-3):(L-1)],
                                                                        collapse=" ")&
                                        !ngrams_4_freq$word4 %in% tail_word_5_gram,
                                        'P']*0.4
              if (length(p_4_gram)!=0) {
                tail_word_4_gram <- ngrams_4_freq[ngrams_4_freq['trigram']==paste(cleaned_words[(L-3):(L-1)],
                                                                                    collapse=" ") &
                                                  !ngrams_4_freq$word4 %in% tail_word_5_gram,'word4']
                names(p_4_gram) <- paste(paste(cleaned_words[(L-3):L],collapse=" "), tail_word_4_gram, sep=" ")
              }
            }
            if (L>2) {
              p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
                                                                       collapse=" ") &
                                        !ngrams_3_freq$word3 %in% tail_word_4_gram &
                                        !ngrams_3_freq$word3 %in% tail_word_5_gram,
                                        'P']*0.4*0.4
              if (length(p_3_gram)!=0) {
                tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
                                                                                    collapse=" ") &
                                                  !ngrams_3_freq$word3 %in% tail_word_4_gram &
                                                  !ngrams_3_freq$word3 %in% tail_word_5_gram,'word3']
                names(p_3_gram) <- paste(paste(cleaned_words[(L-2):L],collapse=" "), tail_word_3_gram, sep=" ")
              }
            }
            if (L>1) {
              p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
                                        !ngrams_2_freq$word2 %in% tail_word_3_gram &
                                        !ngrams_2_freq$word2 %in% tail_word_4_gram &
                                        !ngrams_2_freq$word2 %in% tail_word_5_gram,
                                        'P']*0.4*0.4*0.4
              if (length(p_2_gram)!=0) {
                tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
                                                  !ngrams_2_freq$word2 %in% tail_word_3_gram &
                                                  !ngrams_2_freq$word2 %in% tail_word_4_gram &
                                                  !ngrams_2_freq$word2 %in% tail_word_5_gram,
                                                  'word2']
                names(p_2_gram) <- paste(paste(cleaned_words[(L-1):L],collapse=" "), tail_word_2_gram, sep=" ")
              }
            }

            tail_word_1_gram <- dictionary[!dictionary %in% tail_word_5_gram &
                                           !dictionary %in% tail_word_4_gram &
                                           !dictionary %in% tail_word_3_gram &
                                           !dictionary %in% tail_word_2_gram]
            p_1_gram <- 1/length(dictionary)
            p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))*0.4*0.4*0.4*0.4
            names(p_1_gram) <- tail_word_1_gram

            p_SBO <- sort(c(p_5_gram, p_4_gram, p_3_gram, p_2_gram, p_1_gram),decreasing=TRUE)
            end <- Sys.time()
            t <- as.numeric(end - start)
            
            predicted_words <- sapply(names(p_SBO), FUN=function(x){
              predicted_words <- strsplit(x," ")[[1]]
              predicted_last_word <- predicted_words[length(predicted_words)]
              predicted_last_word
            })

            names(p_SBO) <- predicted_words

            if (!cleaned_words[L] %in% dictionary) {
              results <- c(results,
                          cleaned_words[L] %in% predicted_words[1:3],
                          cleaned_words[L] %in% predicted_words[1:2],
                          cleaned_words[L] %in% predicted_words[1],
                          0,
                          1/length(dictionary),
                          t)
            } else {

              results <- c(results,
                          cleaned_words[L] %in% predicted_words[1:3],
                          cleaned_words[L] %in% predicted_words[1:2],
                          cleaned_words[L] %in% predicted_words[1],
                          unique(!is.na(predicted_words)),
                          1/p_SBO[cleaned_words[L]],
                          t)
            }

        }
        
    }
    
    if (length(results)!=0) {
      results2 <- matrix(results, ncol=6, byrow=TRUE)
      results2_means <- unlist(lapply(data.frame(results2), FUN=mean, MARGIN=2, na.rm=TRUE))
      results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
      results2_means
    }
})

predictions_SBO_5 <- matrix(unlist(predictions_SBO_5), ncol=6, byrow=TRUE)
predictions_SBO_5_means <- sapply(data.frame(predictions_SBO_5), FUN=mean, na.rm=TRUE)
predictions_SBO_5_means[5] <- exp(sum(log(predictions_SBO_5[,5]))/length(predictions_SBO_5[,5]))

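# same evaluation using only the 3-, 2- and 1-gram models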
predictions_SBO_3 <- sapply(testing_dataset_small, FUN=function(x) {

    x <- tolower(x)
    words <- unlist(strsplit(x, " "))
    word1 <- NA
    word2 <- NA
    word3 <- NA
    results <- c()
    
    if (length(words)>2) {
      
        for (i in 1:(length(words)-2)) {
          
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            word3 <- gsub("[^a-z']","",words[i+2])

            cleaned_words <- c(word1,word2,word3)
            cleaned_words <- cleaned_words[!is.na(cleaned_words)]
            
            start <- Sys.time()

            p_3_gram <- c()
            tail_word_3_gram <- c()

            p_2_gram <- c()
            tail_word_2_gram <- c()

            L <- length(cleaned_words)

            if (L>2) {
              p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],collapse=" "),
                                        'P']
              if (length(p_3_gram)!=0) {
                tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
                                                                                    collapse=" "),
                                                  'word3']
                names(p_3_gram) <- paste(paste(cleaned_words[(L-2):L],collapse=" "), tail_word_3_gram, sep=" ")
              }
            }
            if (L>1) {
              p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
                                        !ngrams_2_freq$word2 %in% tail_word_3_gram,
                                        'P']*0.4
              if (length(p_2_gram)!=0) {
                tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
                                                  !ngrams_2_freq$word2 %in% tail_word_3_gram,
                                                  'word2']
                names(p_2_gram) <- paste(paste(cleaned_words[(L-1):L],collapse=" "), tail_word_2_gram, sep=" ")
              }
            }

            tail_word_1_gram <- dictionary[!dictionary %in% tail_word_3_gram &
                                           !dictionary %in% tail_word_2_gram]
            p_1_gram <- 1/length(dictionary)*0.4*0.4
            p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))
            names(p_1_gram) <- tail_word_1_gram

            p_SBO <- sort(c(p_3_gram, p_2_gram, p_1_gram),decreasing=TRUE)
            end <- Sys.time()
            t <- as.numeric(end - start)

            predicted_words <- sapply(names(p_SBO), FUN=function(x){
              predicted_words <- strsplit(x," ")[[1]]
              predicted_last_word <- predicted_words[length(predicted_words)]
              predicted_last_word
            })

            names(p_SBO) <- predicted_words

            if (!cleaned_words[L] %in% dictionary) {
              results <- c(results,
                          cleaned_words[L] %in% predicted_words[1:3],
                          cleaned_words[L] %in% predicted_words[1:2],
                          cleaned_words[L] %in% predicted_words[1],
                          0,
                          1/length(dictionary),
                          t)
            } else {

              results <- c(results,
                          cleaned_words[L] %in% predicted_words[1:3],
                          cleaned_words[L] %in% predicted_words[1:2],
                          cleaned_words[L] %in% predicted_words[1],
                          unique(!is.na(predicted_words)),
                          1/p_SBO[cleaned_words[L]],
                          t)
            }

        }

    }

    if (length(results)!=0) {

      results2 <- matrix(results, ncol=6, byrow=TRUE)
      results2_means <- unlist(lapply(data.frame(results2), FUN=mean, MARGIN=2, na.rm=TRUE))
      results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
      results2_means

    }
})

predictions_SBO_3 <- matrix(unlist(predictions_SBO_3), ncol=6, byrow=TRUE)
predictions_SBO_3_means <- sapply(data.frame(predictions_SBO_3), FUN=mean, na.rm=TRUE)
predictions_SBO_3_means[5] <- exp(sum(log(predictions_SBO_3[,5]))/length(predictions_SBO_3[,5]))


results_final_SBO <- data.frame(t(cbind(predictions_SBO_5_means,
                                        predictions_SBO_3_means)))
results_final_SBO[,6] <- round(results_final_SBO[,6],5)
results_final_SBO[,1:5] <- round(results_final_SBO[,1:5],3)
results_final_SBO[,5] <- round(results_final_SBO[,5],0)


colnames(results_final_SBO) <- c("Accuracy 3w","Accuracy 2w","Accuracy 1w",
                             "Coverage","Perplexity","Time (sec)")
rownames(results_final_SBO) <- c("SBO 5->1g","SBO 3->1g")
results_final_SBO

Katz back-off algorithm

# process words

library(edgeR)

#calculate discounts based on the Good-Turing estimation for unobserved n-grams

discount_5_gram <- sum(goodTuring(ngrams_5_freq$freq)['proportion'][[1]])
discount_4_gram <- sum(goodTuring(ngrams_4_freq$freq)['proportion'][[1]])
discount_3_gram <- sum(goodTuring(ngrams_3_freq$freq)['proportion'][[1]])
discount_2_gram <- sum(goodTuring(ngrams_2_freq$freq)['proportion'][[1]])

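# worked example: Katz back-off estimate of P('be' | 'i will'); the 3-gram probabilities are discounted
# with the Good-Turing factor and the leftover mass is redistributed to the 2-gram and 1-gram models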
example <- "i will be"

words <- unlist(strsplit(example," "))

#discounted 3-gram probabilities

p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1:2],collapse=" "), 'P'] * discount_3_gram
tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1:2],collapse=" "), 'word3']
names(p_3_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_3_gram)
discounted_P_3_gram <- 1 - sum(p_3_gram)

#discounted 2-gram probabilities

p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] &
                          !ngrams_2_freq$word2 %in% tail_word_3_gram, 'P']
p_2_gram <- p_2_gram * discounted_P_3_gram/sum(p_2_gram) * discount_2_gram
tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] & 
                                  !ngrams_2_freq$word2 %in% tail_word_3_gram, 'word2']
names(p_2_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_2_gram, sep=" ")
discounted_P_2_gram <- 1 - sum(p_2_gram) - sum(p_3_gram)

#discounted 1-gram probabilities

tail_word_1_gram <- dictionary[!dictionary %in% tail_word_3_gram &
                               !dictionary %in% tail_word_2_gram]
p_1_gram <- 1/length(dictionary) * discounted_P_2_gram/sum(1/length(dictionary) * length(tail_word_1_gram))
p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))
names(p_1_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_1_gram)

#final probability vector

p_final <- c(p_3_gram, p_2_gram, p_1_gram)
p_final['i will be']


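# evaluation of the Katz back-off model: each order is discounted with its Good-Turing factor and the
# leftover probability mass is redistributed to the next lower order, excluding tail words already covered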
predictions_KBO_5 <- sapply(testing_dataset_small, FUN=function(x) {
  
    x <- tolower(x)
    words <- unlist(strsplit(x, " "))
    
    results <- c()
    word1 <- NA
    word2 <- NA
    word3 <- NA
    word4 <- NA
    word5 <- NA
    
    if (length(words)>2) {
        
        for (i in 1:(length(words)-2)) {
          
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            word3 <- gsub("[^a-z']","",words[i+2])
            word4 <- gsub("[^a-z']","",words[i+3])
            word5 <- gsub("[^a-z']","",words[i+4])

            cleaned_words <- c(word1,word2,word3,word4,word5)
            cleaned_words <- cleaned_words[!is.na(cleaned_words)]
            
            start <- Sys.time()
            
            p_5_gram <- c()
            tail_word_5_gram <- c()
            discounted_P_5_gram <- 1
            
            p_4_gram <- c()
            tail_word_4_gram <- c()
            discounted_P_4_gram <- 1
            
            p_3_gram <- c()
            tail_word_3_gram <- c()
            discounted_P_3_gram <- 1
            
            p_2_gram <- c()
            tail_word_2_gram <- c()
            discounted_P_2_gram <- 1
            
            L <- length(cleaned_words)

            if (L>4) {
              p_5_gram <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(cleaned_words[(L-4):(L-1)],
                                                                          collapse=" "), 
                                        'P']
              p_5_gram <- p_5_gram * discount_5_gram
              if (length(p_5_gram)!=0) {
                tail_word_5_gram <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(cleaned_words[(L-4):(L-1)],
                                                                                    collapse=" "),'word5']
                names(p_5_gram) <- paste(paste(cleaned_words[(L-4):(L-1)],collapse=" "), 
                                         tail_word_5_gram, sep=" ")
                discounted_P_5_gram <- 1 - sum(p_5_gram)
              } 
            }
            if (L>3) {
              p_4_gram <- ngrams_4_freq[ngrams_4_freq['trigram']==paste(cleaned_words[(L-3):(L-1)],
                                                                        collapse=" ") &
                                        !ngrams_4_freq$word4 %in% tail_word_5_gram, 
                                        'P']
              p_4_gram <- p_4_gram *  discounted_P_5_gram/sum(p_4_gram) * discount_4_gram
              if (length(p_4_gram)!=0) {
                tail_word_4_gram <- ngrams_4_freq[ngrams_4_freq['trigram']==paste(cleaned_words[(L-3):(L-1)],
                                                                                    collapse=" ") &
                                                  !ngrams_4_freq$word4 %in% tail_word_5_gram,'word4']
                names(p_4_gram) <- paste(paste(cleaned_words[(L-3):(L-1)],collapse=" "), 
                                         tail_word_4_gram, sep=" ")
                discounted_P_4_gram <- 1 - sum(p_4_gram) - sum(p_5_gram)
              } 
            }
            if (L>2) {
              p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
                                                                       collapse=" ") &
                                        !ngrams_3_freq$word3 %in% tail_word_4_gram &
                                        !ngrams_3_freq$word3 %in% tail_word_5_gram, 
                                        'P']
              p_3_gram <- p_3_gram *  discounted_P_4_gram/sum(p_3_gram) * discount_3_gram
              if (length(p_3_gram)!=0) {
                tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
                                                                                    collapse=" ") &
                                                  !ngrams_3_freq$word3 %in% tail_word_4_gram &
                                                  !ngrams_3_freq$word3 %in% tail_word_5_gram,'word3']
                names(p_3_gram) <- paste(paste(cleaned_words[(L-2):(L-1)],
                                               collapse=" "), tail_word_3_gram, sep=" ")
                discounted_P_3_gram <- 1 - sum(p_3_gram) - sum(p_4_gram) - sum(p_5_gram)
              } 
            }
            if (L>1) {
              p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
                                        !ngrams_2_freq$word2 %in% tail_word_3_gram &
                                        !ngrams_2_freq$word2 %in% tail_word_4_gram &
                                        !ngrams_2_freq$word2 %in% tail_word_5_gram, 
                                        'P']
              p_2_gram <- p_2_gram *  discounted_P_3_gram/sum(p_2_gram) * discount_2_gram
              if (length(p_2_gram)!=0) {
                tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
                                                  !ngrams_2_freq$word2 %in% tail_word_3_gram &
                                                  !ngrams_2_freq$word2 %in% tail_word_4_gram &
                                                  !ngrams_2_freq$word2 %in% tail_word_5_gram, 
                                                  'word2']
                names(p_2_gram) <- paste(cleaned_words[L-1], tail_word_2_gram, sep=" ")
                discounted_P_2_gram <- 1 - sum(p_2_gram) - sum(p_3_gram) - sum(p_4_gram) - sum(p_5_gram)
              } 
            }
             
            tail_word_1_gram <- dictionary[!dictionary %in% tail_word_5_gram &
                                           !dictionary %in% tail_word_4_gram &
                                           !dictionary %in% tail_word_3_gram &
                                           !dictionary %in% tail_word_2_gram]
            p_1_gram <- 1/length(dictionary) * discounted_P_2_gram/sum(1/length(dictionary) * 
                        length(tail_word_1_gram))
            p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))
            names(p_1_gram) <- tail_word_1_gram

            p_KBO <- sort(c(p_5_gram, p_4_gram, p_3_gram, p_2_gram, p_1_gram),decreasing=TRUE)
            end <- Sys.time()
            t <- as.numeric(end - start)
            
            predicted_words <- sapply(names(p_KBO), FUN=function(x){
              predicted_words <- strsplit(x," ")[[1]]
              predicted_last_word <- predicted_words[length(predicted_words)]
              predicted_last_word
            })
            
            names(p_KBO) <- predicted_words
            
            if (!cleaned_words[L] %in% dictionary) {
              results <- c(results,
                          cleaned_words[L] %in% predicted_words[1:3],
                          cleaned_words[L] %in% predicted_words[1:2],
                          cleaned_words[L] %in% predicted_words[1],
                          0,
                          1/length(dictionary),
                          t)
            } else {
              
              results <- c(results,
                          cleaned_words[L] %in% predicted_words[1:3],
                          cleaned_words[L] %in% predicted_words[1:2],
                          cleaned_words[L] %in% predicted_words[1],
                          unique(!is.na(predicted_words)),
                          1/p_KBO[cleaned_words[L]],
                          t)
            }
            
        }
        
    }
    if (length(results)) {
      results2 <- matrix(results, ncol=6, byrow=TRUE)
      results2_means <- sapply(data.frame(results2), FUN=mean, na.rm=TRUE)
      # column 5 holds inverse word probabilities: summarize with a geometric mean (perplexity)
      results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
      results2_means
    }
})

predictions_KBO_5 <- matrix(unlist(predictions_KBO_5), ncol=6, byrow=TRUE)
predictions_KBO_5_means <- sapply(data.frame(predictions_KBO_5), FUN=mean, na.rm=TRUE)
predictions_KBO_5_means[5] <- exp(sum(log(predictions_KBO_5[,5]))/length(predictions_KBO_5[,5]))

predictions_KBO_3 <- sapply(testing_dataset_small, FUN=function(x) {
  
    x <- tolower(x)
    words <- unlist(strsplit(x, " "))
    
    results <- c()
    word1 <- NA
    word2 <- NA
    word3 <- NA
    
    if (length(words)>2) {

        for (i in 1:(length(words)-2)) {
          
            word1 <- gsub("[^a-z']","",words[i])
            word2 <- gsub("[^a-z']","",words[i+1])
            word3 <- gsub("[^a-z']","",words[i+2])

            cleaned_words <- c(word1,word2,word3)
            cleaned_words <- cleaned_words[!is.na(cleaned_words)]
            
            p_3_gram <- c()
            tail_word_3_gram <- c()
            discounted_P_3_gram <- 1
            
            p_2_gram <- c()
            tail_word_2_gram <- c()
            discounted_P_2_gram <- 1
            
            L <- length(cleaned_words)
            
            start <- Sys.time()

            if (L>2) {
              p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
                                                                       collapse=" "), 
                                        'P']
              p_3_gram <- p_3_gram * discount_3_gram
              if (length(p_3_gram)!=0) {
                tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
                                                                                    collapse=" "),
                                                  'word3']
                names(p_3_gram) <- paste(paste(cleaned_words[(L-2):(L-1)],collapse=" "),
                                         tail_word_3_gram, sep=" ")
                discounted_P_3_gram <- 1 - sum(p_3_gram)
              } 
            }
            if (L>1) {
              p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
                                        !ngrams_2_freq$word2 %in% tail_word_3_gram, 
                                        'P']
              p_2_gram <- p_2_gram *  discounted_P_3_gram/sum(p_2_gram) * discount_2_gram
              if (length(p_2_gram)!=0) {
                tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
                                                  !ngrams_2_freq$word2 %in% tail_word_3_gram, 
                                                  'word2']
                names(p_2_gram) <- paste(cleaned_words[L-1], tail_word_2_gram, sep=" ")
                discounted_P_2_gram <- 1 - sum(p_2_gram) - sum(p_3_gram)
              } 
            }
             
            tail_word_1_gram <- dictionary[!dictionary %in% tail_word_3_gram &
                                           !dictionary %in% tail_word_2_gram]
            p_1_gram <- 1/length(dictionary) * discounted_P_2_gram/sum(1/length(dictionary) * 
                        length(tail_word_1_gram))
            p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))
            names(p_1_gram) <- tail_word_1_gram

            p_KBO <- sort(c(p_3_gram, p_2_gram, p_1_gram),decreasing=TRUE)
            end <- Sys.time()
            t <- as.numeric(end - start)
            
            predicted_words <- sapply(names(p_KBO), FUN=function(x){
              predicted_words <- strsplit(x," ")[[1]]
              predicted_last_word <- predicted_words[length(predicted_words)]
              predicted_last_word
            })
            
            names(p_KBO) <- predicted_words
            
            
            if (!cleaned_words[L] %in% dictionary) {
              results <- c(results,
                          cleaned_words[L] %in% predicted_words[1:3],
                          cleaned_words[L] %in% predicted_words[1:2],
                          cleaned_words[L] %in% predicted_words[1],
                          0,
                          1/length(dictionary),
                          t)
            } else {
              
              results <- c(results,
                          cleaned_words[L] %in% predicted_words[1:3],
                          cleaned_words[L] %in% predicted_words[1:2],
                          cleaned_words[L] %in% predicted_words[1],
                          unique(!is.na(predicted_words)),
                          1/p_KBO[cleaned_words[L]],
                          t)
            }
            
        }
        
    }
    if (length(results)) {
      results2 <- matrix(results, ncol=6, byrow=TRUE)
      results2_means <- sapply(data.frame(results2), FUN=mean, na.rm=TRUE)
      # column 5 holds inverse word probabilities: summarize with a geometric mean (perplexity)
      results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
      results2_means
    }
})

predictions_KBO_3 <- matrix(unlist(predictions_KBO_3), ncol=6, byrow=TRUE)
predictions_KBO_3_means <- sapply(data.frame(predictions_KBO_3), FUN=mean, na.rm=TRUE)
predictions_KBO_3_means[5] <- exp(sum(log(predictions_KBO_3[,5]))/length(predictions_KBO_3[,5]))


results_final_KBO <- data.frame(t(cbind(predictions_KBO_5_means,
                                        predictions_KBO_3_means)))
results_final_KBO[,6] <- round(results_final_KBO[,6],5)
results_final_KBO[,1:5] <- round(results_final_KBO[,1:5],3)
results_final_KBO[,5] <- round(results_final_KBO[,5],0)


colnames(results_final_KBO) <- c("Accuracy 3w","Accuracy 2w","Accuracy 1w",
                             "Coverage","Perplexity","Time (sec)")
rownames(results_final_KBO) <- c("KBO 5->1g","KBO 3->1g")
results_final_KBO

Conclusions

data.frame(rbind(results_final,
                 results_final_liniter,
                 results_final_laplace,
                 results_final_SBO,
                 results_final_KBO))
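
The final data frame collects the accuracy, coverage, perplexity and prediction-time figures of all the models evaluated in this report into a single table, so that the different smoothing and back-off strategies can be compared side by side when selecting the model for the Pr3d1ct0 app.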