The goal of the Data Science Capstone is to build statistical models for text prediction based on the SwiftKey US dataset. The models will then be deployed via a publicly available Shiny app. The current report details some basic features of the dataset and the strategy for model building. The analyses described in this summary are the basis for the development of the Shiny app Pr3d1ct0, already available at filippo-ciceri.shinyapps.io/Pr3d1ct0/. A quick introduction to Pr3d1ct0 is also available here.
The SwiftKey US dataset provided for the analysis is divided into 3 sections (blogs, news and Twitter). This section summarizes some of the basic features of each dataset.
dataset     lines      max line length   mean line length   words (total)   words (characters only)   words (digits only)   words (characters + punctuation)   words (digits + punctuation)
blogs       899,288             40,835            231.696      37,334,131                36,778,716               157,502                          5,402,180                        253,862
news         77,259              5,760           203.0024       2,643,969                 2,569,108                23,974                            415,828                         40,207
twitter   2,360,148                213           68.80281      30,373,545                29,284,611               191,790                          5,351,139                        313,916
40% of each SwiftKey dataset (blog, news, Twitter) was selected for building the statistical models, while the rest of the data (testing dataset) was reserved for performance evaluation.
First, the text was converted to lower case, removing any sequence of characters containing digits or punctuation marks.
Next, profanities were removed, based on a public list (prophanities_list.txt) commonly used for spam filtering and parental control.
After these filtering steps the total number of words in the training dataset is 13,743,413.
The frequency of each word in the training dataset (Freq) was obtained using the table function of R. The probability (Maximum Likelihood Estimation, P) and the cumulative probability (cum_P) were also calculated.
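As a minimal sketch (assuming all_words is the vector of cleaned training-set tokens, as in the code reported at the end of this document), these quantities can be computed as follows:
# Word frequencies, MLE probabilities and cumulative coverage (sketch)
word_frequency_df <- data.frame(sort(table(all_words), decreasing = TRUE))
word_frequency_df$P <- word_frequency_df$Freq / sum(word_frequency_df$Freq)  # MLE probability
word_frequency_df$cum_P <- cumsum(word_frequency_df$P)                       # cumulative coverage
sum(word_frequency_df$cum_P <= 0.5)   # number of top words covering 50% of the dataset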
The words in the training dataset are distributed as follows:
132 words are providing 50% dataset coverage
6,663 words are providing 90% dataset coverage
18,557 words are providing 95% dataset coverage
The list of the 10 most common words is presented here below.
all_words Freq
1 the 494955
2 to 370589
3 a 296345
4 i 295004
5 and 228001
6 you 223462
7 in 191727
8 of 190306
9 for 175610
10 is 160774
Non-English words were removed from the dataset to speed up the analyses. This was done by comparison against a dictionary of the 10,000 most common words in the English language.
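A minimal sketch of this filtering step (english_dict is assumed to hold the 10,000-word reference dictionary loaded in the code at the end of this document):
# Estimate the OOV percentage and keep only in-vocabulary words (sketch)
p_OOV <- mean(!all_words %in% english_dict) * 100   # percentage of OOV tokens
all_words <- all_words[all_words %in% english_dict] # drop OOV words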
The percentage of out-of-vocabulary (OOV) words in the training dataset is 16.09%. After removing OOV words, the word distribution is:
74 words for 50% dataset coverage
1,779 words for 90% dataset coverage
3,085 words for 95% dataset coverage
Another way to compress the original dataset is to stem words, removing suffixes such as “ing”, “ed”, etc. For example, “working”, “worked” and “works” can be stemmed down to the shorter sequence of characters “work”. This can be easily done via the SnowballC library.
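A minimal stemming sketch with SnowballC (reusing the all_words vector defined above):
library(SnowballC)
# Stem every word and inspect the most frequent stems (sketch)
all_words_stemmed <- wordStem(all_words, language = "en")
head(sort(table(all_words_stemmed), decreasing = TRUE), 10)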
After stemming:
68 words are providing 50% dataset coverage
1,132 words are providing 90% dataset coverage
1,876 words are providing 95% dataset coverage
However, I believe that this excessive simplification may be detrimental to the accuracy of the models. For this reason, the ngram models will be built without stemming the training dataset (see next section for more details).
Applying the chain rule of probability, the probability of a sequence of words (\(P(w_1...w_n)\)) can be computed as
\(P(w_1^n)=P(w_1)P(w_2|w_1)P(w_3|w_1^2)...P(w_n|w_1^{n-1})=\prod_{k = 1}^{n}P(w_k|w_1^{k-1})\)
With the Markov assumption that the probability of each word depends only on the previous one we can simplify the previous formula to
\(P(w_1^n)\approx\prod_{k = 1}^{n}P(w_k|w_{k-1})\)
where \(n\) is the number of words in the sequence.
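For example, under this assumption the probability of the short sequence “i will be” factors as
\(P(\text{i will be})\approx P(\text{i})\,P(\text{will}|\text{i})\,P(\text{be}|\text{will})\)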
This is the basis for building a prediction model based on pairs of words (2-grams).
More complex models (3-grams, 4-grams, 5-grams) are based on similar assumptions.
\(P(w_1^n)\approx\prod_{k = 1}^{n}P(w_k|w_{k-1}w_{k-2})\)
\(P(w_1^n)\approx\prod_{k = 1}^{n}P(w_k|w_{k-1}w_{k-2}w_{k-3})\)
\(P(w_1^n)\approx\prod_{k = 1}^{n}P(w_k|w_{k-1}w_{k-2}w_{k-3}w_{k-4})\)
The 1-gram model can be defined as the frequency of single words across the dataset. The probability of each single word is calculated according to the Maximum Likelihood Estimation using the table function.
The number of single words in the 1-gram model is 8,155.
The 2-gram model is calculated from the frequency of each pair of words (2-grams) in the dataset. The list of 2-grams is obtained using the gregexpr("([a-z']+ [a-z']+)", text) command to select every combination of 2 sequences of characters separated by a space.
The total number of 2-grams in the training dataset is 6,785,482 (2,047,706 unique 2-grams).
After removing 2-grams containing profanities and OOV words, the number of 2-grams is 4,793,982 (842,593 unique 2-grams). The frequency of each 2-gram is stored in a dataframe for further manipulation.
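A minimal sketch of this extraction and counting step (text_training is assumed to be the lower-cased, collapsed training text used in the code at the end of this document):
# Extract 2-grams and count their occurrences (sketch)
ngrams_2 <- regmatches(text_training, gregexpr("([a-z']+ [a-z']+)", text_training))[[1]]
ngrams_2_freq <- data.frame(table(ngrams_2))
names(ngrams_2_freq) <- c("bigram", "freq")
ngrams_2_freq <- ngrams_2_freq[order(ngrams_2_freq$freq, decreasing = TRUE), ]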
The 2-grams in the training dataset are distributed as follows:
9,895 2-grams for 50% dataset coverage
363,194 2-grams for 90% dataset coverage
602,893 2-grams for 95% dataset coverage
The list of the 10 most common 2-grams is here below.
bigram freq
356477 in the 20957
500664 of the 17100
268981 for the 16728
509722 on the 12143
745652 to the 11124
742024 to be 10744
709342 thanks for 9582
58619 at the 9340
347207 i love 7835
296148 going to 7690
The plot above suggests that 2-gram frequencies follow a Zipf distribution, where the vast majority of the 2-grams are used only rarely within the training dataset. In order to reduce the size of the model, such 2-grams will be removed because they provide little additional predictive power (their probability is close to \(1/V\), the probability of extracting a random word from the dictionary, where V is the length of the 1-gram dictionary). The process of removing low-probability n-grams is generally referred to as “n-gram pruning”.
freq 2-grams removed N reduction
1 < 1 0 0.0000000
2 < 2 525651 0.1096481
3 < 3 769655 0.1605461
4 < 4 924377 0.1928203
5 < 5 1042109 0.2173786
6 < 6 1137554 0.2372879
7 < 7 1218716 0.2542179
8 < 8 1289087 0.2688969
9 < 9 1350007 0.2816045
10 < 10 1405447 0.2931690
The table above shows that removing 2-grams appearing fewer than 4 times in the dataset achieves a compression of approximately 19.28%. 2-grams appearing fewer than 4 times were therefore removed from the dataset.
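The pruning itself is a one-line filter on the frequency dataframe, for example:
# Keep only 2-grams observed at least 4 times (sketch)
ngrams_2_freq <- ngrams_2_freq[ngrams_2_freq$freq > 3, ]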
The bigram model is based on the conditional probability of a word \(w_{n}\), given the previous word \(w_{n-1}\). Via Maximum Likelihood Estimation, the probability of each pair of words \(w_{n-1}w_n\) is calculated from the observed frequencies in the training dataset and stored in a dataframe structure. The formula used for calculating bigram probabilities from observed frequencies in the dataset is:
\(P(w_{n}|w_{n-1}) = \frac{C(w_{n-1}w_n)}{\sum_{w}C(w_{n-1}w)}=\frac{C(w_{n-1}w_n)}{C(w_{n-1})}\)
Where \(P(w_{n}|w_{n-1})\) represents the conditional probability of the 2-gram \(w_{n-1}w_n\) and \(C(w_{n-1}w_n)\) the frequency of the same 2-gram in the training dataset.
The dataframe containing all the observed \(w_{n-1}w_n\) 2-grams and their probabilities occupies 15.9 Mb. The dataframe can be cast into a matrix of 8155x8155 cells using the function acast from the reshape2 package. This matrix covers every possible combination of two words, most of them with an assigned probability of 0 because they were not observed in the training dataset. As expected, this matrix is considerably larger (334 Mb) than the dataframe of observed bigrams. The next sections will discuss in detail how to handle the probability of ngrams not observed in the training dataset.
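A minimal sketch of the probability calculation and of the casting step (word1 and word2 are the two words of each observed bigram, as in the dataframe built in the code at the end of this document; acast comes from reshape2):
library(reshape2)
# MLE probability: C(w_{n-1} w_n) / C(w_{n-1}), with C(w_{n-1}) summed over observed 2-grams (sketch)
ngrams_2_freq$word1 <- sapply(strsplit(as.character(ngrams_2_freq$bigram), " "), `[`, 1)
ngrams_2_freq$word2 <- sapply(strsplit(as.character(ngrams_2_freq$bigram), " "), `[`, 2)
aggregate_1_gram <- aggregate(ngrams_2_freq$freq, by = list(Category = ngrams_2_freq$word1), FUN = sum)
ngrams_2_freq <- merge(ngrams_2_freq, aggregate_1_gram, by.x = "word1", by.y = "Category")
ngrams_2_freq$P <- ngrams_2_freq$freq / ngrams_2_freq$x
# Full (mostly empty) word1-by-word2 probability matrix
bigram_matrix <- acast(ngrams_2_freq, word1 ~ word2, value.var = "P")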
The 3-gram model is calculated from the frequency of 3-word sequences (3-grams) in the dataset. The sequence of steps is as described in the previous section.
The total number of 3-grams in the training dataset is 4,450,527 (3,201,840 unique 3-grams).
3-grams containing profanities and OOV words were removed (2,664,145 unique 3-grams after filtering), before calculating each frequency and related probability (Maximum Likelihood Estimation).
The 3-grams in our dataset are distributed as follows:
300,594 3-grams for 50% dataset coverage
1,366,252 3-grams for 90% dataset coverage
1,499,459 3-grams for 95% dataset coverage
The list of the 10 most common 3-grams is here below.
trigram freq
1264508 thanks for the 3931
1263405 thank you for 1417
626690 i love you 1359
779107 looking forward to 1357
633140 i want to 1103
17047 a lot of 1084
515264 going to be 1073
975177 one of the 1018
462364 for the follow 934
627880 i need to 930
As in the case of 2-grams, the 3-gram model was also pruned to remove unnecessary information.
freq 3-grams removed N reduction
1 < 1 0 0.0000000
2 < 2 1361096 0.5108941
3 < 3 1661186 0.6235344
4 < 4 1796588 0.6743582
5 < 5 1884808 0.7074720
6 < 6 1947718 0.7310856
7 < 7 1997548 0.7497895
8 < 8 2037686 0.7648555
9 < 9 2071246 0.7774524
10 < 10 2100244 0.7883370
The table above shows that removing 3-grams appearing less than 4 times in the dataset achieves a compression of approximately 67.44%. The database was pruned accordingly.
The 3-gram model is based on the probability of a word \(w_n\), given the two preceding words \(w_{n-2}w_{n-1}\). The dataframe of observed 3-gram frequencies can be converted into a matrix containing 2-grams as rows and tail words as columns, as described before for the 2-gram model. The formula used for computing probabilities is:
\(P(w_{n}|w_{n-1}w_{n-2}) = \frac{C(w_{n-2}w_{n-1}w_n)}{\sum_{w}C(w_{n-2}w_{n-1}w)}=\frac{C(w_{n-2}w_{n-1}w_n)}{C(w_{n-2}w_{n-1})}\)
Where \(P(w_{n}|w_{n-1}w_{n-2})\) represents the conditional probability of the 3-gram \(w_{n-2}w_{n-1}w_n\) and \(C(w_{n-2}w_{n-1}w_n)\) the frequency of the same 3-gram in the training dataset.
The size of the dataframe containing all the observed 3-grams is 11.4 Mb; the full 2-grams-vs-tail-words matrix occupies 611.1 Mb.
The 4-gram model is calculated from the frequency of 4-word sequences (4-grams) in the dataset. The sequence of steps is very similar to the one described in the previous sections.
The total number of 4-grams in the training dataset is 1,670,158 (1,462,539 unique 4-grams), distributed as follows:
627,460 4-grams for 50% dataset coverage
1,295,523 4-grams for 90% dataset coverage
1,379,031 4-grams for 95% dataset coverage
The list of the 10 most common 4-grams is here below.
tetragram freq
1099342 thanks for the follow 887
1097347 thank you for the 441
1097815 thank you so much 310
605384 is going to be 278
385858 for the first time 267
511629 hope to see you 195
432248 going to be a 186
1099407 thanks for the mention 181
557041 i wish i could 173
565026 if you want to 173
freq 4-grams removed N reduction
1 < 1 0 0.0000000
2 < 2 1362667 0.8158911
3 < 3 1503975 0.9004986
4 < 4 1544211 0.9245898
5 < 5 1566779 0.9381023
6 < 6 1581594 0.9469727
7 < 7 1592034 0.9532236
8 < 8 1600042 0.9580183
9 < 9 1606738 0.9620275
10 < 10 1612111 0.9652446
Most of the 4-grams appear only once in the training dataset. Removing those 4-grams achieves a compression of 81.59%.
Probabilities were calculated from the frequencies observed in the dataset according to the formula
\(P(w_{n}|w_{n-1}w_{n-2}w_{n-3}) = \frac{C(w_{n-3}w_{n-2}w_{n-1}w_n)}{\sum_{w}C(w_{n-3}w_{n-2}w_{n-1}w)}=\frac{C(w_{n-3}w_{n-2}w_{n-1}w_n)}{C(w_{n-3}w_{n-2}w_{n-1})}\)
The 5-gram model is calculated from the frequency of 5-word sequences (5-grams) in the training dataset. The sequence of steps is very similar to the one described in the previous sections.
The total number of 5-grams in the training dataset is 1,118,268 (1,069,577 unique 5-grams), distributed as follows:
510,443 5-grams for 50% dataset coverage
957,750 5-grams for 90% dataset coverage
1,013,663 5-grams for 95% dataset coverage
The list of the 10 most common 5-grams is here below.
pentagram freq
797047 thank you so much for 102
98832 at the end of the 84
796480 thank you for the follow 81
394095 i love you so much 69
800373 thanks so much for the 67
496726 let me know if you 64
366404 hope you have a great 57
543982 mi run with nike gps 56
274322 for the first time in 55
798911 thanks for the follow i 54
Probabilities were calculated from the frequencies observed in the dataset according to the formula
\(P(w_{n}|w_{n-1}w_{n-2}w_{n-3}w_{n-4}) = \frac{C(w_{n-4}w_{n-3}w_{n-2}w_{n-1}w_n)}{\sum_{w}C(w_{n-4}w_{n-3}w_{n-2}w_{n-1}w)}=\frac{C(w_{n-4}w_{n-3}w_{n-2}w_{n-1}w_n)}{C(w_{n-4}w_{n-3}w_{n-2}w_{n-1})}\)
freq 5-grams removed N reduction
1 < 1 0 0.0000000
2 < 2 1035821 0.9262726
3 < 3 1092593 0.9770404
4 < 4 1100999 0.9845574
5 < 5 1105235 0.9883454
6 < 6 1107720 0.9905676
7 < 7 1109454 0.9921182
8 < 8 1110889 0.9934014
9 < 9 1111761 0.9941812
10 < 10 1112499 0.9948411
Most of the 5-grams appear only once in the dataset. Removing those 5-grams achieves a compression of 92.63%.
ngrams N Freq dataframe P dataframe Full P matrix
1-gram 8155 0.6 Mb 0.7 Mb <NA>
2-gram 143366 14.3 Mb 15.9 Mb 334 Mb
3-gram 76392 10.6 Mb 11.4 Mb 611.1 Mb
4-gram 99872 18.5 Mb 19.6 Mb 2620.6 Mb
5-gram 33756 8 Mb 8.4 Mb 885.3 Mb
As previously noted, the probability dataframe includes only the probability of ngrams observed in the training dataset, whereas the full probability matrix covers all the possible combinations (the majority of which are assigned a probability of 0 because they are unobserved in the training dataset). Comparing the sizes of the probability dataframe and matrix for each ngram model makes it clear that the vast majority of the probability matrix of an ngram model is occupied by 0s (sparse dataset). For this reason, instead of keeping the whole matrix in memory, it is more convenient (in terms of system resources) to store a dataframe of only the non-zero probabilities, assuming that ngrams excluded from the matrix have a probability equal to 0.
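A minimal sketch of such a lookup on the dataframe of observed 2-grams, using a hypothetical helper get_bigram_p that returns 0 for any unobserved combination (ngrams_2_freq is the dataframe described above):
# Probability lookup with an implicit 0 for unobserved 2-grams (sketch)
get_bigram_p <- function(w1, w2) {
  p <- ngrams_2_freq[ngrams_2_freq$word1 == w1 & ngrams_2_freq$word2 == w2, "P"]
  if (length(p) == 0) 0 else p
}
get_bigram_p("i", "will")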
In order to save some time, we used only a small fraction of the testing dataset for evaluating the accuracy of the different ngram models (6,006 words, 231 lines). The accuracy of each ngram model was evaluated by predicting each word based on the 1/2/3/4 previous ones, according to the model used (2/3/4/5-gram model). The time required for performing each prediction was also measured.
Accuracy 3w Accuracy 2w Accuracy 1w Coverage Perplexity Time (sec)
2-grams 0.15 0.13 0.09 0.79 NA 0.00376
3-grams 0.10 0.08 0.06 0.40 NA 0.00375
4-grams 0.03 0.03 0.03 0.14 NA 0.00762
5-grams 0.01 0.01 0.00 0.02 NA 0.00209
The table above summarizes the performance of the various models in terms of mean accuracy (accuracy on the 3rd, 2nd and 1st predicted word), testing dataset coverage and prediction time (in seconds). Perplexity was not calculated because of the existence of ngrams not observed in the training dataset (coverage < 1). Perplexity (\(PP\)) is the inverse probability of the test set (normalized for the number of words), therefore unobserved ngrams with probability equal to 0 imply an infinite value for perplexity. The next section illustrates some algorithms that, by combining different ngram models, are able to account for the probability of ngrams not observed in the training dataset.
\(PP(W)=\sqrt[n]{\prod_{i=1}^{n}{\frac{1}{P(w_i|w_{i-n+1}...w_{i-1})}}}\)
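As a minimal sketch, perplexity can be computed on the log scale to avoid numerical underflow (p is assumed to be a vector with one predicted probability per test word):
# Perplexity from per-word probabilities: exp of the negative mean log-probability (sketch)
perplexity <- function(p) exp(-mean(log(p)))
perplexity(c(0.1, 0.05, 0.2))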
Linear interpolation, smoothing and backoff are used to calculate the probability for ngrams not observed in the training dataset.
The idea behind linear interpolation is to combine the values of probabilities from the different n-grams models using a set of weights \(w_1\), \(w_2\) and \(w_3\) (where \(w_1\) corresponds to the weight assigned to probabilities from the 1-gram model, \(w_2\) to the 2-gram model and so on).
\(P_{int}(w_n|w_{n-1}w_{n-2})=w_1P(w_n)+w_2P(w_n|w_{n-1})+w_3P(w_n|w_{n-1}w_{n-2})\)
\(w_{1}+w_{2}+w_{3}=1\)
We generated 500 random combinations of 3 values between 0.0000001 and 1, then tested them against a subset of the training dataset. The combination providing the highest predictive power was then used to interpolate probabilities from the 1-gram, 2-gram and 3-gram models, evaluated against the testing dataset.
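A minimal sketch of one way the random weight combinations could be generated (the normalization step enforces the constraint that the three weights sum to 1; the exact sampling scheme shown here is an assumption):
# Draw 500 random weight triplets and normalize each row to sum to 1 (sketch)
set.seed(1234)
weights <- matrix(runif(500 * 3, min = 0.0000001, max = 1), ncol = 3)
weights <- weights / rowSums(weights)
colnames(weights) <- c("w1", "w2", "w3")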
Out of the 500 weight combinations tested, 269 achieve the highest accuracy of 0.21 on the third word predicted. We are going to use the first of the highest-scoring combinations (max_Accuracy_weights[1,c("w1","w2","w3")]) to evaluate the efficacy of linear interpolation between the 1-, 2- and 3-gram models. The code below shows how linear interpolation works (results presented in a table format).
# Inputs
final_weights <- max_Accuracy_weights[1,c("w1","w2","w3")]
text <- tolower("I will be")
words <- unlist(strsplit(tolower(text)," "))
# Probabilities calculated using the 1- 2- and 3-gram models
p_trigrams <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1], words[2], sep=" ") &
ngrams_3_freq['word3']==words[3], "P"]
p_bigrams <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] &
ngrams_2_freq['word2']==words[3], "P"]
p_unigrams <- 1/length(dictionary)
# Weighted probabilities for linear interpolation
p_unigrams_w <- p_unigrams * final_weights[1]
p_bigrams_w <- p_bigrams * final_weights[2]
p_trigrams_w <- p_trigrams * final_weights[3]
# Final probability
final_p <- p_unigrams_w + p_bigrams_w + p_trigrams_w
# Output table
values <- unname(c(p_unigrams,p_bigrams,p_trigrams,final_weights,
p_unigrams_w,p_bigrams_w,p_trigrams_w,
NA,NA,final_p))
matrix <- matrix(values, ncol=3, byrow=TRUE)
colnames(matrix) <- c("1-gram model","2-gram model","3-gram model")
rownames(matrix) <- c("P","weights","weighted P","total P")
round(matrix,3)
1-gram model 2-gram model 3-gram model
P 0.000 0.302 0.256
weights 0.131 0.176 0.694
weighted P 0.000 0.053 0.177
total P NA NA 0.230
The table below shows the performance of linear interpolation of 1-, 2- and 3-gram models on the testing dataset.
Accuracy 3w Accuracy 2w Accuracy 1w Coverage Perplexity Time (sec)
Linear int. 0.198 0.163 0.119 1 2414 0.03233
Laplace smoothing calculates the probability for unseen ngrams by adding one (or \(k\)) unit to the count of all the possible ngrams (both observed and unobserved). In this way the associated probability for unobserved ngrams increases from 0 to \(1/N\) (where N is the number of unique ngrams observed in the training dataset). The formula for calculating the ngram probabilities according to the Maximum Likelihood Estimation is adapted accordingly and presented below (for the 2-gram model).
\(P(w_{n}|w_{n-1}) = \frac{C(w_{n-1}w_n)+1}{C(w_{n-1})+V}\)
Where \(V\) is the length of the dictionary of individual words (1-grams).
Below is the implementation in R code, showing the probability associated with the 2-gram ‘will be’.
aggregate_2_grams <- aggregate(ngrams_2_freq$freq, by=list(Category=ngrams_2_freq$word1), FUN=sum)
get_bigrams_p_laplace_smoothing <- function(x) {
words <- unlist(strsplit(tolower(x)," ")[1])
if (length(words)>1 & (!"" %in% words)) {
head_word <- words[(length(words)-1)]
tail_word <- words[length(words)]
all_gram_length <- length(ngrams_1_freq$ngram)
all_bigrams <- paste(head_word,ngrams_1_freq$ngram, sep=" ")
#Named vector of unobserved 2-grams
freq_0_bigrams <- all_bigrams[!all_bigrams %in% ngrams_2_freq$bigram]
freq_0_bigrams_named_vector <- rep(0,length(freq_0_bigrams))
names(freq_0_bigrams_named_vector) <- freq_0_bigrams
#Named vector of observed 2-grams
freq_non_0_bigrams <- as.character(ngrams_2_freq[ngrams_2_freq['word1']==head_word, 'bigram'])
freq_non_0_bigrams_vector <- ngrams_2_freq[ngrams_2_freq['word1']==head_word, 'freq']
names(freq_non_0_bigrams_vector) <- freq_non_0_bigrams
#Calculate the named vector of sorted probabilities according to the MLE estimate with
#add-1 smoothing (Laplace smoothing)
freq_0_bigrams_named_vector_p <- (freq_0_bigrams_named_vector + 1)/all_gram_length
freq_non_0_bigrams_vector_p <- (freq_non_0_bigrams_vector+1)/
(aggregate_2_grams[aggregate_2_grams['Category']==head_word,'x'] +
all_gram_length)
final_p <- sort(c(freq_0_bigrams_named_vector_p,
freq_non_0_bigrams_vector_p),decreasing=TRUE)
if (!is.na(final_p[paste(head_word, tail_word, sep=" ")])) {
perplexity <- 1/final_p[paste(head_word, tail_word, sep=" ")]
} else {
perplexity <- length(dictionary)
}
list(final_p, perplexity)
} else {
list(NA,NA)
}
}
example <- "will be"
results <- get_bigrams_p_laplace_smoothing(example)
results[[1]]['will be']
will be
0.2138484
Accuracy 3w Accuracy 2w Accuracy 1w Coverage Perplexity Time (sec)
Laplace 5g 0.076 0.055 0.037 0.893 77 0.00679
Laplace 4g 0.094 0.074 0.056 0.907 105 0.01550
Laplace 3g 0.147 0.118 0.081 0.914 70 0.01117
Laplace 2g 0.166 0.136 0.093 0.938 371 0.01368
Backoff algorithms resort to lower-order ngram models when they encounter a combination of words with an associated probability of 0 (i.e. not present in the training dataset). For example, if the probability of the 3-gram “I will be” is 0, then the algorithm will consider the probability of the 2-gram “will be”, and so on. The simplest backoff is called “Stupid backoff”. This approach applies a discount factor of 0.4 to the probability obtained from each lower-level n-gram model. An example is shown here below (formula and R implementation).
\(P(w_i|w_{i-n+1}...w_{i-1})=\begin{cases}\frac{C(w_{i-n+1}...w_{i-1}w_i)}{C(w_{i-n+1}...w_{i-1})},& \text{if } C(w_{i-n+1}...w_{i-1}w_i)> 0\\0.4*P_{bo}(w_i|w_{i-n+2}...w_{i-1})& \text{otherwise }\end{cases}\)
Where \(P_{bo}\) represents the probability associated with the lower-level n-gram. This process is repeated until the algorithm encounters \(P_{bo}>0\) or the lowest-level ngram model is reached (the 1-gram model, where each word is associated with a probability equal to \(1/V\), where \(V\) is the length of the 1-gram dictionary). The code below shows the steps required for calculating the probability associated with the 5-grams “thank you so much for” and “I will be a mechanic” using the SBO algorithm.
# process words
example1 <- "Thank you so much for this"
example2 <- "I will be a mechanic"
stupid_back_off <- function(x) {
words <- strsplit(tolower(x)," ")[[1]]
#calculates probability based on 5-gram model
p <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(words[1:4], collapse=" ") &
ngrams_5_freq['word5'] == words[5], 'P']
#back-off to 4-gram model
if (length(p)==0) {
p <- 0.4*ngrams_4_freq[ngrams_4_freq['trigram']==paste(words[2:4], collapse=" ") &
ngrams_4_freq['word4'] == words[5], 'P']
}
#back-off to 3-gram model
if (length(p)==0) {
p <- 0.4*0.4*ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[3:4], collapse=" ") &
ngrams_3_freq['word3'] == words[5], 'P']
}
#back-off to 2-gram model
if (length(p)==0) {
p <- 0.4*0.4*0.4*ngrams_2_freq[ngrams_2_freq['word1']==words[4] &
ngrams_2_freq['word2'] == words[5], 'P']
}
#back-off to 1-gram model
if (length(p)==0) {
p <- 0.4*0.4*0.4*0.4*(1/length(dictionary))
}
p
}
stupid_back_off(example1)
[1] 0.7083333
stupid_back_off(example2)
[1] 3.139178e-06
When tested on the testing dataset, this is the performance of the Stupid back-off algorithm on the 5-, 4-, 3-, 2- and 1-gram models.
Accuracy 3w Accuracy 2w Accuracy 1w Coverage Perplexity Time (sec)
SBO 5->1g 0.164 0.135 0.094 0.756 23587 0.01585
SBO 3->1g 0.184 0.151 0.105 0.794 2034 0.00760
A more advanced version of back-off is the Katz back-off algorithm. In this case a certain amount of probability is discounted from the higher-level ngram model and redistributed to the lower level to account for unobserved ngrams not included in the training dataset. This solution is more mathematically sound because the total probability associated with the entire set of ngram models is kept equal to 1. For example, discounting 0.2 from the 3-gram model will result in 0.8 total probability remaining and 0.2 redistributed to the 2-gram model. This process is repeated as described for every lower ngram model. The amount of probability to be discounted can be defined by the user, estimated via simulations or computed with the Good-Turing method. Good-Turing frequency estimation (https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation) is commonly used for estimating the frequency of unobserved events and it is implemented in the edgeR package via the goodTuring function. An example is presented here below (formula and R implementation with the ‘i will be’ 3-gram).
\(P(w_i|w_{i-n+1}...w_{i-1})=\begin{cases}d_{w_{i-n+1}...w_{i}}\frac{C(w_{i-n+1}...w_{i-1}w_i)}{C(w_{i-n+1}...w_{i-1})},& \text{if } C(w_{i-n+1}...w_{i-1}w_i)> k\\\alpha_{w_{i-n+1}...w_{i-1}}P_{bo}(w_i|w_{i-n+2}...w_{i-1})& \text{otherwise }\end{cases}\)
where
\(\alpha_{w_{i-n+1}...w_{i-1}}=\frac{\beta_{w_{i-n+1}...w_{i-1}}}{\sum_{w_i:C(w_{i-n+1}...w_i)\leq k}P_{bo}(w_i|w_{i-n+2}...w_{i-1})}\)
with
\(\beta_{w_{i-n+1}...w_{i-1}} = 1 - \sum_{w_i:C(w_{i-n+1}...w_i)>k}d_{w_{i-n+1}...w_{i}}\frac{C(w_{i-n+1}...w_{i-1}w_i)}{C(w_{i-n+1}...w_{i-1})}\)
# process words
library(edgeR)
#calculate discounts based on the Good-Turing estimation for unobserved ngrams
discount_5_gram <- sum(goodTuring(ngrams_5_freq$freq)['proportion'][[1]])
discount_4_gram <- sum(goodTuring(ngrams_4_freq$freq)['proportion'][[1]])
discount_3_gram <- sum(goodTuring(ngrams_3_freq$freq)['proportion'][[1]])
discount_2_gram <- sum(goodTuring(ngrams_2_freq$freq)['proportion'][[1]])
example <- "i will be"
words <- unlist(strsplit(example," "))
#discounted 3-gram probabilities
p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1:2],collapse=" "), 'P'] * discount_3_gram
tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1:2],collapse=" "), 'word3']
names(p_3_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_3_gram)
discounted_P_3_gram <- 1 - sum(p_3_gram)
#discounted 2-gram probabilities
p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] &
!ngrams_2_freq$word2 %in% tail_word_3_gram, 'P']
p_2_gram <- p_2_gram * discounted_P_3_gram/sum(p_2_gram) * discount_2_gram
tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] &
!ngrams_2_freq$word2 %in% tail_word_3_gram, 'word2']
names(p_2_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_2_gram, sep=" ")
discounted_P_2_gram <- 1 - sum(p_2_gram) - sum(p_3_gram)
#discounted 1-gram probabilities
tail_word_1_gram <- dictionary[!dictionary %in% tail_word_3_gram &
!dictionary %in% tail_word_2_gram]
p_1_gram <- 1/length(dictionary) * discounted_P_2_gram/sum(1/length(dictionary) * length(tail_word_1_gram))
p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))
names(p_1_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_1_gram)
#final probability vector
p_final <- c(p_3_gram, p_2_gram, p_1_gram)
p_final['i will be']
i will be
0.02666113
The performance of the algorithm implementing the Katz backoff on the testing dataset is summarized in the table here below.
Accuracy 3w Accuracy 2w Accuracy 1w Coverage Perplexity Time (sec)
KBO 5->1g 0.149 0.125 0.081 0.756 959 0.01677
KBO 3->1g 0.167 0.140 0.094 0.794 611 0.00819
The table below compares the performance of the individual models and of the algorithms based on linear interpolation, Laplace smoothing, Stupid and Katz backoff. Linear interpolation was used to merge the probability distributions of the 3-, 2- and 1-gram models, whereas Laplace smoothing was applied to all the models. The backoff algorithms were used to compute the probabilities backing off from the 5-gram and the 3-gram models down to 1-grams (5->1-gram and 3->1-gram).
Accuracy.3w Accuracy.2w Accuracy.1w Coverage Perplexity Time..sec.
2-grams 0.150 0.130 0.090 0.790 NA 0.00376
3-grams 0.100 0.080 0.060 0.400 NA 0.00375
4-grams 0.030 0.030 0.030 0.140 NA 0.00762
5-grams 0.010 0.010 0.000 0.020 NA 0.00209
Linear int. 0.198 0.163 0.119 1.000 2414 0.03233
Laplace 5g 0.076 0.055 0.037 0.893 77 0.00679
Laplace 4g 0.094 0.074 0.056 0.907 105 0.01550
Laplace 3g 0.147 0.118 0.081 0.914 70 0.01117
Laplace 2g 0.166 0.136 0.093 0.938 371 0.01368
SBO 5->1g 0.164 0.135 0.094 0.756 23587 0.01585
SBO 3->1g 0.184 0.151 0.105 0.794 2034 0.00760
KBO 5->1g 0.149 0.125 0.081 0.756 959 0.01677
KBO 3->1g 0.167 0.140 0.094 0.794 611 0.00819
The results above show that integrating different n-gram models clearly improves prediction accuracy. Linear interpolation of the 3-, 2- and 1-gram models achieves the highest score in terms of accuracy, whereas the Katz back-off algorithm achieves the best perplexity among the combined models (with a considerably lower coverage, only slightly better than the much simpler 2-gram model). As expected, the individual n-gram models are considerably faster than the more complex algorithms, but their lower accuracy does not justify their use in the final shinyapp. Surprisingly, the Stupid back-off algorithm has an accuracy superior to the Katz backoff, but with a higher perplexity. Back-off algorithms including 5 models (from the 5- to the 1-gram model) are considerably slower than those including only 3 models (from the 3- to the 1-gram model), without a clear improvement in terms of accuracy and perplexity. Laplace smoothing is unimpressive in terms of both performance and accuracy.
The linear interpolation, Stupid back-off and Katz back-off algorithms were finally selected for developing the Shiny app Pr3d1ct0, already available at filippo-ciceri.shinyapps.io/Pr3d1ct0/. A quick introduction to Pr3d1ct0 is also available here.
data_blogs <- readLines("Coursera-SwiftKey-en_US\\en_US.blogs.txt")
number_lines_blogs <- format(length(data_blogs), nsmall=0, big.mark=',')
line_length_blogs <- unlist(lapply(data_blogs, nchar))
max_line_length_blogs <- format(max(line_length_blogs), nsmall=0, big.mark=',')
min_line_length_blogs <- format(min(line_length_blogs), nsmall=0, big.mark=',')
mean_line_length_blogs <- format(mean(line_length_blogs), nsmall=3, big.mark=',')
text_blogs <- paste(data_blogs, collapse=" ")
all_words_blogs <- strsplit(text_blogs, " ")[[1]]
characters_only_blogs <- all_words_blogs[grep("^[a-zA-Z]+$",all_words_blogs)]
digits_only_blogs <- all_words_blogs[grep("^[0-9]+$",all_words_blogs)]
characters_and_digits_blogs <- all_words_blogs[grep("^[a-zA-Z][0-9]+$", all_words_blogs)]
x <- all_words_blogs[grep("[a-zA-Z][^0-9]",all_words_blogs)]
characters_and_punctuations_blogs <- x[!(x %in% characters_only_blogs)]
y <- all_words_blogs[grep("[0-9]+",all_words_blogs)]
digits_and_punctuations_blogs <- y[!(y %in% digits_only_blogs)]
rm(x)
rm(y)
characters_and_punctuations_blogs <- gsub("â???T","'",characters_and_punctuations_blogs)
characters_and_punctuations_blogs <- gsub("[â???\\W\\W]","",characters_and_punctuations_blogs)
characters_and_punctuations_blogs <- gsub("^\\W|\\W$","",characters_and_punctuations_blogs)
characters_only_blogs <- c(characters_only_blogs, gsub("^\\W|\\W$","",characters_and_punctuations_blogs))
all_words_blogs_n <- format(length(all_words_blogs), nsmall=0, big.mark=',')
characters_only_blogs <- format(length(characters_only_blogs), nsmall=0, big.mark=',')
digits_only_blogs <- format(length(digits_only_blogs), nsmall=0, big.mark=',')
characters_and_punctuations_blogs <- format(length(characters_and_punctuations_blogs), nsmall=0, big.mark=',')
digits_and_punctuations_blogs <- format(length(digits_and_punctuations_blogs), nsmall=0, big.mark=',')
rm(text_blogs)
data_news <- readLines("Coursera-SwiftKey-en_US\\en_US.news.txt")
number_lines_news <- format(length(data_news), nsmall=0, big.mark=',')
line_length_news <- unlist(lapply(data_news, nchar))
max_line_length_news <- format(max(line_length_news), nsmall=0, big.mark=',')
min_line_length_news <- format(min(line_length_news), nsmall=0, big.mark=',')
mean_line_length_news <- format(mean(line_length_news), nsmall=3, big.mark=',')
text_news <- paste(data_news, collapse=" ")
all_words_news <- strsplit(text_news, " ")[[1]]
characters_only_news <- all_words_news[grep("^[a-zA-Z]+$",all_words_news)]
digits_only_news <- all_words_news[grep("^[0-9]+$",all_words_news)]
characters_and_digits_news <- all_words_news[grep("^[a-zA-Z][0-9]+$", all_words_news)]
x <- all_words_news[grep("[a-zA-Z][^0-9]",all_words_news)]
characters_and_punctuations_news <- x[!(x %in% characters_only_news)]
y <- all_words_news[grep("[0-9]+",all_words_news)]
digits_and_punctuations_news <- y[!(y %in% digits_only_news)]
rm(x)
rm(y)
characters_and_punctuations_news <- gsub("â???T","'",characters_and_punctuations_news)
characters_and_punctuations_news <- gsub("[â???\\W\\W]","",characters_and_punctuations_news)
characters_and_punctuations_news <- gsub("^\\W|\\W$","",characters_and_punctuations_news)
characters_only_news <- c(characters_only_news, gsub("^\\W|\\W$","",characters_and_punctuations_news))
all_words_news_n <- format(length(all_words_news), nsmall=0, big.mark=',')
characters_only_news <- format(length(characters_only_news), nsmall=0, big.mark=',')
digits_only_news <- format(length(digits_only_news), nsmall=0, big.mark=',')
characters_and_punctuations_news <- format(length(characters_and_punctuations_news), nsmall=0, big.mark=',')
digits_and_punctuations_news <- format(length(digits_and_punctuations_news), nsmall=0, big.mark=',')
rm(text_news)
data_twitter <- readLines("Coursera-SwiftKey-en_US\\en_US.twitter.txt")
number_lines_twitter <- format(length(data_twitter), nsmall=0, big.mark=',')
line_length_twitter <- unlist(lapply(data_twitter, nchar))
max_line_length_twitter <- format(max(line_length_twitter), nsmall=0, big.mark=',')
min_line_length_twitter <- format(min(line_length_twitter), nsmall=0, big.mark=',')
mean_line_length_twitter <- format(mean(line_length_twitter), nsmall=3, big.mark=',')
text_twitter <- paste(data_twitter, collapse=" ")
all_words_twitter <- strsplit(text_twitter, " ")[[1]]
characters_only_twitter <- all_words_twitter[grep("^[a-zA-Z]+$",all_words_twitter)]
digits_only_twitter <- all_words_twitter[grep("^[0-9]+$",all_words_twitter)]
characters_and_digits_twitter <- all_words_twitter[grep("^[a-zA-Z][0-9]+$", all_words_twitter)]
x <- all_words_twitter[grep("[a-zA-Z][^0-9]",all_words_twitter)]
characters_and_punctuations_twitter <- x[!(x %in% characters_only_twitter)]
y <- all_words_twitter[grep("[0-9]+",all_words_twitter)]
digits_and_punctuations_twitter <- y[!(y %in% digits_only_twitter)]
rm(x)
rm(y)
characters_and_punctuations_twitter <- gsub("â???T","'",characters_and_punctuations_twitter)
characters_and_punctuations_twitter <- gsub("[â???\\W\\W]","",characters_and_punctuations_twitter)
characters_and_punctuations_twitter <- gsub("^\\W|\\W$","",characters_and_punctuations_twitter)
characters_only_twitter <- c(characters_only_twitter, gsub("^\\W|\\W$","",characters_and_punctuations_twitter))
all_words_twitter_n <- format(length(all_words_twitter), nsmall=0, big.mark=',')
characters_only_twitter <- format(length(characters_only_twitter), nsmall=0, big.mark=',')
digits_only_twitter <- format(length(digits_only_twitter), nsmall=0, big.mark=',')
characters_and_punctuations_twitter <- format(length(characters_and_punctuations_twitter), nsmall=0, big.mark=',')
digits_and_punctuations_twitter <- format(length(digits_and_punctuations_twitter), nsmall=0, big.mark=',')
rm(text_twitter)
set.seed(1234)
training_dataset <- c(data_blogs[sample(length(data_blogs), 0.4*length(data_blogs), replace=FALSE)],
data_news[sample(length(data_news), 0.4*length(data_news), replace=FALSE)],
data_twitter[sample(length(data_twitter), 0.4*length(data_twitter), replace=FALSE)])
testing_dataset <- c(data_blogs[!data_blogs %in% training_dataset],
data_news[!data_news %in% training_dataset],
data_twitter[!data_twitter %in% training_dataset])
training_dataset <- training_dataset[!is.na(training_dataset)]
testing_dataset <- testing_dataset[!is.na(testing_dataset)]
rm(data_news)
rm(data_twitter)
rm(data_blogs)
rm(all_words_blogs)
rm(all_words_news)
rm(all_words_twitter)
training_dataset <- tolower(gsub("[^a-zA-Z' ]","",training_dataset))
training_dataset <- gsub("( ') | (' ) | ^' | '$","",training_dataset)
text_training <- paste(training_dataset, collapse=" ")
all_words <- regmatches(text_training,gregexpr("[a-z']+",text_training))[[1]]
all_words_n_1 <- format(length(all_words), nsmall=0, big.mark=',')
prophanities <- readLines("prophanities_list.txt")
prophanities <- gsub('.{3}$', '', prophanities)
all_words <- all_words[!all_words %in% prophanities]
all_words_n_2 <- format(length(all_words), nsmall=0, big.mark=',')
word_frequency_df <- data.frame(sort(table(all_words), decreasing =TRUE))
word_frequency_df['P'] <- word_frequency_df$Freq / sum(word_frequency_df$Freq)
word_frequency_df['cum_P'] <- cumsum(word_frequency_df$P)
# plot(word_frequency_df[word_frequency_df['cum_P']<0.9,'Freq'],
# ylim=c(0,100000),
# ylab='Freq in all_words',
# xlab='Index of words providing 90% dataset coverage')
coverage_50 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.5,])[1]
coverage_90 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.9,])[1]
coverage_95 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.95,])[1]
coverage_50 <- format(coverage_50, nsmall=0, big.mark=',')
coverage_90 <- format(coverage_90, nsmall=0, big.mark=',')
coverage_95 <- format(coverage_95, nsmall=0, big.mark=',')
word_frequency_df[1:10,c("all_words","Freq")]
english_dict <- readLines('english_dict_10000.csv')
english_dict <- tolower(english_dict)
english_dict <- unique(english_dict[grep("[a-zA-Z']+",english_dict)])
p_OOV <- mean(!all_words %in% english_dict)*100
p_OOV <- format(round(as.numeric(p_OOV),2), nsmall=2)
all_words <- all_words[all_words %in% english_dict]
word_frequency_df <- data.frame(sort(table(all_words), decreasing =TRUE))
word_frequency_df['P'] <- word_frequency_df$Freq / sum(word_frequency_df$Freq)
word_frequency_df['cum_P'] <- cumsum(word_frequency_df$P)
# plot(word_frequency_df[word_frequency_df['cum_P']<0.9,'Freq'],
# ylim=c(0,100000),
# ylab='Freq in all_words',
# xlab='Index of words providing 90% dataset coverage')
coverage_50 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.5,])[1]
coverage_90 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.9,])[1]
coverage_95 <- dim(word_frequency_df[word_frequency_df['cum_P']<=0.95,])[1]
coverage_50 <- format(coverage_50, nsmall=0, big.mark=',')
coverage_90 <- format(coverage_90, nsmall=0, big.mark=',')
coverage_95 <- format(coverage_95, nsmall=0, big.mark=',')
# word_frequency_df[word_frequency_df['all_words']=='agreeing',]
library(SnowballC)
all_words_stemmed <- wordStem(all_words, language = "en")
word_frequency_df_stemmed <- data.frame(sort(table(all_words_stemmed), decreasing =TRUE))
word_frequency_df_stemmed['P'] <- word_frequency_df_stemmed$Freq/sum(word_frequency_df_stemmed$Freq)
word_frequency_df_stemmed['cumP'] <- cumsum(word_frequency_df_stemmed$P)
coverage_50_stemmed <- word_frequency_df_stemmed[word_frequency_df_stemmed['cumP']<=0.50,]
coverage_90_stemmed <- word_frequency_df_stemmed[word_frequency_df_stemmed['cumP']<=0.90,]
coverage_95_stemmed <- word_frequency_df_stemmed[word_frequency_df_stemmed['cumP']<=0.95,]
coverage_50_stemmed <- format(dim(coverage_50_stemmed)[1], nsmall=0, big.mark=',')
coverage_90_stemmed <- format(dim(coverage_90_stemmed)[1], nsmall=0, big.mark=',')
coverage_95_stemmed <- format(dim(coverage_95_stemmed)[1], nsmall=0, big.mark=',')
rm(all_words_stemmed)
rm(word_frequency_df_stemmed)
ngrams_1 <- all_words
ngrams_1_freq <- word_frequency_df
names(ngrams_1_freq) <- c("ngram","Freq","P","cum_P")
ngrams_1_freq['ngram'] <- as.character(ngrams_1_freq$ngram)
dictionary <- ngrams_1_freq$ngram
text_training <- tolower(text_training)
ngrams_2 <- regmatches(text_training, gregexpr("([a-z']+ [a-z']+)", text_training))[[1]]
ngrams_2 <- ngrams_2[lapply(ngrams_2,length)>0]
total_ngrams_2 <- format(length(ngrams_2), big.mark=',')
unique_ngrams_2 <- format(length(unique(ngrams_2)), big.mark=',')
ngrams_2_2 <- matrix(unlist(strsplit(ngrams_2," ")), ncol=2, byrow=TRUE)
ngrams_2 <- ngrams_2[(!ngrams_2_2[,1] %in% prophanities)&
(!ngrams_2_2[,2] %in% prophanities)&
(ngrams_2_2[,1] %in% english_dict)&
(ngrams_2_2[,2] %in% english_dict)]
rm(ngrams_2_2)
total_ngrams_2 <- format(length(ngrams_2), big.mark=',')
unique_ngrams_2 <- format(length(unique(ngrams_2)), big.mark=',')
ngrams_2_freq <- data.frame(table(ngrams_2))
names(ngrams_2_freq) <- c("bigram","freq")
ngrams_2_freq['bigram'] <- as.character(ngrams_2_freq$bigram)
ngrams_2_freq['word1'] <- sapply(strsplit(ngrams_2_freq$bigram," "), `[`, 1)
ngrams_2_freq['word2'] <- sapply(strsplit(ngrams_2_freq$bigram," "), `[`, 2)
ngrams_2_freq <- ngrams_2_freq[,c('bigram','word1','word2','freq')]
ngrams_2_freq <- ngrams_2_freq[order(ngrams_2_freq$freq,decreasing = TRUE),]
ngrams_2_freq['cum_freq'] <- cumsum(ngrams_2_freq$freq)/sum(ngrams_2_freq$freq)
plot(ngrams_2_freq[ngrams_2_freq['cum_freq']<0.9,'freq'],
ylim=c(0,1000),
ylab='Freq in training dataset',
xlab='Index of 2-grams providing 90% dataset coverage',
type='l',
main='2-grams sorted based on frequency')
coverage_50 <- ngrams_2_freq[ngrams_2_freq['cum_freq']<=0.5,]
coverage_90 <- ngrams_2_freq[ngrams_2_freq['cum_freq']<=0.9,]
coverage_95 <- ngrams_2_freq[ngrams_2_freq['cum_freq']<=0.95,]
coverage_50 <- format(dim(coverage_50)[1], nsmall=0, big.mark=',')
coverage_90 <- format(dim(coverage_90)[1], nsmall=0, big.mark=',')
coverage_95 <- format(dim(coverage_95)[1], nsmall=0, big.mark=',')
ngrams_2_freq[1:10,c("bigram","freq")]
pruning <- 1:10
ngrams_2_number <- sapply(pruning, FUN=function(x) {
sum(ngrams_2_freq[ngrams_2_freq['freq']<x,'freq'])
})
compression <- sapply(ngrams_2_number, FUN=function(x) {
x/length(ngrams_2)
})
df_p <- data.frame(pruning,ngrams_2_number,compression)
df_p[,'pruning'] <- sapply(df_p[,'pruning'], FUN=function(x){
paste("<",as.character(x))
})
names(df_p) <- c("freq 2-grams removed","N","reduction")
df_p
ngrams_2_freq <- ngrams_2_freq[ngrams_2_freq['freq']>3,]
# here below there is my first approach to calculate 2-gram probabilities
# it is not incorrect per se, but it generate problems when the pruning process removes different
# ngrams from the 2-, 3-, 4- and 5-grams models.
#
# bigram_matrix <- merge(ngrams_2_freq[,c('word1','word2','freq')],
# ngrams_1_freq[,c('ngram','Freq')],
# by.x="word1",
# by.y="ngram",
# all=TRUE)
#
# the new approach deals better with the consequences of the pruning process
aggregate_1_gram <- aggregate(ngrams_2_freq$freq, by=list(Category=ngrams_2_freq$word1), FUN=sum)
ngrams_2_freq <- merge(ngrams_2_freq,
aggregate_1_gram,
by.x='word1',
by.y='Category')
names(ngrams_2_freq)[6] <- c('word1_freq')
ngrams_2_freq['P'] <- ngrams_2_freq['freq']/ngrams_2_freq['word1_freq']
# dim(ngrams_2_freq[is.na(ngrams_2_freq['word2']),])
# dim(ngrams_2_freq[is.na(ngrams_2_freq['word1']),])
# ngrams_2_freq[is.na(ngrams_2_freq['P']),'P']
size_bigram_freq <- format(object.size(ngrams_2_freq[,c(1:5)]), units = "Mb", standard = "auto", digits = 1L)
size_bigram_P <- format(object.size(ngrams_2_freq), units = "Mb", standard = "auto", digits = 1L)
library(reshape2)  # provides acast for casting the dataframe into a word1-by-word2 matrix
bigram_matrix <- acast(ngrams_2_freq, word1~word2, value.var="P")
# sum(bigram_matrix_2[is.na(bigram_matrix_2)])
size_bigram_matrix <- format(object.size(bigram_matrix), units = "Mb", standard = "auto", digits = 1L)
# table(ngrams_2)['i will']/table(ngrams_1)['i']
# ngrams_2_freq[ngrams_2_freq['word1']=='i'& ngrams_2_freq['word2']=='will','P']
# sum(ngrams_2_freq[ngrams_2_freq['word1']=='i','P'])
# bigram_matrix['i','will']
text_training <- tolower(text_training)
ngrams_3 <- regmatches(text_training,gregexpr("([a-z']+ [a-z']+ [a-z']+)",text_training))[[1]]
ngrams_3 <- ngrams_3[!is.na(ngrams_3)]
total_ngrams_3 <- format(length(ngrams_3), big.mark=',')
unique_ngrams_3 <- format(length(unique(ngrams_3)), big.mark=',')
ngrams_3_2 <- matrix(unlist(strsplit(ngrams_3," ")), ncol=3, byrow=TRUE)
ngrams_3 <- ngrams_3[(!ngrams_3_2[,1] %in% prophanities)&
(!ngrams_3_2[,2] %in% prophanities)&
(!ngrams_3_2[,3] %in% prophanities)&
(ngrams_3_2[,1] %in% english_dict)&
(ngrams_3_2[,2] %in% english_dict)&
(ngrams_3_2[,3] %in% english_dict)]
rm(ngrams_3_2)
total_ngrams_3 <- format(length(ngrams_3), big.mark=',')
unique_ngrams_3 <- format(length(unique(ngrams_3)), big.mark=',')
ngrams_3_freq <- data.frame(table(sapply(ngrams_3, paste, collapse = " ")))
names(ngrams_3_freq) <- c("trigram","freq")
ngrams_3_freq['trigram'] <- as.character(ngrams_3_freq$trigram)
ngrams_3_freq['word1'] <- sapply(strsplit(as.character(ngrams_3_freq$trigram)," "), `[`, 1)
ngrams_3_freq['word2'] <- sapply(strsplit(as.character(ngrams_3_freq$trigram)," "), `[`, 2)
ngrams_3_freq['word3'] <- sapply(strsplit(as.character(ngrams_3_freq$trigram)," "), `[`, 3)
ngrams_3_freq <- ngrams_3_freq[,c('trigram','word1','word2','word3','freq')]
ngrams_3_freq <- ngrams_3_freq[order(ngrams_3_freq$freq,decreasing = TRUE),]
ngrams_3_freq['cum_freq'] <- cumsum(ngrams_3_freq$freq)/sum(ngrams_3_freq$freq)
plot(ngrams_3_freq[ngrams_3_freq['cum_freq']<0.9,'freq'],
ylim=c(0,200),
ylab='Freq in training dataset',
xlab='Index of 3-grams providing 90% dataset coverage',
type='l',
main='3-grams sorted based on frequency')
coverage_50 <- ngrams_3_freq[ngrams_3_freq['cum_freq']<=0.5,]
coverage_90 <- ngrams_3_freq[ngrams_3_freq['cum_freq']<=0.9,]
coverage_95 <- ngrams_3_freq[ngrams_3_freq['cum_freq']<=0.95,]
coverage_50 <- format(dim(coverage_50)[1], nsmall=0, big.mark=',')
coverage_90 <- format(dim(coverage_90)[1], nsmall=0, big.mark=',')
coverage_95 <- format(dim(coverage_95)[1], nsmall=0, big.mark=',')
ngrams_3_freq[1:10,c("trigram","freq")]
pruning <- 1:10
ngrams_3_number <- sapply(pruning, FUN=function(x) {
sum(ngrams_3_freq[ngrams_3_freq['freq']<x,'freq'])
})
compression <- sapply(ngrams_3_number, FUN=function(x) {
x/length(ngrams_3)
})
df_p <- data.frame(pruning,ngrams_3_number,compression)
df_p[,'pruning'] <- sapply(df_p[,'pruning'], FUN=function(x){
paste("<",as.character(x))
})
names(df_p) <- c("freq 3-grams removed","N","reduction")
df_p
ngrams_3_freq <- ngrams_3_freq[ngrams_3_freq['freq']>3,]
ngrams_3_freq['bigram'] <- paste(ngrams_3_freq$word1, ngrams_3_freq$word2, sep=" ")
#see comment in the 2-gram section regarding this
# trigram_matrix <- merge(ngrams_3_freq[,c('bigram','word3','freq')],
# ngrams_2_freq[,c('bigram','freq')],
# by.x="bigram",
# by.y="bigram",
# all=TRUE)
aggregate_2_gram <- aggregate(ngrams_3_freq$freq, by=list(Category=ngrams_3_freq$bigram), FUN=sum)
ngrams_3_freq <- merge(ngrams_3_freq,
aggregate_2_gram,
by.x='bigram',
by.y='Category')
names(ngrams_3_freq)[8] <- c('bigram_freq')
ngrams_3_freq['P'] <- ngrams_3_freq['freq']/ngrams_3_freq['bigram_freq']
# dim(ngrams_3_freq[is.na(ngrams_3_freq['word3']),])
# dim(ngrams_3_freq[is.na(ngrams_3_freq['word2']),])
# dim(ngrams_3_freq[is.na(ngrams_3_freq['word1']),])
# ngrams_3_freq[is.na(ngrams_3_freq['P']),'P']
size_trigram_freq <- format(object.size(ngrams_3_freq[,c(1:7)]), units = "Mb", standard = "auto", digits = 1L)
size_trigram_P <- format(object.size(ngrams_3_freq), units = "Mb", standard = "auto", digits = 1L)
trigram_matrix <- acast(ngrams_3_freq, bigram~word3, value.var="P")
# sum(trigram_matrix_2[is.na(trigram_matrix_2)])
size_trigram_matrix <- format(object.size(trigram_matrix), units = "Mb", standard = "auto", digits = 1L)
# table(ngrams_3)['thanks for the']/table(ngrams_2)['thanks for']
# ngrams_3_freq[ngrams_3_freq['bigram']=='thanks for' & ngrams_3_freq['word3']=='the','P']
# sum(ngrams_3_freq[ngrams_3_freq['bigram']=='thanks for','P'])
# trigram_matrix['thanks for','the']
text_training <- tolower(text_training)
ngrams_4 <- regmatches(text_training,gregexpr("([a-z']+ [a-z']+ [a-z']+ [a-z']+)",text_training))[[1]]
ngrams_4 <- ngrams_4[!is.na(ngrams_4)]
ngrams_4_2 <- matrix(unlist(strsplit(ngrams_4," ")), ncol=4, byrow=TRUE)
ngrams_4 <- ngrams_4[(!ngrams_4_2[,1] %in% prophanities)&
(!ngrams_4_2[,2] %in% prophanities)&
(!ngrams_4_2[,3] %in% prophanities)&
(!ngrams_4_2[,4] %in% prophanities)&
(ngrams_4_2[,1] %in% english_dict)&
(ngrams_4_2[,2] %in% english_dict)&
(ngrams_4_2[,3] %in% english_dict)&
(ngrams_4_2[,4] %in% english_dict)]
rm(ngrams_4_2)
ngrams_4 <- ngrams_4[!is.na(ngrams_4)]
total_ngrams_4 <- format(length(ngrams_4), big.mark=',')
unique_ngrams_4 <- format(length(unique(ngrams_4)), big.mark=',')
ngrams_4_freq <- data.frame(table(sapply(ngrams_4, paste, collapse = " ")))
names(ngrams_4_freq) <- c("tetragram","freq")
ngrams_4_freq['tetragram'] <- as.character(ngrams_4_freq$tetragram)
ngrams_4_freq['word1'] <- sapply(strsplit(as.character(ngrams_4_freq$tetragram)," "), `[`, 1)
ngrams_4_freq['word2'] <- sapply(strsplit(as.character(ngrams_4_freq$tetragram)," "), `[`, 2)
ngrams_4_freq['word3'] <- sapply(strsplit(as.character(ngrams_4_freq$tetragram)," "), `[`, 3)
ngrams_4_freq['word4'] <- sapply(strsplit(as.character(ngrams_4_freq$tetragram)," "), `[`, 4)
ngrams_4_freq <- ngrams_4_freq[,c('tetragram','word1','word2','word3','word4','freq')]
ngrams_4_freq <- ngrams_4_freq[order(ngrams_4_freq$freq,decreasing = TRUE),]
ngrams_4_freq['cum_freq'] <- cumsum(ngrams_4_freq$freq)/sum(ngrams_4_freq$freq)
coverage_50 <- ngrams_4_freq[ngrams_4_freq['cum_freq']<=0.5,]
coverage_90 <- ngrams_4_freq[ngrams_4_freq['cum_freq']<=0.9,]
coverage_95 <- ngrams_4_freq[ngrams_4_freq['cum_freq']<=0.95,]
coverage_50 <- format(dim(coverage_50)[1], nsmall=0, big.mark=',')
coverage_90 <- format(dim(coverage_90)[1], nsmall=0, big.mark=',')
coverage_95 <- format(dim(coverage_95)[1], nsmall=0, big.mark=',')
ngrams_4_freq[1:10,c("tetragram","freq")]
pruning <- 1:10
ngrams_4_number <- sapply(pruning, FUN=function(x) {
sum(ngrams_4_freq[ngrams_4_freq['freq']<x,'freq'])
})
compression <- sapply(ngrams_4_number, FUN=function(x) {
x/length(ngrams_4)
})
df_p <- data.frame(pruning,ngrams_4_number,compression)
df_p[,'pruning'] <- sapply(df_p[,'pruning'], FUN=function(x){
paste("<",as.character(x))
})
names(df_p) <- c("freq 4-grams removed","N","reduction")
df_p
ngrams_4_freq <- ngrams_4_freq[ngrams_4_freq['freq']>1,]
ngrams_4_freq['trigram'] <- paste(ngrams_4_freq$word1, ngrams_4_freq$word2, ngrams_4_freq$word3, sep=" ")
#see comment in the 2-gram section regarding this
# tetragram_matrix <- merge(ngrams_4_freq[,c('trigram','word4','freq')],
# ngrams_3_freq[,c('trigram','freq')],
# by.x="trigram",
# by.y="trigram",
# all=TRUE)
aggregate_3_gram <- aggregate(ngrams_4_freq$freq, by=list(Category=ngrams_4_freq$trigram), FUN=sum)
ngrams_4_freq <- merge(ngrams_4_freq,
aggregate_3_gram,
by.x='trigram',
by.y='Category')
names(ngrams_4_freq)[9] <- c('trigram_freq')
ngrams_4_freq['P'] <- ngrams_4_freq['freq']/ngrams_4_freq['trigram_freq']
# dim(ngrams_4_freq[is.na(ngrams_4_freq['word4']),])
# dim(ngrams_4_freq[is.na(ngrams_4_freq['word3']),])
# dim(ngrams_4_freq[is.na(ngrams_4_freq['word2']),])
# dim(ngrams_4_freq[is.na(ngrams_4_freq['word1']),])
# ngrams_4_freq[is.na(ngrams_4_freq['P']),'P']
size_tetragram_freq <- format(object.size(ngrams_4_freq[,c(1:8)]), units = "Mb", standard = "auto", digits = 1L)
size_tetragram_P <- format(object.size(ngrams_4_freq), units = "Mb", standard = "auto", digits = 1L)
tetragram_matrix <- acast(ngrams_4_freq, trigram~word4, value.var="P")
# sum(tetragram_matrix[is.na(tetragram_matrix)])
size_tetragram_matrix <- format(object.size(tetragram_matrix),
units = "Mb", standard = "auto", digits = 1L)
# table(ngrams_4)['thanks for the follow']/table(ngrams_3)['thanks for the']
# ngrams_4_freq[ngrams_4_freq['trigram']=='thanks for the' & ngrams_4_freq['word4']=='follow','P']
# sum(ngrams_4_freq[ngrams_4_freq['trigram']=='thanks for the','P'])
# tetragram_matrix['thanks for the','follow']
text_training <- tolower(text_training)
ngrams_5 <- regmatches(text_training,gregexpr("([a-z']+ [a-z']+ [a-z']+ [a-z']+ [a-z']+)",text_training))[[1]]
ngrams_5 <- ngrams_5[!is.na(ngrams_5)]
ngrams_5_2 <- matrix(unlist(strsplit(ngrams_5," ")), ncol=5, byrow=TRUE)
ngrams_5 <- ngrams_5[(!ngrams_5_2[,1] %in% prophanities)&
(!ngrams_5_2[,2] %in% prophanities)&
(!ngrams_5_2[,3] %in% prophanities)&
(!ngrams_5_2[,4] %in% prophanities)&
(!ngrams_5_2[,5] %in% prophanities)&
(ngrams_5_2[,1] %in% english_dict)&
(ngrams_5_2[,2] %in% english_dict)&
(ngrams_5_2[,3] %in% english_dict)&
(ngrams_5_2[,4] %in% english_dict)&
(ngrams_5_2[,5] %in% english_dict)]
ngrams_5 <- ngrams_5[!is.na(ngrams_5)]
rm(ngrams_5_2)
total_ngrams_5 <- format(length(ngrams_5), big.mark=',')
unique_ngrams_5 <- format(length(unique(ngrams_5)), big.mark=',')
ngrams_5_freq <- data.frame(table(sapply(ngrams_5, paste, collapse = " ")))
names(ngrams_5_freq) <- c("pentagram","freq")
ngrams_5_freq['pentagram'] <- as.character(ngrams_5_freq$pentagram)
ngrams_5_freq['word1'] <- sapply(strsplit(as.character(ngrams_5_freq$pentagram)," "), `[`, 1)
ngrams_5_freq['word2'] <- sapply(strsplit(as.character(ngrams_5_freq$pentagram)," "), `[`, 2)
ngrams_5_freq['word3'] <- sapply(strsplit(as.character(ngrams_5_freq$pentagram)," "), `[`, 3)
ngrams_5_freq['word4'] <- sapply(strsplit(as.character(ngrams_5_freq$pentagram)," "), `[`, 4)
ngrams_5_freq['word5'] <- sapply(strsplit(as.character(ngrams_5_freq$pentagram)," "), `[`, 5)
ngrams_5_freq <- ngrams_5_freq[,c('pentagram','word1','word2','word3','word4','word5','freq')]
ngrams_5_freq <- ngrams_5_freq[order(ngrams_5_freq$freq,decreasing = TRUE),]
ngrams_5_freq['cum_freq'] <- cumsum(ngrams_5_freq$freq)/sum(ngrams_5_freq$freq)
coverage_50 <- ngrams_5_freq[ngrams_5_freq['cum_freq']<=0.5,]
coverage_90 <- ngrams_5_freq[ngrams_5_freq['cum_freq']<=0.9,]
coverage_95 <- ngrams_5_freq[ngrams_5_freq['cum_freq']<=0.95,]
coverage_50 <- format(dim(coverage_50)[1], nsmall=0, big.mark=',')
coverage_90 <- format(dim(coverage_90)[1], nsmall=0, big.mark=',')
coverage_95 <- format(dim(coverage_95)[1], nsmall=0, big.mark=',')
ngrams_5_freq[1:10,c("pentagram","freq")]
pruning <- 1:10
ngrams_5_number <- sapply(pruning, FUN=function(x) {
sum(ngrams_5_freq[ngrams_5_freq['freq']<x,'freq'])
})
compression <- sapply(ngrams_5_number, FUN=function(x) {
x/length(ngrams_5)
})
df_p <- data.frame(pruning,ngrams_5_number,compression)
df_p[,'pruning'] <- sapply(df_p[,'pruning'], FUN=function(x){
paste("<",as.character(x))
})
names(df_p) <- c("freq 5-grams removed","N","reduction")
df_p
ngrams_5_freq <- ngrams_5_freq[ngrams_5_freq['freq']>1,]
ngrams_5_freq['tetragram'] <- paste(ngrams_5_freq$word1, ngrams_5_freq$word2,
ngrams_5_freq$word3, ngrams_5_freq$word4, sep=" ")
#see comment in the 2-gram section regarding this
# pentagram_matrix <- merge(ngrams_5_freq[,c('tetragram','word5','freq')],
# ngrams_4_freq[,c('tetragram','freq')],
# by.x="tetragram",
# by.y="tetragram",
# all=TRUE)
aggregate_4_gram <- aggregate(ngrams_5_freq$freq, by=list(Category=ngrams_5_freq$tetragram), FUN=sum)
ngrams_5_freq <- merge(ngrams_5_freq,
aggregate_4_gram,
by.x='tetragram',
by.y='Category')
names(ngrams_5_freq)[10] <- c('tetragram_freq')
ngrams_5_freq['P'] <- ngrams_5_freq['freq']/ngrams_5_freq['tetragram_freq']
# dim(ngrams_5_freq[is.na(ngrams_5_freq['word5']),])
# dim(ngrams_5_freq[is.na(ngrams_5_freq['word4']),])
# dim(ngrams_5_freq[is.na(ngrams_5_freq['word3']),])
# dim(ngrams_5_freq[is.na(ngrams_5_freq['word2']),])
# dim(ngrams_5_freq[is.na(ngrams_5_freq['word1']),])
# ngrams_5_freq[is.na(ngrams_5_freq['P']),'P']
size_pentagram_freq <- format(object.size(ngrams_5_freq[,c(1:9)]), units = "Mb", standard = "auto", digits = 1L)
size_pentagram_P <- format(object.size(ngrams_5_freq), units = "Mb", standard = "auto", digits = 1L)
pentagram_matrix <- acast(ngrams_5_freq, tetragram~word5, value.var="P")
# sum(pentagram_matrix[is.na(pentagram_matrix)])
size_pentagram_matrix <- format(object.size(pentagram_matrix),
units = "Mb", standard = "auto", digits = 1L)
# table(ngrams_5)['thank you so much for']/table(ngrams_4)['thank you so much']
# ngrams_5_freq[ngrams_5_freq['tetragram']=='thank you so much' & ngrams_5_freq['word5']=='for','P']
# sum(ngrams_5_freq[ngrams_5_freq['tetragram']=='thank you so much','P'])
# pentagram_matrix['thank you so much','for']
df <- data.frame(n=c(length(ngrams_1_freq$ngram),
length(ngrams_2_freq$bigram),
length(ngrams_3_freq$trigram),
length(ngrams_4_freq$tetragram),
length(ngrams_5_freq$pentagram)),
s1=c(format(object.size(ngrams_1_freq[,1:2]), units = "Mb", standard = "auto", digits = 1L),
format(object.size(ngrams_2_freq[,1:5]), units = "Mb", standard = "auto", digits = 1L),
format(object.size(ngrams_3_freq[,1:7]), units = "Mb", standard = "auto", digits = 1L),
format(object.size(ngrams_4_freq[,1:8]), units = "Mb", standard = "auto", digits = 1L),
format(object.size(ngrams_5_freq[,1:9]), units = "Mb", standard = "auto", digits = 1L)),
s2=c(format(object.size(ngrams_1_freq), units = "Mb", standard = "auto", digits = 1L),
format(object.size(ngrams_2_freq), units = "Mb", standard = "auto", digits = 1L),
format(object.size(ngrams_3_freq), units = "Mb", standard = "auto", digits = 1L),
format(object.size(ngrams_4_freq), units = "Mb", standard = "auto", digits = 1L),
format(object.size(ngrams_5_freq), units = "Mb", standard = "auto", digits = 1L)),
s3=c(NA,
format(object.size(bigram_matrix), units = "Mb", standard = "auto", digits = 1L),
format(object.size(trigram_matrix), units = "Mb", standard = "auto", digits = 1L),
size_tetragram_matrix,
size_pentagram_matrix))
rownames(df) <- c("1-gram","2-gram","3-gram","4-gram","5-gram")
colnames(df) <- c("ngrams N",
"Freq dataframe",
"P dataframe",
"Full P matrix")
df
rm(bigram_matrix)
rm(trigram_matrix)
rm(tetragram_matrix)
rm(pentagram_matrix)
set.seed(1234)
testing_dataset_small <- sample(testing_dataset,length(testing_dataset)*0.0001)
testing_lines <- format(length(testing_dataset_small), big.mark = ',')
testing_words <- sapply(testing_dataset_small, FUN=function(x){
length(strsplit(x," ")[[1]])
})
testing_words <- format(sum(testing_words), big.mark = ',')
predictions <- sapply(testing_dataset_small, FUN=function(x) {
x <- tolower(x)
words <- unlist(strsplit(x, " "))
if (length(words)>1) {
results <- c()
for (i in 1:(length(words)-1)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
start <- Sys.time()
prediction_df <- ngrams_2_freq[ngrams_2_freq['word1']==word1,]
prediction_df <- prediction_df[order(-prediction_df$P),'word2']
prediction_3words <- prediction_df[1:3]
prediction_2words <- prediction_df[1:2]
prediction_1words <- prediction_df[1]
end <- Sys.time()
t <- as.numeric(end - start)
if (!is.na(word2)) {
results <- c(results,
word2 %in% prediction_3words,
word2 %in% prediction_2words,
word2 %in% prediction_1words,
is.na(prediction_3words[1])&is.na(prediction_3words[2])&is.na(prediction_3words[3]),
NA,
t)
}
results
}
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean))
results2_means
}
})
predictions <- matrix(unlist(predictions), ncol=6, byrow=TRUE)
predictions_accuracy_means_2_gram <- sapply(data.frame(predictions), FUN=mean)
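# Same evaluation with the 3-gram model: predict the third word from the two preceding words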
predictions <- sapply(testing_dataset_small, FUN=function(x) {
x <- tolower(x)
words <- unlist(strsplit(x, " "))
if (length(words)>2) {
results <- c()
for (i in 1:(length(words)-2)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
word3 <- gsub("[^a-z']","",words[i+2])
start <- Sys.time()
prediction_df <- ngrams_3_freq[ngrams_3_freq['word1']==word1 &
ngrams_3_freq['word2']==word2,]
prediction_df <- prediction_df[order(-prediction_df$P),'word3']
prediction_3words <- prediction_df[1:3]
prediction_2words <- prediction_df[1:2]
prediction_1words <- prediction_df[1]
end <- Sys.time()
t <- as.numeric(end - start)
if (!is.na(word3)) {
results <- c(results,
word3 %in% prediction_3words,
word3 %in% prediction_2words,
word3 %in% prediction_1words,
is.na(prediction_3words[1])&is.na(prediction_3words[2])&is.na(prediction_3words[3]),
NA,
t)
}
results
}
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean))
results2_means
}
})
predictions <- matrix(unlist(predictions), ncol=6, byrow=TRUE)
predictions_accuracy_means_3_gram <- sapply(data.frame(predictions), FUN=mean)
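# 4-gram model: predict the fourth word from the three preceding words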
predictions <- sapply(testing_dataset_small, FUN=function(x) {
x <- tolower(x)
words <- unlist(strsplit(x, " "))
if (length(words)>3) {
results <- c()
for (i in 1:(length(words)-3)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
word3 <- gsub("[^a-z']","",words[i+2])
word4 <- gsub("[^a-z']","",words[i+3])
start <- Sys.time()
prediction_df <- ngrams_4_freq[ngrams_4_freq['word1']==word1 &
ngrams_4_freq['word2']==word2 &
ngrams_4_freq['word3']==word3,]
prediction_df <- prediction_df[order(-prediction_df$P),'word4']
prediction_3words <- prediction_df[1:3]
prediction_2words <- prediction_df[1:2]
prediction_1words <- prediction_df[1]
end <- Sys.time()
t <- as.numeric(end - start)
if (!is.na(word4)) {
results <- c(results,
word4 %in% prediction_3words,
word4 %in% prediction_2words,
word4 %in% prediction_1words,
is.na(prediction_3words[1])&is.na(prediction_3words[2])&is.na(prediction_3words[3]),
NA,
t)
}
results
}
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean))
results2_means
}
})
predictions <- matrix(unlist(predictions), ncol=6, byrow=TRUE)
predictions_accuracy_means_4_gram <- sapply(data.frame(predictions), FUN=mean)
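# 5-gram model: predict the fifth word from the four preceding words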
predictions <- sapply(testing_dataset_small, FUN=function(x) {
x <- tolower(x)
words <- unlist(strsplit(x, " "))
if (length(words)>4) {
results <- c()
for (i in 1:(length(words)-4)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
word3 <- gsub("[^a-z']","",words[i+2])
word4 <- gsub("[^a-z']","",words[i+3])
word5 <- gsub("[^a-z']","",words[i+4])
start <- Sys.time()
prediction_df <- ngrams_5_freq[ngrams_5_freq['word1']==word1 &
ngrams_5_freq['word2']==word2 &
ngrams_5_freq['word3']==word3 &
ngrams_5_freq['word4']==word4,]
prediction_df <- prediction_df[order(-prediction_df$P),'word5']
prediction_3words <- prediction_df[1:3]
prediction_2words <- prediction_df[1:2]
prediction_1words <- prediction_df[1]
end <- Sys.time()
t <- as.numeric(end - start)
if (!is.na(word5)) {
results <- c(results,
word5 %in% prediction_3words,
word5 %in% prediction_2words,
word5 %in% prediction_1words,
is.na(prediction_3words[1])&is.na(prediction_3words[2])&is.na(prediction_3words[3]),
NA,
t)
}
results
}
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean))
results2_means
}
})
predictions <- matrix(unlist(predictions), ncol=6, byrow=TRUE)
predictions_accuracy_means_5_gram <- sapply(data.frame(predictions), FUN=mean)
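# Summary of the plain MLE n-gram models (perplexity is not computed here, hence the NA column)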
results_final <- data.frame(matrix(c(predictions_accuracy_means_2_gram,
predictions_accuracy_means_3_gram,
predictions_accuracy_means_4_gram,
predictions_accuracy_means_5_gram), ncol=6, byrow=TRUE))
results_final[,4] <- 1 - results_final[,4]
results_final[,6] <- round(results_final[,6],5)
results_final[,1:5] <- round(results_final[,1:5],2)
colnames(results_final) <- c("Accuracy 3w","Accuracy 2w","Accuracy 1w","Coverage","Perplexity","Time (sec)")
rownames(results_final) <- c("2-grams","3-grams","4-grams","5-grams")
results_final
#this chunk was run in a separate window; the results were then loaded and processed in the next chunk
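# Linear interpolation: P = w1*P_1gram + w2*P_2gram + w3*P_3gram.
# The weights are tuned by random search: w1 and w2 are drawn uniformly, w3 = 1 - w1 - w2,
# combinations with w3 <= 0 are discarded, and each remaining combination is scored by its
# top-3 accuracy on a small random sample of trigrams.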
set.seed(1234)
training_dataset_2 <- ngrams_3[sample(length(ngrams_3),length(ngrams_3)*0.0001)]
results <- data.frame(trigrams=training_dataset_2)
w1 <- sample(seq(0.0000001,1,0.0000001),1000,replace=TRUE)
w2 <- sample(seq(0.0000001,1,0.0000001),1000,replace=TRUE)
w3 <- 1 - w1 - w2
weights <- data.frame(w1,w2,w3)
weights <- weights[weights['w3']>0,]
for (w in rownames(weights)) {
col_results <- c()
for (i in training_dataset_2) {
# i <- "of parenting looks"
# w<-1
words <- unlist(strsplit(tolower(i)," "))
head_bigram <- paste(words[1],words[2],sep=" ")
if (head_bigram %in% ngrams_3_freq$bigram) {
p_3 <- ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram,"P"]
names(p_3) <- ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram,"word3"]
p_3 <- sort(p_3)
names(p_3) <- paste(head_bigram,names(p_3),sep=" ")
} else {
p_3 <- rep(0, dim(ngrams_3_freq)[1])
names(p_3) <- paste(head_bigram,ngrams_3_freq$word3,sep=" ")
}
if (words[2] %in% ngrams_2_freq$word1) {
p_2 <- ngrams_2_freq[ngrams_2_freq['word1']==words[2],"P"]
names(p_2) <- ngrams_2_freq[ngrams_2_freq['word1']==words[2],"word2"]
p_2 <- sort(p_2)
names(p_2) <- paste(words[1],words[2],names(p_2),sep=" ")
} else {
p_2 <- rep(0, dim(ngrams_2_freq)[1])
names(p_2) <- paste(words[1],words[2],ngrams_2_freq$word2,sep=" ")
}
p_1 <- rep(1/dim(ngrams_1_freq)[1], dim(ngrams_1_freq)[1])
names(p_1) <- paste(words[1],words[2],ngrams_1_freq$ngram)
p_1 <- p_1 * weights[w,'w1']
p_2 <- p_2 * weights[w,'w2']
p_3 <- p_3 * weights[w,'w3']
final_p <- p_1
m <- match(names(p_2), names(p_1))
final_p[m] <- final_p[m] + p_2
m <- match(names(p_3), names(final_p))
final_p[m] <- final_p[m] + p_3
final_p <- sort(final_p, decreasing=TRUE)[1:10]
names(final_p) <- sapply(names(final_p), FUN=function(x){
strsplit(x," ")[[1]][length(strsplit(x," ")[[1]])]
})
predictions <- data.frame(prediction=names(final_p), p=final_p)
predictions['prediction'] <- as.character(predictions$prediction)
predictions <- predictions[!duplicated(predictions$prediction),]
predictions <- predictions[1:3,]
if (words[3] %in% predictions$prediction) {
col_results <- c(col_results,1)
} else {
col_results <- c(col_results,0)
}
# if (length(col_results)%%10==0) {print(length(col_results))}
}
results[,w] <- col_results
print(w)
}
results['means',2:dim(results)[2]] <- lapply(results[,2:dim(results)[2]], FUN=mean)
weights <- rbind(c(NA,NA,NA),weights)
weights2 <- t(weights)
colnames(weights2) <- names(results)
results <- rbind(results,weights2)
write.csv(results,"interpolation_results_final.csv", row.names = TRUE)
results <- read.csv("interpolation_results_final.csv")
rownames(results) <- results$X
results <- results[,2:dim(results)[2]]
par(mfrow=c(1,2))
plot(as.numeric(results['means',]),
xlab='Index unsorted weight combinations',
ylab='Accuracy on 3rd word')
plot(sort(as.numeric(results['means',])),
xlab='Index sorted weight combinations',
ylab='Accuracy on 3rd word')
means <- unlist(results['means',])
max_accuracy <- max(means[-1])
columns <- c(FALSE, means==max_accuracy)[-2]
max_Accuracy_weights <- t(results[c("means","w1","w2","w3"),columns])
highest_weights <- dim(max_Accuracy_weights)[1]
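# Worked example: interpolated probability of the trigram "i will be" using the
# best-performing weight combination found above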
# Inputs
final_weights <- max_Accuracy_weights[1,c("w1","w2","w3")]
text <- tolower("I will be")
words <- unlist(strsplit(tolower(text)," "))
# Probabilities calculated using the 1-, 2- and 3-gram models
p_trigrams <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1], words[2], sep=" ") &
ngrams_3_freq['word3']==words[3], "P"]
p_bigrams <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] &
ngrams_2_freq['word2']==words[3], "P"]
p_unigrams <- 1/length(dictionary)
# Weighted probabilities for linear interpolation
p_unigrams_w <- p_unigrams * final_weights[1]
p_bigrams_w <- p_bigrams * final_weights[2]
p_trigrams_w <- p_trigrams * final_weights[3]
# Final probability
final_p <- p_unigrams_w + p_bigrams_w + p_trigrams_w
# Output table
values <- unname(c(p_unigrams,p_bigrams,p_trigrams,final_weights,
p_unigrams_w,p_bigrams_w,p_trigrams_w,
NA,NA,final_p))
matrix <- matrix(values, ncol=3, byrow=TRUE)
colnames(matrix) <- c("1-gram model","2-gram model","3-gram model")
rownames(matrix) <- c("P","weights","weighted P","total P")
round(matrix,3)
predictions <- sapply(testing_dataset_small, FUN=function(x) {
x <- tolower(x)
words <- unlist(strsplit(x, " "))
if (length(words)>3) {
results <- c()
for (i in 1:(length(words)-3)) {
# words <- c('i','will','be')
# i<-1
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
word3 <- gsub("[^a-z']","",words[i+2])
head_bigram <- paste(word1, word2, sep=" ")
start <- Sys.time()
if (head_bigram %in% ngrams_3_freq$bigram) {
p_3 <- ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram,"P"]
names(p_3) <- ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram,"word3"]
p_3 <- sort(p_3)
names(p_3) <- paste(head_bigram,names(p_3),sep=" ")
} else {
p_3 <- rep(0, dim(ngrams_3_freq)[1])
names(p_3) <- paste(head_bigram,ngrams_3_freq$word3,sep=" ")
}
if (word2 %in% ngrams_2_freq$word1) {
p_2 <- ngrams_2_freq[ngrams_2_freq['word1']==word2,"P"]
names(p_2) <- ngrams_2_freq[ngrams_2_freq['word1']==word2,"word2"]
p_2 <- sort(p_2)
names(p_2) <- paste(head_bigram,names(p_2),sep=" ")
} else {
p_2 <- rep(0, dim(ngrams_2_freq)[1])
names(p_2) <- paste(head_bigram,ngrams_2_freq$word2,sep=" ")
}
p_1 <- rep(1/length(dictionary), length(dictionary))
names(p_1) <- paste(head_bigram,dictionary)
p_1 <- p_1 * final_weights['w1']
p_2 <- p_2 * final_weights['w2']
p_3 <- p_3 * final_weights['w3']
final_p <- p_1
m <- match(names(p_2), names(p_1))
final_p[m] <- final_p[m] + p_2
m <- match(names(p_3), names(final_p))
final_p[m] <- final_p[m] + p_3
trigram_probability <- final_p[paste(word1,word2,word3,sep=" ")]
if (is.na(trigram_probability)) {
trigram_probability <- 1/length(dictionary) * final_weights['w1']
}
final_p <- sort(final_p, decreasing=TRUE)[1:10]
names(final_p) <- sapply(names(final_p), FUN=function(x){
strsplit(x," ")[[1]][length(strsplit(x," ")[[1]])]
})
end <- Sys.time()
t <- as.numeric(end - start)
if (!is.na(word3)) {
results <- c(results,
word3 %in% names(final_p[1:3]),
word3 %in% names(final_p[1:2]),
word3 %in% names(final_p[1]),
unique(!is.na(unique(final_p))),
as.numeric(1/trigram_probability),
t)
}
results
}
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean))
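# Perplexity is summarised as the geometric mean of 1/P over all evaluated trigrams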
results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
results2_means
}
})
predictions <- matrix(unlist(predictions), ncol=6, byrow=TRUE)
predictions_accuracy_means_interpolated <- sapply(data.frame(predictions), FUN=mean)
predictions_accuracy_means_interpolated[5] <- exp(sum(log(predictions[,5]))/length(predictions[,5]))
results_final_liniter <- data.frame(t(predictions_accuracy_means_interpolated))
results_final_liniter[,6] <- round(results_final_liniter[,6],5)
results_final_liniter[,1:5] <- round(results_final_liniter[,1:5],3)
results_final_liniter[,5] <- round(results_final_liniter[,5],0)
colnames(results_final_liniter) <- c("Accuracy 3w","Accuracy 2w","Accuracy 1w",
"Coverage","Perplexity","Time (sec)")
rownames(results_final_liniter) <- c("Linear int.")
results_final_liniter
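# Add-1 (Laplace) smoothing: observed n-grams get P = (count + 1) / (count(context) + V),
# while continuations never observed after the context are assigned 1/V,
# where V is the number of distinct 1-grams in the training dataset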
aggregate_2_grams <- aggregate(ngrams_2_freq$freq, by=list(Category=ngrams_2_freq$word1), FUN=sum)
get_bigrams_p_laplace_smoothing <- function(x) {
words <- unlist(strsplit(tolower(x)," ")[1])
if (length(words)>1 & (!"" %in% words)) {
head_word <- words[(length(words)-1)]
tail_word <- words[length(words)]
all_gram_length <- length(ngrams_1_freq$ngram)
all_bigrams <- paste(head_word,ngrams_1_freq$ngram, sep=" ")
#Named vector of unobserved 2-grams
freq_0_bigrams <- all_bigrams[!all_bigrams %in% ngrams_2_freq$bigram]
freq_0_bigrams_named_vector <- rep(0,length(freq_0_bigrams))
names(freq_0_bigrams_named_vector) <- freq_0_bigrams
#Named vector of observed 2-grams
freq_non_0_bigrams <- as.character(ngrams_2_freq[ngrams_2_freq['word1']==head_word, 'bigram'])
freq_non_0_bigrams_vector <- ngrams_2_freq[ngrams_2_freq['word1']==head_word, 'freq']
names(freq_non_0_bigrams_vector) <- freq_non_0_bigrams
#Calculate the named vector of sorted probabilities according to the MLE estimate
#with add-1 smoothing (Laplace smoothing)
freq_0_bigrams_named_vector_p <- (freq_0_bigrams_named_vector + 1)/all_gram_length
freq_non_0_bigrams_vector_p <- (freq_non_0_bigrams_vector+1)/
(aggregate_2_grams[aggregate_2_grams['Category']==head_word,'x'] +
all_gram_length)
final_p <- sort(c(freq_0_bigrams_named_vector_p,
freq_non_0_bigrams_vector_p),decreasing=TRUE)
if (!is.na(final_p[paste(head_word, tail_word, sep=" ")])) {
perplexity <- 1/final_p[paste(head_word, tail_word, sep=" ")]
} else {
perplexity <- length(dictionary)
}
list(final_p, perplexity)
} else {
list(NA,NA)
}
}
example <- "will be"
results <- get_bigrams_p_laplace_smoothing(example)
results[[1]]['will be']
get_trigrams_p_laplace_smoothing <- function(x) {
words <- unlist(strsplit(tolower(x)," ")[1])
if (length(words)>2 & (!"" %in% words)) {
head_bigram <- paste(words[(length(words)-2)], words[(length(words)-1)], sep=" ")
tail_word <- words[length(words)]
all_gram_length <- length(ngrams_1_freq$ngram)
all_trigrams <- paste(head_bigram,ngrams_1_freq$ngram, sep=" ")
#Named vector of unobserved 3-grams
freq_0_trigrams <- all_trigrams[!all_trigrams %in% ngrams_3_freq$trigram]
freq_0_trigrams_named_vector <- rep(0,length(freq_0_trigrams))
names(freq_0_trigrams_named_vector) <- freq_0_trigrams
freq_0_trigrams_named_vector_p <- (freq_0_trigrams_named_vector + 1)/all_gram_length
#Named vector of observed 3-grams
freq_non_0_trigrams <- as.character(ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram, 'trigram'])
freq_non_0_trigrams_vector <- ngrams_3_freq[ngrams_3_freq['bigram']==head_bigram, 'freq']
names(freq_non_0_trigrams_vector) <- freq_non_0_trigrams
freq_non_0_trigrams_vector_p <- (freq_non_0_trigrams_vector+1)/
(aggregate_3_grams[aggregate_3_grams['Category']==head_bigram,'x'] +
all_gram_length)
#Calculate the named vector of sorted probabilities according to the MLE estimate
#with add-1 smoothing (Laplace smoothing)
final_p <- sort(c(freq_0_trigrams_named_vector_p,
freq_non_0_trigrams_vector_p),decreasing=TRUE)
perplexity <- 1/final_p[paste(head_bigram, tail_word, sep=" ")]
list(final_p, perplexity)
} else {
list(NA,NA)
}
}
get_tetragrams_p_laplace_smoothing <- function(x) {
words <- unlist(strsplit(tolower(x)," ")[1])
if (length(words)>3 & (!"" %in% words)) {
head_trigram <- paste(words[(length(words)-3)], words[(length(words)-2)], words[(length(words)-1)], sep=" ")
tail_word <- words[length(words)]
all_gram_length <- length(ngrams_1_freq$ngram)
all_tetragrams <- paste(head_trigram,ngrams_1_freq$ngram, sep=" ")
#Named vector of unobserved 4-grams
freq_0_tetragrams <- all_tetragrams[!all_tetragrams %in% ngrams_4_freq$tetragram]
freq_0_tetragrams_named_vector <- rep(0,length(freq_0_tetragrams))
names(freq_0_tetragrams_named_vector) <- freq_0_tetragrams
freq_0_tetragrams_named_vector_p <- (freq_0_tetragrams_named_vector + 1)/all_gram_length
#Named vector of observed 4-grams
freq_non_0_tetragrams <- as.character(ngrams_4_freq[ngrams_4_freq['trigram']==head_trigram, 'tetragram'])
freq_non_0_tetragrams_vector <- ngrams_4_freq[ngrams_4_freq['trigram']==head_trigram, 'freq']
names(freq_non_0_tetragrams_vector) <- freq_non_0_tetragrams
freq_non_0_tetragrams_vector_p <- (freq_non_0_tetragrams_vector+1)/
(aggregate_4_grams[aggregate_4_grams['Category']==head_trigram,'x'] +
all_gram_length)
#Calculate the named vector of sorted probabilities according to the MLE estimate
#with add-1 smoothing (Laplace smoothing)
final_p <- sort(c(freq_0_tetragrams_named_vector_p,
freq_non_0_tetragrams_vector_p),decreasing=TRUE)
perplexity <- 1/final_p[paste(head_trigram, tail_word, sep=" ")]
list(final_p, perplexity)
} else {
list(NA,NA)
}
}
get_pentagrams_p_laplace_smoothing <- function(x) {
words <- unlist(strsplit(tolower(x)," ")[1])
if (length(words)>4 & (!"" %in% words)) {
head_tetragram <- paste(words[(length(words)-4)],
words[(length(words)-3)],
words[(length(words)-2)],
words[(length(words)-1)], sep=" ")
tail_word <- words[length(words)]
all_gram_length <- length(ngrams_1_freq$ngram)
all_pentagrams <- paste(head_tetragram,ngrams_1_freq$ngram, sep=" ")
#Named vector of unobserved 5-grams
freq_0_pentagrams <- all_pentagrams[!all_pentagrams %in% ngrams_5_freq$pentagram]
freq_0_pentagrams_named_vector <- rep(0,length(freq_0_pentagrams))
names(freq_0_pentagrams_named_vector) <- freq_0_pentagrams
freq_0_pentagrams_named_vector_p <- (freq_0_pentagrams_named_vector + 1)/all_gram_length
#Named vector of observed 5-grams
freq_non_0_pentagrams <- as.character(
ngrams_5_freq[ngrams_5_freq['tetragram']==head_tetragram, 'pentagram']
)
freq_non_0_pentagrams_vector <- ngrams_5_freq[ngrams_5_freq['tetragram']==head_tetragram, 'freq']
names(freq_non_0_pentagrams_vector) <- freq_non_0_pentagrams
freq_non_0_pentagrams_vector_p <- (freq_non_0_pentagrams_vector+1)/
(aggregate_5_grams[aggregate_5_grams['Category']==head_tetragram,'x']+
all_gram_length)
#Calculate the named vector of sorted probabilities according to the MLE estimate
#with add-1 smoothing (Laplace smoothing)
final_p <- sort(c(freq_0_pentagrams_named_vector_p,
freq_non_0_pentagrams_vector_p),decreasing=TRUE)
perplexity <- 1/final_p[paste(head_tetragram, tail_word, sep=" ")]
list(final_p, perplexity)
} else {
list(NA,NA)
}
}
aggregate_2_grams <- aggregate(ngrams_2_freq$freq, by=list(Category=ngrams_2_freq$word1), FUN=sum)
aggregate_3_grams <- aggregate(ngrams_3_freq$freq, by=list(Category=ngrams_3_freq$bigram), FUN=sum)
aggregate_4_grams <- aggregate(ngrams_4_freq$freq, by=list(Category=ngrams_4_freq$trigram), FUN=sum)
aggregate_5_grams <- aggregate(ngrams_5_freq$freq, by=list(Category=ngrams_5_freq$pentagram), FUN=sum)
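# Context-count tables (total frequency of each n-gram prefix), used as denominators
# in the add-1 formulas above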
# example <- "i will be"
# results <- get_trigrams_p_laplace_smoothing(example)
# example <- "thank you so much"
# results <- get_tetragrams_p_laplace_smoothing(example)
# example <- "thank you so much for"
# results <- get_pentagrams_p_laplace_smoothing(example)
# results
predictions_2_grams <- sapply(testing_dataset_small, FUN=function(x) {
words <- unlist(strsplit(tolower(x), " "))
if (length(words)>2) {
results <- c()
for (i in 1:(length(words)-1)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
start <- Sys.time()
data <- get_bigrams_p_laplace_smoothing(paste(word1,
word2,sep=" "))
end <- Sys.time()
t <- as.numeric(end - start)
predicted_words <- sapply(names(data[[1]]), FUN=function(x){
predicted_words <- strsplit(x," ")[[1]]
predicted_last_word <- predicted_words[length(predicted_words)]
predicted_last_word
})
if (!is.na(word2) & (!"" %in% words)) {
if (is.na(data[[2]])){data[[2]]<-1/length(dictionary)}
results <- c(results,
word2 %in% predicted_words[1:3],
word2 %in% predicted_words[1:2],
word2 %in% predicted_words[1],
unique(!is.na(data[[1]])),
data[[2]],
t)
}
}
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean, na.rm=TRUE))
results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
results2_means
}
})
predictions_2_grams <- matrix(unlist(predictions_2_grams), ncol=6, byrow=TRUE)
predictions_2_grams_means <- sapply(data.frame(predictions_2_grams), FUN=mean, na.rm=TRUE)
predictions_2_grams_means[5] <- exp(sum(log(predictions_2_grams[,5]))/length(predictions_2_grams[,5]))
predictions_3_grams <- sapply(testing_dataset_small, FUN=function(x) {
words <- unlist(strsplit(tolower(x), " "))
if (length(words)>3) {
results <- c()
for (i in 1:(length(words)-2)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
word3 <- gsub("[^a-z']","",words[i+2])
start <- Sys.time()
data <- get_trigrams_p_laplace_smoothing(paste(word1,
word2,
word3,sep=" "))
end <- Sys.time()
t <- as.numeric(end - start)
predicted_words <- sapply(names(data[[1]]), FUN=function(x){
predicted_words <- strsplit(x," ")[[1]]
predicted_last_word <- predicted_words[length(predicted_words)]
predicted_last_word
})
if (!is.na(word3) & (!"" %in% words)) {
if (is.na(data[[2]])){data[[2]]<-1/length(dictionary)}
results <- c(results,
word3 %in% predicted_words[1:3],
word3 %in% predicted_words[1:2],
word3 %in% predicted_words[1],
unique(!is.na(data[[1]])),
data[[2]],
t)
}
results
}
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean, na.rm=TRUE))
results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
results2_means
}
})
predictions_3_grams <- matrix(unlist(predictions_3_grams), ncol=6, byrow=TRUE)
predictions_3_grams_means <- sapply(data.frame(predictions_3_grams), FUN=mean, na.rm=TRUE)
predictions_3_grams_means[5] <- exp(sum(log(predictions_3_grams[,5]))/length(predictions_3_grams[,5]))
predictions_4_grams <- sapply(testing_dataset_small, FUN=function(x) {
words <- unlist(strsplit(tolower(x), " "))
if (length(words)>4) {
results <- c()
for (i in 1:(length(words)-3)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
word3 <- gsub("[^a-z']","",words[i+2])
word4 <- gsub("[^a-z']","",words[i+3])
start <- Sys.time()
data <- get_tetragrams_p_laplace_smoothing(paste(word1,
word2,
word3,
word4,sep=" "))
end <- Sys.time()
t <- as.numeric(end - start)
predicted_words <- sapply(names(data[[1]]), FUN=function(x){
predicted_words <- strsplit(x," ")[[1]]
predicted_last_word <- predicted_words[length(predicted_words)]
predicted_last_word
})
if (!is.na(word4) & (!"" %in% words)) {
if (is.na(data[[2]])){data[[2]]<-1/length(dictionary)}
results <- c(results,
word4 %in% predicted_words[1:3],
word4 %in% predicted_words[1:2],
word4 %in% predicted_words[1],
unique(!is.na(data[[1]])),
data[[2]],
t)
}
results
}
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean, na.rm=TRUE))
results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
results2_means
}
})
predictions_4_grams <- matrix(unlist(predictions_4_grams), ncol=6, byrow=TRUE)
predictions_4_grams_means <- sapply(data.frame(predictions_4_grams), FUN=mean, na.rm=TRUE)
predictions_4_grams_means[5] <- exp(sum(log(predictions_4_grams[,5]))/length(predictions_4_grams[,5]))
predictions_5_grams <- sapply(testing_dataset_small, FUN=function(x) {
words <- unlist(strsplit(tolower(x), " "))
if (length(words)>5) {
results <- c()
for (i in 1:(length(words)-4)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
word3 <- gsub("[^a-z']","",words[i+2])
word4 <- gsub("[^a-z']","",words[i+3])
word5 <- gsub("[^a-z']","",words[i+4])
start <- Sys.time()
data <- get_pentagrams_p_laplace_smoothing(paste(word1,
word2,
word3,
word4,
word5,sep=" "))
end <- Sys.time()
t <- as.numeric(end - start)
predicted_words <- sapply(names(data[[1]]), FUN=function(x){
predicted_words <- strsplit(x," ")[[1]]
predicted_last_word <- predicted_words[length(predicted_words)]
predicted_last_word
})
if (!is.na(word5) & (!"" %in% words)) {
if (is.na(data[[2]])){data[[2]]<-1/length(dictionary)}
results <- c(results,
word5 %in% predicted_words[1:3],
word5 %in% predicted_words[1:2],
word5 %in% predicted_words[1],
unique(!is.na(data[[1]])),
data[[2]],
t)
}
results
}
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean, na.rm=TRUE))
results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
results2_means
}
})
predictions_5_grams <- matrix(unlist(predictions_5_grams), ncol=6, byrow=TRUE)
predictions_5_grams_means <- sapply(data.frame(predictions_5_grams), FUN=mean, na.rm=TRUE)
predictions_5_grams_means[5] <- exp(sum(log(predictions_5_grams[,5]))/length(predictions_5_grams[,5]))
results_final_laplace <- data.frame(rbind(predictions_5_grams_means,
predictions_4_grams_means,
predictions_3_grams_means,
predictions_2_grams_means))
results_final_laplace[,6] <- round(results_final_laplace[,6],5)
results_final_laplace[,1:5] <- round(results_final_laplace[,1:5],3)
results_final_laplace[,5] <- round(results_final_laplace[,5],0)
colnames(results_final_laplace) <- c("Accuracy 3w",
"Accuracy 2w",
"Accuracy 1w",
"Coverage",
"Perplexity",
"Time (sec)")
rownames(results_final_laplace) <- c("Laplace 5g",
"Laplace 4g",
"Laplace 3g",
"Laplace 2g")
results_final_laplace
# Stupid back-off (SBO): start from the highest-order n-gram model available and back off to lower orders, multiplying the probability by a fixed 0.4 penalty at each step
example1 <- "Thank you so much for this"
example2 <- "I will be a mechanic"
stupid_back_off <- function(x) {
words <- strsplit(tolower(x)," ")[[1]]
#calculates probability based on 5-gram model
p <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(words[1:4], collapse=" ") &
ngrams_5_freq['word5'] == words[5], 'P']
#back-off to 4-gram model
if (length(p)==0) {
p <- 0.4*ngrams_4_freq[ngrams_4_freq['trigram']==paste(words[2:4], collapse=" ") &
ngrams_4_freq['word4'] == words[5], 'P']
}
#back-off to 3-gram model
if (length(p)==0) {
p <- 0.4*0.4*ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[3:4], collapse=" ") &
ngrams_3_freq['word3'] == words[5], 'P']
}
#back-off to 2-gram model
if (length(p)==0) {
p <- 0.4*0.4*0.4*ngrams_2_freq[ngrams_2_freq['word1']==words[4] &
ngrams_2_freq['word2'] == words[5], 'P']
}
#back-off to 1-gram model
if (length(p)==0) {
p <- 0.4*0.4*0.4*0.4*(1/length(dictionary))
}
p
}
stupid_back_off(example1)
stupid_back_off(example2)
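# A minimal sketch (not the evaluation code used below) of how the same back-off logic
# could be turned into a ranked list of candidate next words rather than the probability
# of one known continuation. It assumes the ngrams_3_freq / ngrams_2_freq data frames and
# the dictionary vector built earlier in this report; the function name predict_sbo_top3
# and the penalty argument are illustrative choices, not part of the final app.
predict_sbo_top3 <- function(x, penalty = 0.4) {
  words <- strsplit(tolower(x), " ")[[1]]
  L <- length(words)
  # candidate continuations from the 3-gram model, conditioned on the last two input words
  hits3 <- ngrams_3_freq[ngrams_3_freq$bigram == paste(words[(L-1):L], collapse = " "), ]
  p <- setNames(hits3$P, as.character(hits3$word3))
  # back off to the 2-gram model (penalised once) for words not already proposed
  hits2 <- ngrams_2_freq[ngrams_2_freq$word1 == words[L] &
                           !ngrams_2_freq$word2 %in% names(p), ]
  p <- c(p, setNames(hits2$P * penalty, as.character(hits2$word2)))
  # uniform 1-gram fall-back (penalised twice) for the remaining dictionary words
  rest <- dictionary[!dictionary %in% names(p)]
  p <- c(p, setNames(rep(penalty^2 / length(dictionary), length(rest)), rest))
  names(sort(p, decreasing = TRUE))[1:3]
}
# predict_sbo_top3("i will")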
predictions_SBO_5 <- sapply(testing_dataset_small, FUN=function(x) {
x <- tolower(x)
words <- unlist(strsplit(x, " "))
word1 <- NA
word2 <- NA
word3 <- NA
word4 <- NA
word5 <- NA
results <- c()
if (length(words)>2) {
for (i in 1:(length(words)-2)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
word3 <- gsub("[^a-z']","",words[i+2])
word4 <- gsub("[^a-z']","",words[i+3])
word5 <- gsub("[^a-z']","",words[i+4])
cleaned_words <- c(word1,word2,word3,word4,word5)
cleaned_words <- cleaned_words[!is.na(cleaned_words)]
start <- Sys.time()
p_5_gram <- c()
tail_word_5_gram <- c()
p_4_gram <- c()
tail_word_4_gram <- c()
p_3_gram <- c()
tail_word_3_gram <- c()
p_2_gram <- c()
tail_word_2_gram <- c()
L <- length(cleaned_words)
if (L>4) {
p_5_gram <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(cleaned_words[(L-4):(L-1)]
,collapse=" "),
'P']
if (length(p_5_gram)!=0) {
tail_word_5_gram <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(cleaned_words[(L-4):(L-1)],
collapse=" "),'word5']
names(p_5_gram) <- paste(paste(cleaned_words[(L-4):L],collapse=" "), tail_word_5_gram, sep=" ")
}
}
if (L>3) {
p_4_gram <- ngrams_4_freq[ngrams_4_freq['trigram']==paste(cleaned_words[(L-3):(L-1)],
collapse=" ")&
!ngrams_4_freq$word4 %in% tail_word_5_gram,
'P']*0.4
if (length(p_4_gram)!=0) {
tail_word_4_gram <- ngrams_4_freq[ngrams_4_freq['trigram']==paste(cleaned_words[(L-3):(L-1)],
collapse=" ") &
!ngrams_4_freq$word4 %in% tail_word_5_gram,'word4']
names(p_4_gram) <- paste(paste(cleaned_words[(L-3):L],collapse=" "), tail_word_4_gram, sep=" ")
}
}
if (L>2) {
p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
collapse=" ") &
!ngrams_3_freq$word3 %in% tail_word_4_gram &
!ngrams_3_freq$word3 %in% tail_word_5_gram,
'P']*0.4*0.4
if (length(p_3_gram)!=0) {
tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
collapse=" ") &
!ngrams_3_freq$word3 %in% tail_word_4_gram &
!ngrams_3_freq$word3 %in% tail_word_5_gram,'word3']
names(p_3_gram) <- paste(paste(cleaned_words[(L-2):L],collapse=" "), tail_word_3_gram, sep=" ")
}
}
if (L>1) {
p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
!ngrams_2_freq$word2 %in% tail_word_3_gram &
!ngrams_2_freq$word2 %in% tail_word_4_gram &
!ngrams_2_freq$word2 %in% tail_word_5_gram,
'P']*0.4*0.4*0.4
if (length(p_2_gram)!=0) {
tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
!ngrams_2_freq$word2 %in% tail_word_3_gram &
!ngrams_2_freq$word2 %in% tail_word_4_gram &
!ngrams_2_freq$word2 %in% tail_word_5_gram,
'word2']
names(p_2_gram) <- paste(paste(cleaned_words[(L-1):L],collapse=" "), tail_word_2_gram, sep=" ")
}
}
tail_word_1_gram <- dictionary[!dictionary %in% tail_word_5_gram &
!dictionary %in% tail_word_4_gram &
!dictionary %in% tail_word_3_gram &
!dictionary %in% tail_word_2_gram]
p_1_gram <- 1/length(dictionary)
p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))*0.4*0.4*0.4*0.4
names(p_1_gram) <- tail_word_1_gram
p_SBO <- sort(c(p_5_gram, p_4_gram, p_3_gram, p_2_gram, p_1_gram),decreasing=TRUE)
end <- Sys.time()
t <- as.numeric(end - start)
predicted_words <- sapply(names(p_SBO), FUN=function(x){
predicted_words <- strsplit(x," ")[[1]]
predicted_last_word <- predicted_words[length(predicted_words)]
predicted_last_word
})
names(p_SBO) <- predicted_words
if (!cleaned_words[L] %in% dictionary) {
results <- c(results,
cleaned_words[L] %in% predicted_words[1:3],
cleaned_words[L] %in% predicted_words[1:2],
cleaned_words[L] %in% predicted_words[1],
0,
1/length(dictionary),
t)
} else {
results <- c(results,
cleaned_words[L] %in% predicted_words[1:3],
cleaned_words[L] %in% predicted_words[1:2],
cleaned_words[L] %in% predicted_words[1],
unique(!is.na(predicted_words)),
1/p_SBO[cleaned_words[L]],
t)
}
}
}
if (length(results)!=0) {
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean, na.rm=TRUE))
results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
results2_means
}
})
predictions_SBO_5 <- matrix(unlist(predictions_SBO_5), ncol=6, byrow=TRUE)
predictions_SBO_5_means <- sapply(data.frame(predictions_SBO_5), FUN=mean, na.rm=TRUE)
predictions_SBO_5_means[5] <- exp(sum(log(predictions_SBO_5[,5]))/length(predictions_SBO_5[,5]))
predictions_SBO_3 <- sapply(testing_dataset_small, FUN=function(x) {
x <- tolower(x)
words <- unlist(strsplit(x, " "))
word1 <- NA
word2 <- NA
word3 <- NA
results <- c()
if (length(words)>2) {
for (i in 1:(length(words)-2)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
word3 <- gsub("[^a-z']","",words[i+2])
cleaned_words <- c(word1,word2,word3)
cleaned_words <- cleaned_words[!is.na(cleaned_words)]
start <- Sys.time()
p_3_gram <- c()
tail_word_3_gram <- c()
p_2_gram <- c()
tail_word_2_gram <- c()
L <- length(cleaned_words)
if (L>2) {
p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],collapse=" "),
'P']
if (length(p_3_gram)!=0) {
tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
collapse=" "),
'word3']
names(p_3_gram) <- paste(paste(cleaned_words[(L-2):L],collapse=" "), tail_word_3_gram, sep=" ")
}
}
if (L>1) {
p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
!ngrams_2_freq$word2 %in% tail_word_3_gram,
'P']*0.4
if (length(p_2_gram)!=0) {
tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
!ngrams_2_freq$word2 %in% tail_word_3_gram,
'word2']
names(p_2_gram) <- paste(paste(cleaned_words[(L-1):L],collapse=" "), tail_word_2_gram, sep=" ")
}
}
tail_word_1_gram <- dictionary[!dictionary %in% tail_word_3_gram &
!dictionary %in% tail_word_2_gram]
p_1_gram <- 1/length(dictionary)*0.4*0.4
p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))
names(p_1_gram) <- tail_word_1_gram
p_SBO <- sort(c(p_3_gram, p_2_gram, p_1_gram),decreasing=TRUE)
end <- Sys.time()
t <- as.numeric(end - start)
predicted_words <- sapply(names(p_SBO), FUN=function(x){
predicted_words <- strsplit(x," ")[[1]]
predicted_last_word <- predicted_words[length(predicted_words)]
predicted_last_word
})
names(p_SBO) <- predicted_words
if (!cleaned_words[L] %in% dictionary) {
results <- c(results,
cleaned_words[L] %in% predicted_words[1:3],
cleaned_words[L] %in% predicted_words[1:2],
cleaned_words[L] %in% predicted_words[1],
0,
1/length(dictionary),
t)
} else {
results <- c(results,
cleaned_words[L] %in% predicted_words[1:3],
cleaned_words[L] %in% predicted_words[1:2],
cleaned_words[L] %in% predicted_words[1],
unique(!is.na(predicted_words)),
1/p_SBO[cleaned_words[L]],
t)
}
}
}
if (length(results)!=0) {
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean, na.rm=TRUE))
results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
results2_means
}
})
predictions_SBO_3 <- matrix(unlist(predictions_SBO_3), ncol=6, byrow=TRUE)
predictions_SBO_3_means <- sapply(data.frame(predictions_SBO_3), FUN=mean, na.rm=TRUE)
predictions_SBO_3_means[5] <- exp(sum(log(predictions_SBO_3[,5]))/length(predictions_SBO_3[,5]))
results_final_SBO <- data.frame(t(cbind(predictions_SBO_5_means,
predictions_SBO_3_means)))
results_final_SBO[,6] <- round(results_final_SBO[,6],5)
results_final_SBO[,1:5] <- round(results_final_SBO[,1:5],3)
results_final_SBO[,5] <- round(results_final_SBO[,5],0)
colnames(results_final_SBO) <- c("Accuracy 3w","Accuracy 2w","Accuracy 1w",
"Coverage","Perplexity","Time (sec)")
rownames(results_final_SBO) <- c("SBO 5->1g","SBO 3->1g")
results_final_SBO
# Katz back-off (KBO): back off to lower-order models while discounting the observed probabilities with Good-Turing estimates
library(edgeR)
#calculate discounts based on the Good-Turing estimation for unobserved ngrams
discount_5_gram <- sum(goodTuring(ngrams_5_freq$freq)['proportion'][[1]])
discount_4_gram <- sum(goodTuring(ngrams_4_freq$freq)['proportion'][[1]])
discount_3_gram <- sum(goodTuring(ngrams_3_freq$freq)['proportion'][[1]])
discount_2_gram <- sum(goodTuring(ngrams_2_freq$freq)['proportion'][[1]])
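# Each discount factor is < 1; it shrinks the MLE probabilities of the observed n-grams,
# reserving the left-over probability mass for the lower-order models in the back-off chain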
example <- "i will be"
words <- unlist(strsplit(example," "))
#discounted 3-gram probabilities
p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1:2],collapse=" "), 'P'] * discount_3_gram
tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(words[1:2],collapse=" "), 'word3']
names(p_3_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_3_gram)
discounted_P_3_gram <- 1 - sum(p_3_gram)
#discounted 2-gram probabilities
p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] &
!ngrams_2_freq$word2 %in% tail_word_3_gram, 'P']
p_2_gram <- p_2_gram * discounted_P_3_gram/sum(p_2_gram) * discount_2_gram
tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==words[2] &
!ngrams_2_freq$word2 %in% tail_word_3_gram, 'word2']
names(p_2_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_2_gram, sep=" ")
discounted_P_2_gram <- 1 - sum(p_2_gram) - sum(p_3_gram)
#discounted 1-gram probabilities
tail_word_1_gram <- dictionary[!dictionary %in% tail_word_3_gram &
!dictionary %in% tail_word_2_gram]
p_1_gram <- 1/length(dictionary) * discounted_P_2_gram/sum(1/length(dictionary) * length(tail_word_1_gram))
p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))
names(p_1_gram) <- paste(paste(words[1:2],collapse=" "), tail_word_1_gram)
#final probability vector
p_final <- c(p_3_gram, p_2_gram, p_1_gram)
p_final['i will be']
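# The same chain is applied in the evaluation below: discounted higher-order probabilities first,
# the left-over mass redistributed to lower-order n-grams not already covered,
# and the remainder spread uniformly over the rest of the dictionary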
predictions_KBO_5 <- sapply(testing_dataset_small, FUN=function(x) {
x <- tolower(x)
words <- unlist(strsplit(x, " "))
results <- c()
word1 <- NA
word2 <- NA
word3 <- NA
word4 <- NA
word5 <- NA
if (length(words)>2) {
for (i in 1:(length(words)-2)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
word3 <- gsub("[^a-z']","",words[i+2])
word4 <- gsub("[^a-z']","",words[i+3])
word5 <- gsub("[^a-z']","",words[i+4])
cleaned_words <- c(word1,word2,word3,word4,word5)
cleaned_words <- cleaned_words[!is.na(cleaned_words)]
start <- Sys.time()
p_5_gram <- c()
tail_word_5_gram <- c()
discounted_P_5_gram <- 1
p_4_gram <- c()
tail_word_4_gram <- c()
discounted_P_4_gram <- 1
p_3_gram <- c()
tail_word_3_gram <- c()
discounted_P_3_gram <- 1
p_2_gram <- c()
tail_word_2_gram <- c()
discounted_P_2_gram <- 1
L <- length(cleaned_words)
if (L>4) {
p_5_gram <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(cleaned_words[(L-4):(L-1)],
collapse=" "),
'P']
p_5_gram <- p_5_gram * discount_5_gram
if (length(p_5_gram)!=0) {
tail_word_5_gram <- ngrams_5_freq[ngrams_5_freq['tetragram']==paste(cleaned_words[(L-4):(L-1)],
collapse=" "),'word5']
names(p_5_gram) <- paste(paste(cleaned_words[(L-4):(L-1)],collapse=" "),
tail_word_5_gram, sep=" ")
discounted_P_5_gram <- 1 - sum(p_5_gram)
}
}
if (L>3) {
p_4_gram <- ngrams_4_freq[ngrams_4_freq['trigram']==paste(cleaned_words[(L-3):(L-1)],
collapse=" ") &
!ngrams_4_freq$word4 %in% tail_word_5_gram,
'P']
p_4_gram <- p_4_gram * discounted_P_5_gram/sum(p_4_gram) * discount_4_gram
if (length(p_4_gram)!=0) {
tail_word_4_gram <- ngrams_4_freq[ngrams_4_freq['trigram']==paste(cleaned_words[(L-3):(L-1)],
collapse=" ") &
!ngrams_4_freq$word4 %in% tail_word_5_gram,'word4']
names(p_4_gram) <- paste(paste(cleaned_words[(L-3):(L-1)],collapse=" "),
tail_word_4_gram, sep=" ")
discounted_P_4_gram <- 1 - sum(p_4_gram) - sum(p_5_gram)
}
}
if (L>2) {
p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
collapse=" ") &
!ngrams_3_freq$word3 %in% tail_word_4_gram &
!ngrams_3_freq$word3 %in% tail_word_5_gram,
'P']
p_3_gram <- p_3_gram * discounted_P_4_gram/sum(p_3_gram) * discount_3_gram
if (length(p_3_gram)!=0) {
tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
collapse=" ") &
!ngrams_3_freq$word3 %in% tail_word_4_gram &
!ngrams_3_freq$word3 %in% tail_word_5_gram,'word3']
names(p_3_gram) <- paste(paste(cleaned_words[(L-2):(L-1)],
collapse=" "), tail_word_3_gram, sep=" ")
discounted_P_3_gram <- 1 - sum(p_3_gram) - sum(p_4_gram) - sum(p_5_gram)
}
}
if (L>1) {
p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
!ngrams_2_freq$word2 %in% tail_word_3_gram &
!ngrams_2_freq$word2 %in% tail_word_4_gram &
!ngrams_2_freq$word2 %in% tail_word_5_gram,
'P']
p_2_gram <- p_2_gram * discounted_P_3_gram/sum(p_2_gram) * discount_2_gram
if (length(p_2_gram)!=0) {
tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
!ngrams_2_freq$word2 %in% tail_word_3_gram &
!ngrams_2_freq$word2 %in% tail_word_4_gram &
!ngrams_2_freq$word2 %in% tail_word_5_gram,
'word2']
names(p_2_gram) <- paste(cleaned_words[L-1], tail_word_2_gram, sep=" ")
discounted_P_2_gram <- 1 - sum(p_2_gram) - sum(p_3_gram) - sum(p_4_gram) - sum(p_5_gram)
}
}
tail_word_1_gram <- dictionary[!dictionary %in% tail_word_5_gram &
!dictionary %in% tail_word_4_gram &
!dictionary %in% tail_word_3_gram &
!dictionary %in% tail_word_2_gram]
p_1_gram <- 1/length(dictionary) * discounted_P_2_gram/sum(1/length(dictionary) *
length(tail_word_1_gram))
p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))
names(p_1_gram) <- tail_word_1_gram
p_KBO <- sort(c(p_5_gram, p_4_gram, p_3_gram, p_2_gram, p_1_gram),decreasing=TRUE)
end <- Sys.time()
t <- as.numeric(end - start)
predicted_words <- sapply(names(p_KBO), FUN=function(x){
predicted_words <- strsplit(x," ")[[1]]
predicted_last_word <- predicted_words[length(predicted_words)]
predicted_last_word
})
names(p_KBO) <- predicted_words
if (!cleaned_words[L] %in% dictionary) {
results <- c(results,
cleaned_words[L] %in% predicted_words[1:3],
cleaned_words[L] %in% predicted_words[1:2],
cleaned_words[L] %in% predicted_words[1],
0,
1/length(dictionary),
t)
} else {
results <- c(results,
cleaned_words[L] %in% predicted_words[1:3],
cleaned_words[L] %in% predicted_words[1:2],
cleaned_words[L] %in% predicted_words[1],
unique(!is.na(predicted_words)),
1/p_KBO[cleaned_words[L]],
t)
}
}
}
if (length(results)) {
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean, na.rm=TRUE))
results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
results2_means
}
})
predictions_KBO_5 <- matrix(unlist(predictions_KBO_5), ncol=6, byrow=TRUE)
predictions_KBO_5_means <- sapply(data.frame(predictions_KBO_5), FUN=mean, na.rm=TRUE)
predictions_KBO_5_means[5] <- exp(sum(log(predictions_KBO_5[,5]))/length(predictions_KBO_5[,5]))
predictions_KBO_3 <- sapply(testing_dataset_small, FUN=function(x) {
x <- tolower(x)
words <- unlist(strsplit(x, " "))
results <- c()
word1 <- NA
word2 <- NA
word3 <- NA
if (length(words)>2) {
for (i in 1:(length(words)-2)) {
word1 <- gsub("[^a-z']","",words[i])
word2 <- gsub("[^a-z']","",words[i+1])
word3 <- gsub("[^a-z']","",words[i+2])
cleaned_words <- c(word1,word2,word3)
# cleaned_words <- c("lot","of",NA,NA,NA)
cleaned_words <- cleaned_words[!is.na(cleaned_words)]
p_3_gram <- c()
tail_word_3_gram <- c()
discounted_P_3_gram <- 1
p_2_gram <- c()
tail_word_2_gram <- c()
discounted_P_2_gram <- 1
L <- length(cleaned_words)
start <- Sys.time()
if (L>2) {
p_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
collapse=" "),
'P']
p_3_gram <- p_3_gram * discount_3_gram
if (length(p_3_gram)!=0) {
tail_word_3_gram <- ngrams_3_freq[ngrams_3_freq['bigram']==paste(cleaned_words[(L-2):(L-1)],
collapse=" "),
'word3']
names(p_3_gram) <- paste(paste(cleaned_words[(L-2):(L-1)],collapse=" "),
tail_word_3_gram, sep=" ")
discounted_P_3_gram <- 1 - sum(p_3_gram)
}
}
if (L>1) {
p_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
!ngrams_2_freq$word2 %in% tail_word_3_gram,
'P']
p_2_gram <- p_2_gram * discounted_P_3_gram/sum(p_2_gram) * discount_2_gram
if (length(p_2_gram)!=0) {
tail_word_2_gram <- ngrams_2_freq[ngrams_2_freq['word1']==cleaned_words[L-1] &
!ngrams_2_freq$word2 %in% tail_word_3_gram,
'word2']
names(p_2_gram) <- paste(cleaned_words[L-1], tail_word_2_gram, sep=" ")
discounted_P_2_gram <- 1 - sum(p_2_gram) - sum(p_3_gram)
}
}
tail_word_1_gram <- dictionary[!dictionary %in% tail_word_3_gram &
!dictionary %in% tail_word_2_gram]
p_1_gram <- 1/length(dictionary) * discounted_P_2_gram/sum(1/length(dictionary) *
length(tail_word_1_gram))
p_1_gram <- rep(p_1_gram, length(tail_word_1_gram))
names(p_1_gram) <- tail_word_1_gram
p_KBO <- sort(c(p_3_gram, p_2_gram, p_1_gram),decreasing=TRUE)
end <- Sys.time()
t <- as.numeric(end - start)
predicted_words <- sapply(names(p_KBO), FUN=function(x){
predicted_words <- strsplit(x," ")[[1]]
predicted_last_word <- predicted_words[length(predicted_words)]
predicted_last_word
})
names(p_KBO) <- predicted_words
if (!cleaned_words[L] %in% dictionary) {
results <- c(results,
cleaned_words[L] %in% predicted_words[1:3],
cleaned_words[L] %in% predicted_words[1:2],
cleaned_words[L] %in% predicted_words[1],
0,
1/length(dictionary),
t)
} else {
results <- c(results,
cleaned_words[L] %in% predicted_words[1:3],
cleaned_words[L] %in% predicted_words[1:2],
cleaned_words[L] %in% predicted_words[1],
unique(!is.na(predicted_words)),
1/p_KBO[cleaned_words[L]],
t)
}
}
}
if (length(results)) {
results2 <- matrix(results, ncol=6, byrow=TRUE)
results2_means <- unlist(lapply(data.frame(results2), FUN=mean, na.rm=TRUE))
results2_means[5] <- exp(sum(log(results2[,5]))/length(results2[,5]))
results2_means
}
})
predictions_KBO_3 <- matrix(unlist(predictions_KBO_3), ncol=6, byrow=TRUE)
predictions_KBO_3_means <- sapply(data.frame(predictions_KBO_3), FUN=mean, na.rm=TRUE)
predictions_KBO_3_means[5] <- exp(sum(log(predictions_KBO_3[,5]))/length(predictions_KBO_3[,5]))
results_final_KBO <- data.frame(t(cbind(predictions_KBO_5_means,
predictions_KBO_3_means)))
results_final_KBO[,6] <- round(results_final_KBO[,6],5)
results_final_KBO[,1:5] <- round(results_final_KBO[,1:5],3)
results_final_KBO[,5] <- round(results_final_KBO[,5],0)
colnames(results_final_KBO) <- c("Accuracy 3w","Accuracy 2w","Accuracy 1w",
"Coverage","Perplexity","Time (sec)")
rownames(results_final_KBO) <- c("KBO 5->1g","KBO 3->1g")
results_final_KBO
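# Final comparison of all evaluated models: plain MLE n-grams, linear interpolation,
# Laplace smoothing, stupid back-off (SBO) and Katz back-off (KBO)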
data.frame(rbind(results_final,
results_final_liniter,
results_final_laplace,
results_final_SBO,
results_final_KBO))