This is the first milestone report in the Data Science Capstone project. The project is about predicting the next word from the context entered by the user. The goal of this document is to convey the results of the exploratory analysis of the datasets provided along with the assignment, and to formulate an approach to modelling the prediction. The latter will be used in the development of an app delivering real-time predictions.
To limit the scope and reduce runtime, we restricted the input source to Twitter, with the locale fixed to en_US, randomly sampling 5000 lines of the source corpus.
The first step is to construct a corpus using the tm package.
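A minimal sketch of the corpus construction is shown below, assuming the raw Twitter file is available locally as en_US.twitter.txt; the file path, the random seed and the exact cleaning steps are illustrative:

```r
library(tm)

# read the raw tweets and randomly sample 5000 lines (file path is an assumption)
lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
set.seed(1234)
sample_lines <- sample(lines, 5000)

# build the corpus and apply standard tm cleaning transformations
corpus <- VCorpus(VectorSource(sample_lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
```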
The total number of lines is 5000, containing about 4.7945 × 10^4 (≈ 47,945) words, among which 9816 are unique.
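These counts can be obtained from a term-document matrix; a sketch is given below (the object name word_freq is chosen to match the one referenced in the modelling code later in this report):

```r
# term-document matrix over the cleaned corpus
tdm <- TermDocumentMatrix(corpus)

# total frequency of each term across all tweets, sorted in decreasing order
# (slam is a dependency of tm, so row_sums is available)
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = freq)
rownames(word_freq) <- names(freq)

length(freq)   # number of unique words
sum(freq)      # total number of words
```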
Short descriptive statistics for the word frequencies and the most frequent words are as follows.

| word | freq | word | freq | word | freq | word | freq | word | freq | word | freq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the | 1915 | from | 180 | still | 95 | 4 items | 57 | 8 items | 29 | 6147 items | 1 |
| you | 1151 | today | 175 | thank | 90 | looking | 56 | 9 items | 28 | ||
| and | 886 | know | 174 | had | 89 | 2 items | 55 | 9 items | 27 | ||
| for | 799 | can | 169 | night | 88 | 2 items | 54 | 10 items | 26 | ||
| that | 533 | time | 163 | 2 items | 87 | 4 items | 53 | 11 items | 25 | ||
| with | 397 | how | 162 | 3 items | 86 | 3 items | 52 | 10 items | 24 | ||
| your | 370 | now | 157 | been | 85 | 2 items | 51 | 9 items | 23 | ||
| have | 366 | new | 154 | best | 83 | 3 items | 50 | 14 items | 22 | ||
| this | 340 | great | 153 | tonight | 82 | 5 items | 49 | 14 items | 21 | ||
| just | 319 | see | 152 | she | 81 | 3 items | 48 | 16 items | 20 | ||
| but | 305 | our | 148 | 2 items | 80 | 3 items | 47 | 18 items | 19 | ||
| are | 297 | they | 140 | 3 items | 79 | 5 items | 46 | 19 items | 18 | ||
| its | 281 | would | 136 | work | 78 | 4 items | 45 | 15 items | 17 | ||
| all | 279 | lol | 134 | off | 77 | 8 items | 44 | 22 items | 16 | ||
| out | 258 | back | 133 | life | 75 | 5 items | 43 | 25 items | 15 | ||
| what | 245 | there | 129 | then | 73 | 4 items | 42 | 38 items | 14 | ||
| get | 241 | too | 128 | 3 items | 71 | 4 items | 41 | 45 items | 13 | ||
| like | 240 | some | 122 | 2 items | 70 | 7 items | 40 | 38 items | 12 | ||
| 2 items | 238 | more | 117 | should | 69 | 3 items | 39 | 49 items | 11 | ||
| love | 231 | cant | 116 | 2 items | 68 | 4 items | 38 | 60 items | 10 | ||
| good | 206 | happy | 115 | 2 items | 67 | 4 items | 37 | 64 items | 9 | ||
| thanks | 204 | need | 109 | 3 items | 66 | 3 items | 36 | 89 items | 8 | ||
| will | 192 | 3 items | 108 | 3 items | 64 | 7 items | 35 | 122 items | 7 | ||
| about | 191 | 2 items | 105 | their | 63 | 2 items | 34 | 174 items | 6 | ||
| day | 189 | people | 104 | 3 items | 62 | 9 items | 33 | 239 items | 5 | ||
| when | 188 | who | 101 | way | 61 | 9 items | 32 | 353 items | 4 | ||
| dont | 185 | 2 items | 100 | 3 items | 60 | 4 items | 31 | 608 items | 3 | ||
| one | 184 | 4 items | 98 | 2 items | 58 | 7 items | 30 | 1348 items | 2 |
There is a theoretical model behind these statistics (shown as a solid line in the rank-frequency plot): as the modelling shows, the word frequencies can be approximated by the power function count ≈ 1902.51 * word_rank^(-0.73).
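The coefficients of such a power function can be estimated, for example, with a nonlinear least-squares fit of frequency against rank; a sketch, with illustrative starting values:

```r
# rank-frequency data from the sorted frequency table
rank_freq <- data.frame(rank = seq_along(word_freq$freq), freq = word_freq$freq)

# fit count ~ a * rank^b by nonlinear least squares
fit <- nls(freq ~ a * rank^b, data = rank_freq, start = list(a = 2000, b = -0.7))
coef(fit)   # approximately a = 1902.51, b = -0.73 on this sample, as reported above
```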
A similar frequency table for the 2-grams shows a much flatter distribution (note that some word endings appear truncated in the n-gram tokens):

| word | freq | word | freq | word | freq |
|---|---|---|---|---|---|
| in_th | 155 | 5 items | 35 | ||
| for_th | 152 | 4 items | 33 | ||
| of_th | 116 | 3 items | 32 | ||
| to_b | 101 | 3 items | 31 | ||
| 2 items | 98 | this_i | 30 | ||
| i_hav | 91 | 3 items | 29 | ||
| at_th | 90 | 4 items | 27 | ||
| thank_you | 77 | 7 items | 26 | ||
| i_lov | 75 | 6 items | 25 | ||
| have_a | 72 | 7 items | 24 | ||
| to_th | 70 | 8 items | 23 | ||
| to_get | 63 | 6 items | 22 | ||
| to_se | 62 | 11 items | 21 | ||
| 2 items | 59 | 10 items | 20 | ||
| for_a | 58 | 11 items | 19 | ||
| 2 items | 57 | 17 items | 18 | ||
| 2 items | 56 | 16 items | 17 | ||
| if_you | 55 | 18 items | 16 | ||
| 2 items | 54 | 18 items | 15 | ||
| 2 items | 53 | 20 items | 14 | ||
| i_just | 49 | 33 items | 13 | ||
| 3 items | 47 | 34 items | 12 | ||
| 3 items | 46 | 52 items | 11 | ||
| 3 items | 45 | 69 items | 10 | ||
| 3 items | 44 | 68 items | 9 | ||
| 2 items | 43 | 98 items | 8 | ||
| 2 items | 42 | 116 items | 7 | ||
| 4 items | 41 | 190 items | 6 | ||
| 2 items | 40 | 268 items | 5 | ||
| 3 items | 39 | 508 items | 4 | ||
| it_wa | 38 | 999 items | 3 | ||
| 2 items | 37 | 2948 items | 2 | ||
| 3 items | 36 | 32509 items | 1 |
Obviously, for the 3-grams the distribution is even more uniform, with very little repetition among the actually occurring 3-grams (most appear only once):
| word | freq | word | freq |
|---|---|---|---|
| thanks_for_th | 51 | ||
| 4 items | 20 | ||
| thank_you_for | 18 | ||
| i_love_you | 16 | ||
| one_of_th | 15 | ||
| 2 items | 14 | ||
| 3 items | 13 | ||
| a_lot_of | 12 | ||
| 5 items | 11 | ||
| 3 items | 10 | ||
| 12 items | 9 | ||
| 8 items | 8 | ||
| 19 items | 7 | ||
| 27 items | 6 | ||
| 48 items | 5 | ||
| 125 items | 4 | ||
| 237 items | 3 | ||
| 1308 items | 2 | ||
| 47274 items | 1 |
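The bigram and trigram counts above can be reproduced with a simple base-R tokenizer; the exact tokenizer used for the tables is not shown (and appears to truncate some word endings), so the following is only a sketch producing underscore-joined n-grams from the cleaned corpus:

```r
# join every run of n consecutive tokens with "_" (matching the tables above)
make_ngrams <- function(texts, n) {
  unlist(lapply(strsplit(texts, "\\s+"), function(tokens) {
    tokens <- tokens[tokens != ""]
    if (length(tokens) < n) return(character(0))
    sapply(seq_len(length(tokens) - n + 1), function(i) {
      paste(tokens[i:(i + n - 1)], collapse = "_")
    })
  }))
}

cleaned <- unlist(lapply(corpus, as.character))
bigram_freq  <- sort(table(make_ngrams(cleaned, 2)), decreasing = TRUE)
trigram_freq <- sort(table(make_ngrams(cleaned, 3)), decreasing = TRUE)
head(bigram_freq, 3)
```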
We investigated the Gini (Lorenz) curve for the words, based on the word frequency distribution shown above. It appears that 217 unique words (2.2%) cover 50% of all word occurrences in the corpus, while 5022 unique words (51.2%) constitute 90% of all words.
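These coverage figures can be computed directly from the sorted frequencies in the word_freq table; a minimal sketch:

```r
# cumulative share of all word occurrences covered by the top-ranked words
coverage <- cumsum(word_freq$freq) / sum(word_freq$freq)

# number of unique words needed to cover 50% and 90% of all occurrences
which(coverage >= 0.5)[1]   # 217 words in this sample
which(coverage >= 0.9)[1]   # 5022 words in this sample
```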
The most straightforward way to detect foreign words is to check the non-English words for presence in other language dictionaries. For this purpose we applied the hunspell package; however, several commonly used words, names and acronyms appear to be missing from the typical en dictionary while being present in others. Nevertheless, a quite natural presence of foreign words is observed, namely:
German:
## [1] "matt" "los" "albert" "dem" "http" "las" "allen" "dir"
## [9] "phil" "tue" "acta" "adele" "andres" "artest" "das" "fing"
## [17] "holdem" "reich" "wer" "zimmer"
And Spanish:
## [1] "san" "vegas" "cuz" "florida" "dallas"
## [6] "diego" "cinco" "york" "ron" "james"
## [11] "los" "usa" "california" "carolina" "dan"
## [16] "del" "indiana" "las" "prez" "seo"
## [21] "ala" "aries" "bella" "ben" "bruce"
## [26] "charles" "cole" "gilbert" "leo" "lisa"
## [31] "maya" "orlando" "rivera" "rubio" "acta"
## [36] "ali" "ama" "americana" "arcadia" "argentina"
## [41] "axe" "ballesteros" "barbados" "barbara" "bel"
## [46] "bernardo" "bon" "bueno" "caliente" "capote"
## [51] "casa" "ceda" "centre" "cha" "chacha"
## [56] "chica" "chorizo" "coa" "colo" "colorado"
## [61] "conjuguemos" "cumbre" "das" "delgado" "diablo"
## [66] "diegos" "dolo" "ele" "espada" "estela"
## [71] "floridas" "fonda" "francisco" "franciscos" "gagas"
## [76] "gaza" "gigante" "gis" "gomer" "gracias"
## [81] "indianas" "industria" "jamba" "jane" "jazmines"
## [86] "junto" "juvenal" "latina" "latinas" "latino"
## [91] "latinos" "lego" "legos" "libros" "luis"
## [96] "lunas" "mariano" "mauro" "maxwell" "mira"
## [101] "mirasol" "mondo" "mongo" "monte" "moro"
## [106] "musa" "muslim" "nazi" "nene" "nicaragua"
## [111] "nidal" "nin" "ocho" "olas" "pamela"
## [116] "picos" "poste" "puerto" "rajan" "ramones"
## [121] "ria" "romano" "santa" "santiago" "santos"
## [126] "sean" "soso" "sudan" "sus" "tatas"
## [131] "tıme" "titos" "todo" "toro" "tosa"
## [136] "ufos" "usas" "uta" "vargas" "verde"
## [141] "vid" "yaya"
The process could be refined by excluding names, explicit misspellings and abbreviations, and perhaps by referring to other dictionaries, taking the frequency counts into account.
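A sketch of the dictionary check with hunspell is shown below; it assumes the de_DE and es_ES dictionaries are installed on the system (dictionary availability varies by platform):

```r
library(hunspell)

words <- rownames(word_freq)

# words not recognised by the default en_US dictionary
non_english <- words[!hunspell_check(words)]

# of those, the words accepted by the German and Spanish dictionaries
german  <- non_english[hunspell_check(non_english, dict = dictionary("de_DE"))]
spanish <- non_english[hunspell_check(non_english, dict = dictionary("es_ES"))]
```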
One possible way to “manually” increase the coverage is to replace words in the corpus with their synonyms, keeping just the most frequent representative (see the sketch after the table below). Here are the synonym sets (nouns) for the top words:
| word | synonyms | freq |
|---|---|---|
| love | love,dear,passion,honey,beloved | 231 |
| good | good,goodness | 206 |
| one | one,single,ace | 184 |
| can | can,ass,behind,john,butt,toilet,seat,bottom,potty,throne,tin,bathroom,pot,stern | 169 |
| time | time,sentence | 163 |
| back | back,cover | 133 |
| need | need,want,demand,motivation | 109 |
| make | make,brand | 105 |
| people | people,mass | 104 |
| going | going,release,loss,exit,leaving,passing,expiration | 98 |
| want | need,want,wish,wishing,lack | 98 |
| night | night,dark | 88 |
| come | come,seed | 86 |
| hope | hope,promise | 86 |
| last | last,end,close,death,finish,finale | 86 |
| work | work,study,workplace | 78 |
| life | life,living,lifetime,spirit | 75 |
| game | game,biz | 67 |
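A sketch of the replacement step itself, assuming a lookup table syn_map that maps each synonym to the most frequent representative of its set (the entries shown are illustrative, taken from the table above):

```r
# illustrative mapping: synonym -> most frequent representative
syn_map <- c(dear = "love", beloved = "love", goodness = "good",
             wish = "want", lack = "want", lifetime = "life")

replace_synonyms <- function(tokens, syn_map) {
  replaced <- syn_map[tokens]
  # keep the original token where no synonym mapping exists
  unname(ifelse(is.na(replaced), tokens, replaced))
}

replace_synonyms(c("my", "beloved", "goodness"), syn_map)   # returns "my" "love" "good"
```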
There is also the problem that not all the words from the corpus are contained in the standard dictionary, even if we exclude the names, foreign words, typos, etc.
The possible solutions could be:

- Include all such findings in an extended dictionary.
- Correct typos, using a relevant typo dictionary or a string-distance measure.
- Use a subject-related dictionary, e.g. Twitter slang or abbreviations.
- Apply a general “standardization” of the dictionary; the “synonyms” approach is naive, but a more robust variant may be sound (e.g. investigating word variations).
Basic modelling could be done through the estimated word-to-word transition probabilities. Smoothing is applied in the most trivial way: adding a small value and then normalizing. For the sake of simplicity, we provide the code only for the 1-word-to-1-word transitions.
sequences <- lapply(corpus, function(doc) strsplit(as.character(doc), " ")[[1]])

build_transition_matrix <- function(sequences) {
  # the vocabulary comes from the word frequency table computed earlier
  vocabulary <- rownames(word_freq)
  vocabulary_size <- length(vocabulary)
  transition_matrix <- matrix(0, nrow = vocabulary_size, ncol = vocabulary_size)
  for (sequence in sequences) {
    # map the words of the current sequence to vocabulary indices
    word_index <- match(sequence, vocabulary)
    if (length(word_index) < 2) next
    for (i in 1:(length(word_index) - 1)) {
      current_word <- word_index[i]
      next_word <- word_index[i + 1]
      # skip pairs containing out-of-vocabulary words
      if (is.na(current_word) || is.na(next_word)) next
      transition_matrix[current_word, next_word] <- transition_matrix[current_word, next_word] + 1
    }
  }
  # normalise each row into transition probabilities (all-zero rows are left as zeros)
  transition_matrix <- t(apply(transition_matrix, 1, function(row) {
    s <- sum(row)
    if (s > 0) row / s else row
  }))
  return(list(vocabulary = vocabulary, transition_matrix = transition_matrix))
}

model <- build_transition_matrix(sequences)
vocabulary <- model$vocabulary
transition_matrix <- model$transition_matrix

predict_next_word <- function(sequence) {
  # look up the last word of the input sequence (unknown words are mapped to the
  # vocabulary beforehand, see the stringdist example below)
  current_word <- match(sequence[length(sequence)], vocabulary)
  next_word_probabilities <- transition_matrix[current_word, ]
  # trivial smoothing: add a small constant and renormalise
  next_word_probabilities <- next_word_probabilities + 0.0001
  next_word_probabilities <- next_word_probabilities / sum(next_word_probabilities)
  # sample the next word according to the smoothed probabilities
  next_word <- sample(1:length(vocabulary), size = 1, prob = next_word_probabilities)
  return(vocabulary[next_word])
}

# Example usage: generate a prediction for a given sequence
sequence <- c("needs", "are")
word_predict <- predict_next_word(sequence)
# Print the predicted next word
print(word_predict)
## [1] "just"
The model for unseen words (n-grams) could be built by assuming the input is a typo, hence finding the closest string among the vocabulary entries:
require(stringdist)

# map each unseen / misspelled word to the closest word in the vocabulary
sequence <- c("needs", "ar1")
sequence <- sapply(sequence, function(seq) {
  similarity <- stringsim(seq, vocabulary)
  vocabulary[which(similarity == max(similarity))[1]]
})

word_predict <- predict_next_word(sequence)
# Print the predicted next word
print(word_predict)
## [1] "just"
The most memory-efficient way would be to use Matrix(transition_matrix, sparse = TRUE); however, there was not enough time to implement it in this iteration.
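A minimal sketch of the sparse representation with the Matrix package (the actual memory saving depends on how dense the transition counts are):

```r
library(Matrix)

# store the transition matrix in sparse form; zero entries take no space
sparse_tm <- Matrix(transition_matrix, sparse = TRUE)

# compare the memory footprints of the dense and sparse versions
object.size(transition_matrix)
object.size(sparse_tm)
```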
Firstly, I can group the words by frequency bins and estimate transitions between bins, and then implement different submodels (frequent-to-frequent, etc.).
Then I can group the terms by context, leveraging a simple language model: the transition first selects the next “cluster” and then picks a representative from it. This should be memory-friendly.
I can also group words by synonyms, reducing the vocabulary size.
Finally, I can reduce the number of rare n-gram combinations, replacing them with a random choice (simulated using the probabilities derived earlier).
In the 2-gram approach, I need n^2 parameters, which are the transition probabilities between single words; if talking about 2-gram-to-2-gram transitions, n_2gram^2 parameters will be needed.
As for smoothing, I applied a “bounce from zero” approach; in a perfect world, a smoothing scheme such as linear interpolation could be applied to replace, say, a rare 3-gram with a weighted combination of its 1- and 2-grams.
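For illustration, a sketch of what linear interpolation would look like, assuming unigram, bigram and trigram probability functions p1, p2 and p3 are available (the functions and the weights are hypothetical; the weights would normally be tuned on held-out data and should sum to one):

```r
# interpolated probability of w3 following the context (w1, w2)
interp_prob <- function(w1, w2, w3, p1, p2, p3,
                        lambda = c(0.1, 0.3, 0.6)) {
  lambda[1] * p1(w3) +
    lambda[2] * p2(w2, w3) +
    lambda[3] * p3(w1, w2, w3)
}
```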
The evaluation could be done with the standard approach: cross-validation using a train/test split.
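A sketch of such an evaluation, measuring how often the sampled prediction matches the actual next word on held-out tweets (the 80/20 split and the seed are illustrative; note that predict_next_word reads the global vocabulary and transition_matrix, which are rebuilt here from the training part only):

```r
set.seed(42)
test_idx  <- sample(length(sequences), round(0.2 * length(sequences)))
train_seq <- sequences[-test_idx]
test_seq  <- sequences[test_idx]

# refit the transition model on the training part only
model <- build_transition_matrix(train_seq)
vocabulary <- model$vocabulary
transition_matrix <- model$transition_matrix

# accuracy: share of held-out in-vocabulary word pairs where the prediction matches
hits <- unlist(lapply(test_seq, function(s) {
  if (length(s) < 2) return(NULL)
  w    <- head(s, -1)
  nxt  <- tail(s, -1)
  keep <- w %in% vocabulary & nxt %in% vocabulary
  if (!any(keep)) return(NULL)
  mapply(function(a, b) identical(predict_next_word(a), b), w[keep], nxt[keep])
}))
mean(hits)
```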
If the 3-gram entered by the user is not present in the data, I can either discard its first word and predict from the remaining 2-gram, or predict the third element from the first two words and then predict from the resulting second pair, or from the whole 3-gram if it turns out to be present.
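A sketch of this back-off logic, assuming the underscore-joined bigram_freq and trigram_freq tables from the n-gram sketch earlier; it returns the most frequent continuation of the longest matching context:

```r
# predict the next word from up to the last two words, backing off to shorter contexts
predict_backoff <- function(context, trigram_freq, bigram_freq) {
  n <- length(context)
  if (n >= 2) {
    # try the full two-word context first
    prefix <- paste(context[(n - 1):n], collapse = "_")
    cand <- names(trigram_freq)[startsWith(names(trigram_freq), paste0(prefix, "_"))]
    if (length(cand) > 0) return(sub(".*_", "", cand[which.max(trigram_freq[cand])]))
  }
  # back off to the last word only
  prefix <- context[n]
  cand <- names(bigram_freq)[startsWith(names(bigram_freq), paste0(prefix, "_"))]
  if (length(cand) > 0) return(sub(".*_", "", cand[which.max(bigram_freq[cand])]))
  NA_character_  # no match at any level
}

predict_backoff(c("thanks", "for"), trigram_freq, bigram_freq)   # most likely "the" on this sample
```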
We have described the general features of one of the datasets and suggested the most trivial model for the sake of concept evaluation.
Other features, such as extensive testing, scaling, adjusting for rare events, and the 2-/3-gram implementation, will be addressed in the course of product development.