Intro

This is the first report in the Data Science Capstone project. The project is about predicting the next word given the context typed by the user. The goal of this document is to convey the results of the exploratory analysis of the training datasets provided along with the assignment, and to formulate the approach for modelling the prediction. The latter will be used in the development of the app delivering real-time prediction.

Exploratory analysis

Overview of the training datasets

For the sake of limiting the scope and reducing the runtime, we have restricted the input source to the twitter file, with the locale fixed to en_US, and randomly sampled 5000 lines from the source corpus.

The first step is to construct a corpus using the tm package.

The total number of lines is 5000; together they contain about 47,945 (4.7945 × 10^4) words, among which 9816 are unique.
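
As a reference for how the objects used later in this report (corpus, word_freq) could be built, here is a minimal sketch; the file name, the sampling seed and the exact cleaning steps are assumptions, not the code actually used.

library(tm)

set.seed(123)
lines <- readLines("en_US.twitter.txt", skipNul = TRUE)   # file path is an assumption
lines <- sample(lines, 5000)

corpus <- VCorpus(VectorSource(lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Word frequencies over the whole sample, sorted in decreasing order
tdm <- TermDocumentMatrix(corpus)
freqs <- sort(slam::row_sums(tdm), decreasing = TRUE)
word_freq <- data.frame(word = names(freqs), freq = as.integer(freqs))
rownames(word_freq) <- word_freq$word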

Short descriptive statistics for the word frequencies and the most frequent words are given below. In the table, an entry such as “n items” means that n distinct words share the frequency shown next to it.
word freq word freq word freq word freq word freq word freq
the 1915 from 180 still 95 4 items 57 8 items 29 6147 items 1
you 1151 today 175 thank 90 looking 56 9 items 28
and 886 know 174 had 89 2 items 55 9 items 27
for 799 can 169 night 88 2 items 54 10 items 26
that 533 time 163 2 items 87 4 items 53 11 items 25
with 397 how 162 3 items 86 3 items 52 10 items 24
your 370 now 157 been 85 2 items 51 9 items 23
have 366 new 154 best 83 3 items 50 14 items 22
this 340 great 153 tonight 82 5 items 49 14 items 21
just 319 see 152 she 81 3 items 48 16 items 20
but 305 our 148 2 items 80 3 items 47 18 items 19
are 297 they 140 3 items 79 5 items 46 19 items 18
its 281 would 136 work 78 4 items 45 15 items 17
all 279 lol 134 off 77 8 items 44 22 items 16
out 258 back 133 life 75 5 items 43 25 items 15
what 245 there 129 then 73 4 items 42 38 items 14
get 241 too 128 3 items 71 4 items 41 45 items 13
like 240 some 122 2 items 70 7 items 40 38 items 12
2 items 238 more 117 should 69 3 items 39 49 items 11
love 231 cant 116 2 items 68 4 items 38 60 items 10
good 206 happy 115 2 items 67 4 items 37 64 items 9
thanks 204 need 109 3 items 66 3 items 36 89 items 8
will 192 3 items 108 3 items 64 7 items 35 122 items 7
about 191 2 items 105 their 63 2 items 34 174 items 6
day 189 people 104 3 items 62 9 items 33 239 items 5
when 188 who 101 way 61 9 items 32 353 items 4
dont 185 2 items 100 3 items 60 4 items 31 608 items 3
one 184 4 items 98 2 items 58 7 items 30 1348 items 2

There is a theoretical model behind these statistics, shown as a solid line in the graph above; as the modelling shows, the word frequencies can be approximated by the power function count ~ 1902.51*word_rank^-0.73.
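
The fit quoted above can be reproduced with a simple log-log regression; a minimal sketch, assuming word_freq is sorted by decreasing frequency (the exact coefficients depend on the sample):

word_rank <- seq_len(nrow(word_freq))
fit <- lm(log(word_freq$freq) ~ log(word_rank))
a <- exp(coef(fit)[1])   # multiplicative constant, ~1902.51 on the analysed sample
b <- coef(fit)[2]        # exponent, ~-0.73 on the analysed sample
# Fitted curve: count ~ a * word_rank^b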

Frequencies of 2- and 3-grams

For the 2-gram frequencies, the summary of the observations is as follows (the same “n items” convention as above applies):
word freq word freq word freq
in_th 155 5 items 35
for_th 152 4 items 33
of_th 116 3 items 32
to_b 101 3 items 31
2 items 98 this_i 30
i_hav 91 3 items 29
at_th 90 4 items 27
thank_you 77 7 items 26
i_lov 75 6 items 25
have_a 72 7 items 24
to_th 70 8 items 23
to_get 63 6 items 22
to_se 62 11 items 21
2 items 59 10 items 20
for_a 58 11 items 19
2 items 57 17 items 18
2 items 56 16 items 17
if_you 55 18 items 16
2 items 54 18 items 15
2 items 53 20 items 14
i_just 49 33 items 13
3 items 47 34 items 12
3 items 46 52 items 11
3 items 45 69 items 10
3 items 44 68 items 9
2 items 43 98 items 8
2 items 42 116 items 7
4 items 41 190 items 6
2 items 40 268 items 5
3 items 39 508 items 4
it_wa 38 999 items 3
2 items 37 2948 items 2
3 items 36 32509 items 1

As expected, for the 3-grams the distribution is even more uniform, with very little shared data on actually occurring 3-grams: most of them occur only once.

word freq
thanks_for_th 51
4 items 20
thank_you_for 18
i_love_you 16
one_of_th 15
2 items 14
3 items 13
a_lot_of 12
5 items 11
3 items 10
12 items 9
8 items 8
19 items 7
27 items 6
48 items 5
125 items 4
237 items 3
1308 items 2
47274 items 1
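
For reference, n-gram counts of this kind could be tabulated with a small base-R helper along the following lines; this is a sketch only (the tables above additionally appear to use stemmed tokens, which is omitted here).

count_ngrams <- function(token_lists, n) {
  grams <- unlist(lapply(token_lists, function(tok) {
    if (length(tok) < n) return(character(0))
    sapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = "_"))
  }))
  sort(table(grams), decreasing = TRUE)   # n-gram frequency table
}

tokens <- lapply(corpus, function(doc) strsplit(as.character(doc), " ")[[1]])
bigram_freq  <- count_ngrams(tokens, 2)
trigram_freq <- count_ngrams(tokens, 3)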

Words coverage

We investigated the “Gini curve” for the words, based on the first chart in this section. It appears that 217 unique words (2.2%) cover 50% of all word occurrences in the corpus, while 5022 unique words (51.2%) constitute 90% of all word occurrences.
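
These coverage figures can be derived from the sorted frequency table; a minimal sketch:

# Cumulative share of word occurrences covered by the k most frequent unique words
cum_share <- cumsum(word_freq$freq) / sum(word_freq$freq)
n50 <- which(cum_share >= 0.5)[1]   # unique words needed for 50% coverage (~217)
n90 <- which(cum_share >= 0.9)[1]   # unique words needed for 90% coverage (~5022)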

Words in other languages

The most straightforward way is to check whether the non-English words are present in any other dictionary. For this purpose we applied the “hunspell” package; however, several commonly used words, names and acronyms appear to be missing from the typical en_US dictionary while being present in the others. Nevertheless, there is a quite natural presence of foreign words observed, namely:
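
A sketch of the dictionary check described above; it assumes the de_DE and es_ES hunspell dictionaries are installed on the system (only en_US ships by default), and the exact calls are illustrative rather than the code actually used.

library(hunspell)
words <- word_freq$word
# Words that fail the English check ...
non_english <- words[!hunspell_check(words, dict = dictionary("en_US"))]
# ... checked against other dictionaries
german  <- non_english[hunspell_check(non_english, dict = dictionary("de_DE"))]
spanish <- non_english[hunspell_check(non_english, dict = dictionary("es_ES"))]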

German:

##  [1] "matt"   "los"    "albert" "dem"    "http"   "las"    "allen"  "dir"   
##  [9] "phil"   "tue"    "acta"   "adele"  "andres" "artest" "das"    "fing"  
## [17] "holdem" "reich"  "wer"    "zimmer"

And Spanish:

##   [1] "san"         "vegas"       "cuz"         "florida"     "dallas"     
##   [6] "diego"       "cinco"       "york"        "ron"         "james"      
##  [11] "los"         "usa"         "california"  "carolina"    "dan"        
##  [16] "del"         "indiana"     "las"         "prez"        "seo"        
##  [21] "ala"         "aries"       "bella"       "ben"         "bruce"      
##  [26] "charles"     "cole"        "gilbert"     "leo"         "lisa"       
##  [31] "maya"        "orlando"     "rivera"      "rubio"       "acta"       
##  [36] "ali"         "ama"         "americana"   "arcadia"     "argentina"  
##  [41] "axe"         "ballesteros" "barbados"    "barbara"     "bel"        
##  [46] "bernardo"    "bon"         "bueno"       "caliente"    "capote"     
##  [51] "casa"        "ceda"        "centre"      "cha"         "chacha"     
##  [56] "chica"       "chorizo"     "coa"         "colo"        "colorado"   
##  [61] "conjuguemos" "cumbre"      "das"         "delgado"     "diablo"     
##  [66] "diegos"      "dolo"        "ele"         "espada"      "estela"     
##  [71] "floridas"    "fonda"       "francisco"   "franciscos"  "gagas"      
##  [76] "gaza"        "gigante"     "gis"         "gomer"       "gracias"    
##  [81] "indianas"    "industria"   "jamba"       "jane"        "jazmines"   
##  [86] "junto"       "juvenal"     "latina"      "latinas"     "latino"     
##  [91] "latinos"     "lego"        "legos"       "libros"      "luis"       
##  [96] "lunas"       "mariano"     "mauro"       "maxwell"     "mira"       
## [101] "mirasol"     "mondo"       "mongo"       "monte"       "moro"       
## [106] "musa"        "muslim"      "nazi"        "nene"        "nicaragua"  
## [111] "nidal"       "nin"         "ocho"        "olas"        "pamela"     
## [116] "picos"       "poste"       "puerto"      "rajan"       "ramones"    
## [121] "ria"         "romano"      "santa"       "santiago"    "santos"     
## [126] "sean"        "soso"        "sudan"       "sus"         "tatas"      
## [131] "tıme"        "titos"       "todo"        "toro"        "tosa"       
## [136] "ufos"        "usas"        "uta"         "vargas"      "verde"      
## [141] "vid"         "yaya"

The process could be refined by excluding names, explicit misspellings and abbreviations, and perhaps by referring to other dictionaries while accounting for frequency counts.

Increasing the coverage

One possible way to “manually” increase the coverage is to replace words in the corpus with their synonyms, keeping just the most frequent one. Here are the synonyms for the top words (nouns):

word syn freq
love love,dear,passion,honey,beloved 231
good good,goodness 206
one one,single,ace 184
can can,ass,behind,john,butt,toilet,seat,bottom,potty,throne,tin,bathroom,pot,stern 169
time time,sentence 163
back back,cover 133
need need,want,demand,motivation 109
make make,brand 105
people people,mass 104
going going,release,loss,exit,leaving,passing,expiration 98
want need,want,wish,wishing,lack 98
night night,dark 88
come come,seed 86
hope hope,promise 86
last last,end,close,death,finish,finale 86
work work,study,workplace 78
life life,living,lifetime,spirit 75
game game,biz 67
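
The replacement step itself could look roughly as follows; the mapping below is a toy subset for illustration, while the real synonym groups would come from a thesaurus lookup that is not shown here (tokens is the token list from the n-gram sketch above).

# Map selected words to a canonical representative and rewrite the tokenised corpus
syn_map <- c(dear = "love", passion = "love", goodness = "good", wish = "want")
replace_synonyms <- function(tok) ifelse(tok %in% names(syn_map), syn_map[tok], tok)
tokens_canonical <- lapply(tokens, replace_synonyms)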

There is also the problem that not all words from the corpus are contained in the standard dictionary, even if we exclude the names, foreign words, typos, etc.

Possible solutions could be:

  • Include all such findings in an extended dictionary

  • Correct typos, using a dedicated typo dictionary or a string-distance measure (see the sketch after this list)

  • Use one of the subject-related dictionaries, e.g. Twitter slang or abbreviations

  • Apply a general “standardization” of the dictionary; the “synonyms” approach is naive, but a more robust variant may be sound (e.g. investigating word variations)
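
For the typo-correction item, a crude sketch using hunspell's suggestions, reusing the words vector from the dictionary check above (taking the first suggestion is a naive heuristic; a frequency-weighted choice would be better):

library(hunspell)
unknown <- words[!hunspell_check(words, dict = dictionary("en_US"))]
suggestions <- hunspell_suggest(unknown, dict = dictionary("en_US"))
corrected <- vapply(suggestions,
                    function(s) if (length(s) > 0) s[1] else NA_character_,
                    character(1))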

Modelling

General simplified approach

Basic modelling can be done by estimating transition probabilities between words and applying them for prediction. The smoothing applied is the most trivial kind: adding a small value and then normalizing. For the sake of simplicity we provide the code only for the 1-gram to 1-gram transitions.

# Tokenise each document in the corpus into a vector of words
sequences <- lapply(corpus, function(doc) strsplit(as.character(doc), " ")[[1]])

build_transition_matrix <- function(sequences) {
  vocabulary <- rownames(word_freq)
  vocabulary_size <- length(vocabulary)
  transition_matrix <- matrix(0, nrow = vocabulary_size, ncol = vocabulary_size)
  for (sequence in sequences) {
    if (length(sequence) < 2) next
    # Look up the vocabulary index of each word in this sequence
    word_index <- match(sequence, vocabulary)
    for (i in seq_len(length(sequence) - 1)) {
      current_word <- word_index[i]
      next_word <- word_index[i + 1]
      if (is.na(current_word) || is.na(next_word)) next   # skip out-of-vocabulary pairs
      transition_matrix[current_word, next_word] <- transition_matrix[current_word, next_word] + 1
    }
  }
  # Normalise each row into transition probabilities (rows with no observations stay zero)
  transition_matrix <- t(apply(transition_matrix, 1,
                               function(row) if (sum(row) > 0) row / sum(row) else row))
  return(list(vocabulary = vocabulary, transition_matrix = transition_matrix))
}

model <- build_transition_matrix(sequences)
vocabulary <- model$vocabulary
transition_matrix <- model$transition_matrix

predict_next_word <- function(sequence) {
  current_word <- match(sequence[length(sequence)], vocabulary)

  # Fall back to a uniform distribution if the last word is out of vocabulary
  if (is.na(current_word)) {
    next_word_probabilities <- rep(1 / length(vocabulary), length(vocabulary))
  } else {
    next_word_probabilities <- transition_matrix[current_word, ]
  }

  # Additive smoothing: add a small value, then renormalise
  next_word_probabilities <- next_word_probabilities + 0.0001
  next_word_probabilities <- next_word_probabilities / sum(next_word_probabilities)

  next_word <- sample(1:length(vocabulary), size = 1, prob = next_word_probabilities)
  return(vocabulary[next_word])
}

# Example usage: generate predictions for a given sequence
sequence <- c("needs","are")
word_predict <- predict_next_word(sequence)

# Print the predicted next word
print(word_predict)
## [1] "just"

A model for unseen words could be built by assuming the unseen word is a typo, and hence finding the closest string in the vocabulary:

require(stringdist)
sequence <- c("needs", "ar1")
# Replace each word with its closest vocabulary entry by string similarity
sequence <- sapply(sequence, function(seq) vocabulary[which.max(stringsim(seq, vocabulary))])

word_predict <- predict_next_word(sequence)

# Print the predicted next word
print(word_predict)
## [1] "just"
  • How can you efficiently store an n-gram model (think Markov Chains)?

The most efficient way is to use a sparse representation, e.g. Matrix(transition_matrix, sparse = TRUE) from the Matrix package; however, I did not have enough time to implement it in this iteration.
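
A sketch of what that would look like with the Matrix package (most entries of the transition matrix are zero, so sparse storage pays off):

library(Matrix)
transition_sparse <- Matrix(transition_matrix, sparse = TRUE)
object.size(transition_matrix)   # dense double matrix
object.size(transition_sparse)   # typically a small fraction of the dense size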

  • How can you use the knowledge about word frequencies to make your model smaller and more efficient?

Firstly, I can group the words into frequency bins, estimate the transitions between bins, and then implement different submodels, such as frequent-to-frequent, etc.

Then I can group the terms by context, leveraging a simple language model: the transition first selects the next “cluster” and then picks a representative from it. This should be memory-friendly.

Group words by synonyms, reducing the number of distinct words.

Reduce the number of rare n-gram combinations, substituting a random choice instead (simulated using the probabilities derived earlier).
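
A sketch of the frequency-based reduction, building on the coverage numbers computed earlier; the <rare> placeholder token is an illustrative choice:

# Keep only the words needed for ~90% coverage; map the rest to a single placeholder
keep <- word_freq$word[seq_len(n90)]
prune_rare <- function(tok) ifelse(tok %in% keep, tok, "<rare>")
sequences_pruned <- lapply(sequences, prune_rare)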

  • How many parameters do you need (i.e. how big is n in your n-gram model)?

In the 2-gram approach, I need n^2 parameters (the transition probabilities), where n is the vocabulary size; with 9816 unique words this is roughly 9.6 × 10^7 entries. If talking about 2-gram to 2-gram transitions, n_2gram^2 parameters will be needed instead.

  • Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data) ?

Yes, I applied the “bounce from zero” (additive) approach; in a perfect world, smoothing such as linear interpolation could be applied, replacing the probability of, say, a rare 3-gram by a weighted combination of 1- and 2-gram probabilities.
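
A sketch of the linear-interpolation idea; the weights are assumed rather than tuned, and p3, p2, p1 stand for trigram, bigram and unigram probability estimates for the candidate word:

interpolate <- function(p3, p2, p1, lambdas = c(0.6, 0.3, 0.1)) {
  # P(w3 | w1, w2) ~ l3*P3(w3 | w1, w2) + l2*P2(w3 | w2) + l1*P1(w3)
  sum(lambdas * c(p3, p2, p1))
}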

  • How do you evaluate whether your model is any good?

The evaluation could be done by the standard approach: cross-validation using a train/test split.
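
A sketch of such an evaluation using top-1 accuracy on held-out sentences; the split and the metric are illustrative choices, and a proper setup would keep the test lines out of the training corpus:

set.seed(1)
test_idx <- sample(length(sequences), 500)
hits <- vapply(sequences[test_idx], function(tok) {
  if (length(tok) < 2) return(NA)
  predict_next_word(head(tok, -1)) == tail(tok, 1)
}, logical(1))
mean(hits, na.rm = TRUE)   # share of correctly predicted last words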

  • How can you use backoff models to estimate the probability of unobserved n-grams?

If the 3-gram entered by the user is not present in the data, I can either discard its first word and predict from the remaining 2-gram, or predict the third element from the first two words and then predict from the resulting second pair, or from the whole 3-gram if it turns out to be present.
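
A sketch of a simple backoff lookup, under the assumption that trigram_next and bigram_next are hypothetical named lists mapping a context to its most likely continuation (they are not built in this report):

predict_backoff <- function(w1, w2, trigram_next, bigram_next, top_unigram) {
  key <- paste(w1, w2, sep = "_")
  if (key %in% names(trigram_next)) return(trigram_next[[key]])  # seen (w1, w2, *) context
  if (w2 %in% names(bigram_next))   return(bigram_next[[w2]])    # back off to (w2, *)
  top_unigram                                                    # fall back to the most frequent word
}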

Conclusion

We have described the general features of one of the datasets and suggested the most trivial model for the sake of concept evaluation.

Other features, such as extensive testing, scaling, adjustment for rare events, and the 2-/3-gram implementation, will be done in the course of product development.