This is the first milestone report in the Data Science Capstone project. The project is about predicting the next word from the context entered by the user. The goal of this document is to convey the results of the exploratory analysis of the datasets provided along with the assignment, and to formulate an approach to modelling the prediction. The latter will be used in the development of an app delivering real-time predictions.
To limit the scope and reduce runtime, we restricted the input source to Twitter, with the locale fixed to en_US, randomly sampling 5000 lines of the source corpus.
The first step is to construct a corpus using the tm package.
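A minimal sketch of the corpus construction is shown below, assuming the raw Twitter file is available locally as en_US.twitter.txt; the file path, the random seed and the exact cleaning steps are illustrative:

```r
library(tm)

# read the raw tweets and randomly sample 5000 lines (file path is an assumption)
lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
set.seed(1234)
sample_lines <- sample(lines, 5000)

# build the corpus and apply standard tm cleaning transformations
corpus <- VCorpus(VectorSource(sample_lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
```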
The total number of lines is 5000, containing about 4.7945 × 10^4 (≈ 47,945) words, among which 9816 are unique.
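These counts can be obtained from a term-document matrix; a sketch is given below (the object name word_freq is chosen to match the one referenced in the modelling code later in this report):

```r
# term-document matrix over the cleaned corpus
tdm <- TermDocumentMatrix(corpus)

# total frequency of each term across all tweets, sorted in decreasing order
# (slam is a dependency of tm, so row_sums is available)
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = freq)
rownames(word_freq) <- names(freq)

length(freq)   # number of unique words
sum(freq)      # total number of words
```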
Short descriptive statistics for the word frequencies and the most frequent words are as follows.

| word | freq | word | freq | word | freq | word | freq | word | freq | word | freq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the | 1915 | from | 180 | still | 95 | 4 items | 57 | 8 items | 29 | 6147 items | 1 |
| you | 1151 | today | 175 | thank | 90 | looking | 56 | 9 items | 28 | ||
| and | 886 | know | 174 | had | 89 | 2 items | 55 | 9 items | 27 | ||
| for | 799 | can | 169 | night | 88 | 2 items | 54 | 10 items | 26 | ||
| that | 533 | time | 163 | 2 items | 87 | 4 items | 53 | 11 items | 25 | ||
| with | 397 | how | 162 | 3 items | 86 | 3 items | 52 | 10 items | 24 | ||
| your | 370 | now | 157 | been | 85 | 2 items | 51 | 9 items | 23 | ||
| have | 366 | new | 154 | best | 83 | 3 items | 50 | 14 items | 22 | ||
| this | 340 | great | 153 | tonight | 82 | 5 items | 49 | 14 items | 21 | ||
| just | 319 | see | 152 | she | 81 | 3 items | 48 | 16 items | 20 | ||
| but | 305 | our | 148 | 2 items | 80 | 3 items | 47 | 18 items | 19 | ||
| are | 297 | they | 140 | 3 items | 79 | 5 items | 46 | 19 items | 18 | ||
| its | 281 | would | 136 | work | 78 | 4 items | 45 | 15 items | 17 | ||
| all | 279 | lol | 134 | off | 77 | 8 items | 44 | 22 items | 16 | ||
| out | 258 | back | 133 | life | 75 | 5 items | 43 | 25 items | 15 | ||
| what | 245 | there | 129 | then | 73 | 4 items | 42 | 38 items | 14 | ||
| get | 241 | too | 128 | 3 items | 71 | 4 items | 41 | 45 items | 13 | ||
| like | 240 | some | 122 | 2 items | 70 | 7 items | 40 | 38 items | 12 | ||
| 2 items | 238 | more | 117 | should | 69 | 3 items | 39 | 49 items | 11 | ||
| love | 231 | cant | 116 | 2 items | 68 | 4 items | 38 | 60 items | 10 | ||
| good | 206 | happy | 115 | 2 items | 67 | 4 items | 37 | 64 items | 9 | ||
| thanks | 204 | need | 109 | 3 items | 66 | 3 items | 36 | 89 items | 8 | ||
| will | 192 | 3 items | 108 | 3 items | 64 | 7 items | 35 | 122 items | 7 | ||
| about | 191 | 2 items | 105 | their | 63 | 2 items | 34 | 174 items | 6 | ||
| day | 189 | people | 104 | 3 items | 62 | 9 items | 33 | 239 items | 5 | ||
| when | 188 | who | 101 | way | 61 | 9 items | 32 | 353 items | 4 | ||
| dont | 185 | 2 items | 100 | 3 items | 60 | 4 items | 31 | 608 items | 3 | ||
| one | 184 | 4 items | 98 | 2 items | 58 | 7 items | 30 | 1348 items | 2 |
There is a theoretical model behind these statistics (shown as a solid line in the rank-frequency plot): as the modelling shows, the word frequencies can be approximated by the power function count ≈ 1902.51 * word_rank^(-0.73).
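The coefficients of such a power function can be estimated, for example, with a nonlinear least-squares fit of frequency against rank; a sketch, with illustrative starting values:

```r
# rank-frequency data from the sorted frequency table
rank_freq <- data.frame(rank = seq_along(word_freq$freq), freq = word_freq$freq)

# fit count ~ a * rank^b by nonlinear least squares
fit <- nls(freq ~ a * rank^b, data = rank_freq, start = list(a = 2000, b = -0.7))
coef(fit)   # approximately a = 1902.51, b = -0.73 on this sample, as reported above
```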
A similar frequency table for the 2-grams shows a much flatter distribution (note that some word endings appear truncated in the n-gram tokens):

| word | freq | word | freq | word | freq |
|---|---|---|---|---|---|
| in_th | 155 | 5 items | 35 | ||
| for_th | 152 | 4 items | 33 | ||
| of_th | 116 | 3 items | 32 | ||
| to_b | 101 | 3 items | 31 | ||
| 2 items | 98 | this_i | 30 | ||
| i_hav | 91 | 3 items | 29 | ||
| at_th | 90 | 4 items | 27 | ||
| thank_you | 77 | 7 items | 26 | ||
| i_lov | 75 | 6 items | 25 | ||
| have_a | 72 | 7 items | 24 | ||
| to_th | 70 | 8 items | 23 | ||
| to_get | 63 | 6 items | 22 | ||
| to_se | 62 | 11 items | 21 | ||
| 2 items | 59 | 10 items | 20 | ||
| for_a | 58 | 11 items | 19 | ||
| 2 items | 57 | 17 items | 18 | ||
| 2 items | 56 | 16 items | 17 | ||
| if_you | 55 | 18 items | 16 | ||
| 2 items | 54 | 18 items | 15 | ||
| 2 items | 53 | 20 items | 14 | ||
| i_just | 49 | 33 items | 13 | ||
| 3 items | 47 | 34 items | 12 | ||
| 3 items | 46 | 52 items | 11 | ||
| 3 items | 45 | 69 items | 10 | ||
| 3 items | 44 | 68 items | 9 | ||
| 2 items | 43 | 98 items | 8 | ||
| 2 items | 42 | 116 items | 7 | ||
| 4 items | 41 | 190 items | 6 | ||
| 2 items | 40 | 268 items | 5 | ||
| 3 items | 39 | 508 items | 4 | ||
| it_wa | 38 | 999 items | 3 | ||
| 2 items | 37 | 2948 items | 2 | ||
| 3 items | 36 | 32509 items | 1 |
Obviously, for the 3-grams the distribution is even more uniform, with very little repetition among the actually occurring 3-grams (most appear only once):
| word | freq | word | freq |
|---|---|---|---|
| thanks_for_th | 51 | ||
| 4 items | 20 | ||
| thank_you_for | 18 | ||
| i_love_you | 16 | ||
| one_of_th | 15 | ||
| 2 items | 14 | ||
| 3 items | 13 | ||
| a_lot_of | 12 | ||
| 5 items | 11 | ||
| 3 items | 10 | ||
| 12 items | 9 | ||
| 8 items | 8 | ||
| 19 items | 7 | ||
| 27 items | 6 | ||
| 48 items | 5 | ||
| 125 items | 4 | ||
| 237 items | 3 | ||
| 1308 items | 2 | ||
| 47274 items | 1 |
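The bigram and trigram counts above can be reproduced with a simple base-R tokenizer; the exact tokenizer used for the tables is not shown (and appears to truncate some word endings), so the following is only a sketch producing underscore-joined n-grams from the cleaned corpus:

```r
# join every run of n consecutive tokens with "_" (matching the tables above)
make_ngrams <- function(texts, n) {
  unlist(lapply(strsplit(texts, "\\s+"), function(tokens) {
    tokens <- tokens[tokens != ""]
    if (length(tokens) < n) return(character(0))
    sapply(seq_len(length(tokens) - n + 1), function(i) {
      paste(tokens[i:(i + n - 1)], collapse = "_")
    })
  }))
}

cleaned <- unlist(lapply(corpus, as.character))
bigram_freq  <- sort(table(make_ngrams(cleaned, 2)), decreasing = TRUE)
trigram_freq <- sort(table(make_ngrams(cleaned, 3)), decreasing = TRUE)
head(bigram_freq, 3)
```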
We investigated the Gini (Lorenz) curve for the words, based on the word frequency distribution shown above. It appears that 217 unique words (2.2%) cover 50% of all word occurrences in the corpus, while 5022 unique words (51.2%) constitute 90% of all words.
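These coverage figures can be computed directly from the sorted frequencies in the word_freq table; a minimal sketch:

```r
# cumulative share of all word occurrences covered by the top-ranked words
coverage <- cumsum(word_freq$freq) / sum(word_freq$freq)

# number of unique words needed to cover 50% and 90% of all occurrences
which(coverage >= 0.5)[1]   # 217 words in this sample
which(coverage >= 0.9)[1]   # 5022 words in this sample
```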
The most straightforward way to detect foreign words is to check the non-English words for presence in other language dictionaries. For this purpose we applied the hunspell package; however, several commonly used words, names and acronyms appear to be missing from the typical en dictionary while being present in others. Nevertheless, a quite natural presence of foreign words is observed, namely:
German:
## [1] "matt" "los" "albert" "dem" "http" "las" "allen" "dir"
## [9] "phil" "tue" "acta" "adele" "andres" "artest" "das" "fing"
## [17] "holdem" "reich" "wer" "zimmer"
And Spanish:
## [1] "san" "vegas" "cuz" "florida" "dallas"
## [6] "diego" "cinco" "york" "ron" "james"
## [11] "los" "usa" "california" "carolina" "dan"
## [16] "del" "indiana" "las" "prez" "seo"
## [21] "ala" "aries" "bella" "ben" "bruce"
## [26] "charles" "cole" "gilbert" "leo" "lisa"
## [31] "maya" "orlando" "rivera" "rubio" "acta"
## [36] "ali" "ama" "americana" "arcadia" "argentina"
## [41] "axe" "ballesteros" "barbados" "barbara" "bel"
## [46] "bernardo" "bon" "bueno" "caliente" "capote"
## [51] "casa" "ceda" "centre" "cha" "chacha"
## [56] "chica" "chorizo" "coa" "colo" "colorado"
## [61] "conjuguemos" "cumbre" "das" "delgado" "diablo"
## [66] "diegos" "dolo" "ele" "espada" "estela"
## [71] "floridas" "fonda" "francisco" "franciscos" "gagas"
## [76] "gaza" "gigante" "gis" "gomer" "gracias"
## [81] "indianas" "industria" "jamba" "jane" "jazmines"
## [86] "junto" "juvenal" "latina" "latinas" "latino"
## [91] "latinos" "lego" "legos" "libros" "luis"
## [96] "lunas" "mariano" "mauro" "maxwell" "mira"
## [101] "mirasol" "mondo" "mongo" "monte" "moro"
## [106] "musa" "muslim" "nazi" "nene" "nicaragua"
## [111] "nidal" "nin" "ocho" "olas" "pamela"
## [116] "picos" "poste" "puerto" "rajan" "ramones"
## [121] "ria" "romano" "santa" "santiago" "santos"
## [126] "sean" "soso" "sudan" "sus" "tatas"
## [131] "tıme" "titos" "todo" "toro" "tosa"
## [136] "ufos" "usas" "uta" "vargas" "verde"
## [141] "vid" "yaya"
The process could be refined by excluding names, explicit misspellings and abbreviations, and perhaps by referring to other dictionaries, taking the frequency counts into account.
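A sketch of the dictionary check with hunspell is shown below; it assumes the de_DE and es_ES dictionaries are installed on the system (dictionary availability varies by platform):

```r
library(hunspell)

words <- rownames(word_freq)

# words not recognised by the default en_US dictionary
non_english <- words[!hunspell_check(words)]

# of those, the words accepted by the German and Spanish dictionaries
german  <- non_english[hunspell_check(non_english, dict = dictionary("de_DE"))]
spanish <- non_english[hunspell_check(non_english, dict = dictionary("es_ES"))]
```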
One possible way to “manually” increase the coverage is to replace words in the corpus with their synonyms, keeping just the most frequent representative (see the sketch after the table below). Here are the synonym sets (nouns) for the top words:
| word | synonyms | freq |
|---|---|---|
| love | love,dear,passion,honey,beloved | 231 |
| good | good,goodness | 206 |
| one | one,single,ace | 184 |
| can | can,ass,behind,john,butt,toilet,seat,bottom,potty,throne,tin,bathroom,pot,stern | 169 |
| time | time,sentence | 163 |
| back | back,cover | 133 |
| need | need,want,demand,motivation | 109 |
| make | make,brand | 105 |
| people | people,mass | 104 |
| going | going,release,loss,exit,leaving,passing,expiration | 98 |
| want | need,want,wish,wishing,lack | 98 |
| night | night,dark | 88 |
| come | come,seed | 86 |
| hope | hope,promise | 86 |
| last | last,end,close,death,finish,finale | 86 |
| work | work,study,workplace | 78 |
| life | life,living,lifetime,spirit | 75 |
| game | game,biz | 67 |
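A sketch of the replacement step itself, assuming a lookup table syn_map that maps each synonym to the most frequent representative of its set (the entries shown are illustrative, taken from the table above):

```r
# illustrative mapping: synonym -> most frequent representative
syn_map <- c(dear = "love", beloved = "love", goodness = "good",
             wish = "want", lack = "want", lifetime = "life")

replace_synonyms <- function(tokens, syn_map) {
  replaced <- syn_map[tokens]
  # keep the original token where no synonym mapping exists
  unname(ifelse(is.na(replaced), tokens, replaced))
}

replace_synonyms(c("my", "beloved", "goodness"), syn_map)   # returns "my" "love" "good"
```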
There is also the problem that not all the words from the corpus are contained in the standard dictionary, even if we exclude the names, foreign words, typos, etc.
The possible solutions could be:

- Include all such findings in an extended dictionary.
- Correct typos, using a relevant typo dictionary or a string-distance measure.
- Use a subject-related dictionary, e.g. Twitter slang or abbreviations.
- Apply a general “standardization” of the dictionary; the “synonyms” approach is naive, but a more robust variant may be sound (e.g. investigating word variations).
Basic modelling could be done through the estimated word-to-word transition probabilities. Smoothing is applied in the most trivial way: adding a small value and then normalizing. For the sake of simplicity, we provide the code only for the 1-word-to-1-word transitions.
sequences <- lapply(corpus, function(doc) strsplit(as.character(doc), " ")[[1]])

build_transition_matrix <- function(sequences) {
  # the vocabulary comes from the word frequency table computed earlier
  vocabulary <- rownames(word_freq)
  vocabulary_size <- length(vocabulary)
  transition_matrix <- matrix(0, nrow = vocabulary_size, ncol = vocabulary_size)
  for (sequence in sequences) {
    # map the words of the current sequence to vocabulary indices
    word_index <- match(sequence, vocabulary)
    if (length(word_index) < 2) next
    for (i in 1:(length(word_index) - 1)) {
      current_word <- word_index[i]
      next_word <- word_index[i + 1]
      # skip pairs containing out-of-vocabulary words
      if (is.na(current_word) || is.na(next_word)) next
      transition_matrix[current_word, next_word] <- transition_matrix[current_word, next_word] + 1
    }
  }
  # normalise each row into transition probabilities (all-zero rows are left as zeros)
  transition_matrix <- t(apply(transition_matrix, 1, function(row) {
    s <- sum(row)
    if (s > 0) row / s else row
  }))
  return(list(vocabulary = vocabulary, transition_matrix = transition_matrix))
}

model <- build_transition_matrix(sequences)
vocabulary <- model$vocabulary
transition_matrix <- model$transition_matrix

predict_next_word <- function(sequence) {
  # look up the last word of the input sequence (unknown words are mapped to the
  # vocabulary beforehand, see the stringdist example below)
  current_word <- match(sequence[length(sequence)], vocabulary)
  next_word_probabilities <- transition_matrix[current_word, ]
  # trivial smoothing: add a small constant and renormalise
  next_word_probabilities <- next_word_probabilities + 0.0001
  next_word_probabilities <- next_word_probabilities / sum(next_word_probabilities)
  # sample the next word according to the smoothed probabilities
  next_word <- sample(1:length(vocabulary), size = 1, prob = next_word_probabilities)
  return(vocabulary[next_word])
}

# Example usage: generate a prediction for a given sequence
sequence <- c("needs", "are")
word_predict <- predict_next_word(sequence)
# Print the predicted next word
print(word_predict)
## [1] "just"
The model for unseen words (n-grams) could be built by assuming the input is a typo, hence finding the closest string among the vocabulary entries:
require(stringdist)

# map each unseen / misspelled word to the closest word in the vocabulary
sequence <- c("needs", "ar1")
sequence <- sapply(sequence, function(seq) {
  similarity <- stringsim(seq, vocabulary)
  vocabulary[which(similarity == max(similarity))[1]]
})

word_predict <- predict_next_word(sequence)
# Print the predicted next word
print(word_predict)
## [1] "just"
The most memory-efficient way would be to use Matrix(transition_matrix, sparse = TRUE); however, there was not enough time to implement it in this iteration.
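A minimal sketch of the sparse representation with the Matrix package (the actual memory saving depends on how dense the transition counts are):

```r
library(Matrix)

# store the transition matrix in sparse form; zero entries take no space
sparse_tm <- Matrix(transition_matrix, sparse = TRUE)

# compare the memory footprints of the dense and sparse versions
object.size(transition_matrix)
object.size(sparse_tm)
```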
Firstly, I can group the words by frequency bins and estimate transitions between bins, and then implement different submodels (frequent-to-frequent, etc.).
Then I can group the terms by context, leveraging a simple language model: the transition first selects the next “cluster” and then picks a representative from it. This should be memory-friendly.
I can also group words by synonyms, reducing the vocabulary size.
Finally, I can reduce the number of rare n-gram combinations, replacing them with a random choice (simulated using the probabilities derived earlier).
In the 2-gram approach, I need n^2 parameters, which are the transition probabilities between single words; if talking about 2-gram-to-2-gram transitions, n_2gram^2 parameters will be needed.
As for smoothing, I applied a “bounce from zero” approach; in a perfect world, a smoothing scheme such as linear interpolation could be applied to replace, say, a rare 3-gram with a weighted combination of its 1- and 2-grams.
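For illustration, a sketch of what linear interpolation would look like, assuming unigram, bigram and trigram probability functions p1, p2 and p3 are available (the functions and the weights are hypothetical; the weights would normally be tuned on held-out data and should sum to one):

```r
# interpolated probability of w3 following the context (w1, w2)
interp_prob <- function(w1, w2, w3, p1, p2, p3,
                        lambda = c(0.1, 0.3, 0.6)) {
  lambda[1] * p1(w3) +
    lambda[2] * p2(w2, w3) +
    lambda[3] * p3(w1, w2, w3)
}
```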
The evaluation could be done with the standard approach: cross-validation using a train/test split.
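A sketch of such an evaluation, measuring how often the sampled prediction matches the actual next word on held-out tweets (the 80/20 split and the seed are illustrative; note that predict_next_word reads the global vocabulary and transition_matrix, which are rebuilt here from the training part only):

```r
set.seed(42)
test_idx  <- sample(length(sequences), round(0.2 * length(sequences)))
train_seq <- sequences[-test_idx]
test_seq  <- sequences[test_idx]

# refit the transition model on the training part only
model <- build_transition_matrix(train_seq)
vocabulary <- model$vocabulary
transition_matrix <- model$transition_matrix

# accuracy: share of held-out in-vocabulary word pairs where the prediction matches
hits <- unlist(lapply(test_seq, function(s) {
  if (length(s) < 2) return(NULL)
  w    <- head(s, -1)
  nxt  <- tail(s, -1)
  keep <- w %in% vocabulary & nxt %in% vocabulary
  if (!any(keep)) return(NULL)
  mapply(function(a, b) identical(predict_next_word(a), b), w[keep], nxt[keep])
}))
mean(hits)
```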
If the 3-gram entered by the user is not present in the data, I can either discard its first word and predict from the remaining 2-gram, or predict the third element from the first two words and then predict from the resulting second pair, or from the whole 3-gram if it turns out to be present.
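A sketch of this back-off logic, assuming the underscore-joined bigram_freq and trigram_freq tables from the n-gram sketch earlier; it returns the most frequent continuation of the longest matching context:

```r
# predict the next word from up to the last two words, backing off to shorter contexts
predict_backoff <- function(context, trigram_freq, bigram_freq) {
  n <- length(context)
  if (n >= 2) {
    # try the full two-word context first
    prefix <- paste(context[(n - 1):n], collapse = "_")
    cand <- names(trigram_freq)[startsWith(names(trigram_freq), paste0(prefix, "_"))]
    if (length(cand) > 0) return(sub(".*_", "", cand[which.max(trigram_freq[cand])]))
  }
  # back off to the last word only
  prefix <- context[n]
  cand <- names(bigram_freq)[startsWith(names(bigram_freq), paste0(prefix, "_"))]
  if (length(cand) > 0) return(sub(".*_", "", cand[which.max(bigram_freq[cand])]))
  NA_character_  # no match at any level
}

predict_backoff(c("thanks", "for"), trigram_freq, bigram_freq)   # most likely "the" on this sample
```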
We have described the general features of one of the datasets and suggested the most trivial model for the sake of concept evaluation.
Other features, such as extensive testing, scaling, adjusting for rare events, and the 2-/3-gram implementation, will be addressed in the course of product development.