January 21st, 2020

KATZ BACKOFF MODEL

My word-prediction model is based on the Katz back-off model for trigrams.

Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It does this by backing off to progressively shorter histories whenever the longer history provides too little evidence. In this way, the estimate always comes from the model with the most reliable information about the given history.
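
For trigrams, the back-off rule can be stated as follows (this is the textbook Katz formulation, written in the same notation as above, and not necessarily the exact variant in my code; d is a discount factor and alpha(wi-2, wi-1) is the back-off weight that redistributes the discounted probability mass):

  P(wi | wi-2, wi-1) = d * C(wi-2, wi-1, wi) / C(wi-2, wi-1)    if C(wi-2, wi-1, wi) > 0
  P(wi | wi-2, wi-1) = alpha(wi-2, wi-1) * P(wi | wi-1)         otherwise

where C(.) is the count of an n-gram in the training corpus.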

My model takes the last two words (wi-1 and wi-2) to predict the next word (wi). If only the last word (wi-1) is available, the model backs off to it and keeps working. It then reports the candidate next words (wi) ranked by their probabilities.
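
As an illustration, here is a minimal sketch of this back-off lookup in R. The table names (trigrams, bigrams) and their columns (w1, w2, w3, freq) are illustrative assumptions, and this simplified version ranks candidates by plain relative frequency instead of the discounted Katz estimates:

  # Hypothetical frequency tables:
  #   trigrams: data.frame with columns w1, w2, w3, freq
  #   bigrams:  data.frame with columns w1, w2, freq
  predict_word <- function(w1, w2, trigrams, bigrams, n = 4) {
    # Try the full trigram history (wi-2, wi-1) first
    hits <- trigrams[trigrams$w1 == w1 & trigrams$w2 == w2, ]
    if (nrow(hits) > 0) {
      hits$prob <- hits$freq / sum(hits$freq)
      hits <- hits[order(-hits$prob), ]
      return(head(hits[, c("w3", "prob")], n))
    }
    # Back off to the bigram history (wi-1 only)
    hits <- bigrams[bigrams$w1 == w2, ]
    if (nrow(hits) > 0) {
      hits$prob <- hits$freq / sum(hits$freq)
      hits <- hits[order(-hits$prob), ]
      names(hits)[names(hits) == "w2"] <- "w3"
      return(head(hits[, c("w3", "prob")], n))
    }
    NULL  # nothing observed for this history
  }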

HOW THE PREDICTIVE MODEL WORKS

First, I created my model using 1% of the dataset (the training set). Then I took 100 samples from the dataset (the test set), keeping only samples whose trigram and bigram frequencies in the dataset are higher than a cutoff of 10, 50 or 70 occurrences (one experiment per cutoff).
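
A toy version of this split and frequency filter in R (assuming the raw corpus is a character vector called lines; the tokenisation here is deliberately naive and, for brevity, ignores line boundaries):

  set.seed(123)
  # Keep 1% of the corpus lines as the training set
  train <- sample(lines, size = ceiling(0.01 * length(lines)))

  # Build a trigram frequency table from the training text
  tokens <- unlist(strsplit(tolower(train), "\\s+"))
  tri <- data.frame(w1 = head(tokens, -2),
                    w2 = head(tail(tokens, -1), -1),
                    w3 = tail(tokens, -2),
                    stringsAsFactors = FALSE)
  trigrams <- aggregate(list(freq = rep(1, nrow(tri))), by = tri, FUN = sum)

  # Sample 100 test trigrams above the frequency cutoff (10, 50 or 70)
  pool <- trigrams[trigrams$freq > 50, ]
  test_set <- pool[sample(nrow(pool), min(100, nrow(pool))), ]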

I ran my model to predict the word (wi) given the history (wi-1 and wi-2) and compared each prediction against the actual word, obtaining the following results (a sketch of this evaluation loop follows the list):

  • 100 samples (trigram & bigram frequency > 10): right words = 10, wrong words = 90
  • 100 samples (trigram & bigram frequency > 50): right words = 40, wrong words = 60
  • 100 samples (trigram & bigram frequency > 70): right words = 46, wrong words = 54
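
The comparison itself can be sketched like this, reusing predict_word() and test_set from the sketches above (a bigram table called bigrams is assumed to exist as well):

  # Compare the model's top prediction against the held-out word (wi)
  right <- 0
  for (i in seq_len(nrow(test_set))) {
    pred <- predict_word(test_set$w1[i], test_set$w2[i], trigrams, bigrams, n = 1)
    if (!is.null(pred) && pred$w3[1] == test_set$w3[i]) right <- right + 1
  }
  cat("Right words:", right, "Wrong words:", nrow(test_set) - right, "\n")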

SUMMARY OF MY N-GRAM MODEL

The higher the frequency of the n-grams used to predict the word (wi), the higher the accuracy of the model.

I have also observed that the language is completely different depending on whether the data come from Twitter or from news. Therefore, if we want to increase accuracy, we should differentiate by data source: one word-prediction model for Twitter, another for blogs, and another for news. That way we can increase the performance of the model.

MY PREDICTION MODEL SHINY APP

I have deployed my n-gram prediction model as a Shiny web app. You can find it at the following link: https://osreama.shinyapps.io/Word_Prediction/

In the “Enter Sentence:” field, type your sentence omitting its last word (wi). The sentence can contain more than two words, but my algorithm only takes the last two (wi-1 and wi-2) to predict the next word (wi).

After pressing the “Show results” button, the plot area shows the four predicted words with the highest probabilities, and the whole sentence appears in the “Predicted Word with highest probability:” field.

MY PREDICTION MODEL SHINY APP (II)

If you only want to see the probabilities for specific words, enable the “Select specific word prediction” checkbox and type them into the Word1, Word2, Word3 and Word4 fields.
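
For reference, here is a minimal sketch of how such a UI could be declared in Shiny. The labels match the app described above, but the input IDs and layout are my guesses, and the server logic that calls the n-gram model is omitted:

  library(shiny)

  ui <- fluidPage(
    textInput("sentence", "Enter Sentence:"),
    actionButton("show", "Show results"),
    plotOutput("top4"),   # bar plot of the 4 most probable next words
    textOutput("best"),   # "Predicted Word with highest probability:"
    checkboxInput("specific", "Select specific word prediction", value = FALSE),
    conditionalPanel(
      condition = "input.specific",
      textInput("word1", "Word1"), textInput("word2", "Word2"),
      textInput("word3", "Word3"), textInput("word4", "Word4")
    )
  )

  server <- function(input, output) {
    # ... look up predictions and render the plot/text outputs here ...
  }

  shinyApp(ui = ui, server = server)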

I hope you enjoy my Shiny app and find it useful.