Guillaume Polet
28/12/2018
This deck of slides presents the "Next Word Prediction" model and Shiny app. The model is built on the English-language data sets "Blogs", "News" and "Twitter" provided by the Coursera Johns Hopkins University teaching team. The model is part of the capstone project of the Data Science Specialization on Coursera and consists of building a predictive model in the field of Natural Language Processing, together with a front-end Shiny application.
What the data product does:
Front-End:
Based on the phrase or word the user types in, the app displays the 5 most likely next words.
Back-End:
Based on the input phrase, the model (explained later in the slides) computes the probability of each candidate next word and retrieves the most likely ones.
The modeling piece is based on Katz's back-off trigram algorithm, whose main idea is to estimate the conditional probability of a word given its history:
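In the standard trigram formulation (notation assumed here, not taken from the original slides), the Katz estimate can be written as:

$$
P_{\text{Katz}}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
d_{w_{i-2} w_{i-1} w_i}\, \dfrac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})} & \text{if } C(w_{i-2} w_{i-1} w_i) > 0 \\[1.5ex]
\alpha(w_{i-2} w_{i-1})\, P_{\text{Katz}}(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}
$$

where $C(\cdot)$ is a count in the training data, $d$ is the discount factor and $\alpha$ is the back-off weight that redistributes the discounted probability mass to the lower-order model.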
The process in a nutshell:
Based on the n-gram tables, the algorithm first tries to predict the next word from the longest possible history. If that history is not found in the data used to train the model, the algorithm falls back on a shorter history, and if necessary on shorter and shorter histories. Predictions are therefore made from histories of different lengths, and the probabilities of these predictions are weighted by the Katz back-off alpha. A prediction based on a shorter history can end up with a higher Katz probability than one based on a longer history. The prediction thus rests on a Katz back-off method built on a 3-gram model.
How does it provide a prediction? It uses the last 2, 1 or 0 words of the input (depending on how many words the user has typed in). If the most recent 2-word sequence is not found in the highest-order n-gram table (the trigram table), the model looks for the most recent 1-word sequence in the lower-level 2-gram table, and so forth; a minimal sketch of this lookup is shown below.
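The sketch below illustrates the back-off lookup under stated assumptions: it supposes hypothetical n-gram tables (data frames `trigrams`, `bigrams`, `unigrams`) with columns `history`, `word` and `katz_prob`. These names are illustrative and not taken from the actual app code.

```r
# Minimal sketch of the back-off lookup (assumed table and column names).
predict_next <- function(input, trigrams, bigrams, unigrams, n = 5) {
  tokens <- tolower(unlist(strsplit(trimws(input), "\\s+")))

  # Try the longest available history first (last 2 words), then back off.
  if (length(tokens) >= 2) {
    hist2 <- paste(tail(tokens, 2), collapse = " ")
    hits  <- trigrams[trigrams$history == hist2, ]
    if (nrow(hits) > 0) return(head(hits[order(-hits$katz_prob), "word"], n))
  }
  if (length(tokens) >= 1) {
    hist1 <- tail(tokens, 1)
    hits  <- bigrams[bigrams$history == hist1, ]
    if (nrow(hits) > 0) return(head(hits[order(-hits$katz_prob), "word"], n))
  }

  # No usable history: fall back to the most frequent unigrams.
  head(unigrams[order(-unigrams$katz_prob), "word"], n)
}
```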
In addition, Good-Turing smoothing with k = 5 is used to estimate the frequency of a word given its history in the data used to build the algorithm. This estimate is called the Good-Turing frequency.
What does it consist of? The Good-Turing algorithm shifts some probability mass from n-grams that appear rarely in the data (at most k = 5 times) to n-grams that appear zero times. This process is called smoothing of probabilities in the NLP literature. As a result, the Good-Turing frequency is always equal to or smaller than the observed frequency for words that appear at least once in the data used to build the model. As a natural next step, conditional probabilities of a word given its history are calculated from the Good-Turing frequencies. These probabilities are labelled "Good-Turing Prob" in the tables provided on the Main and Details tabs. A word that is not predicted by a higher-order n-gram but is predicted by a lower-order n-gram is further discounted by the Katz alpha to derive its "Katz Prob". All words predicted by the 4-, 3-, 2- and 1-gram tables are then sorted by Katz Prob.
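The simplified sketch below shows the Good-Turing count adjustment for counts up to k = 5. It omits the renormalization term of Katz's full formulation, and the function and variable names are illustrative assumptions rather than the app's actual code.

```r
# Simplified Good-Turing discounting for counts 1..k (here k = 5).
# `counts` is a vector of observed n-gram counts; Nc[c] is the number of
# n-grams observed exactly c times (frequency of frequencies).
good_turing_counts <- function(counts, k = 5) {
  Nc <- tabulate(counts)                      # Nc[c] = n-grams seen exactly c times
  sapply(counts, function(c) {
    if (c > k || c + 1 > length(Nc) || Nc[c] == 0 || Nc[c + 1] == 0) {
      c                                       # counts above k are left undiscounted
    } else {
      (c + 1) * Nc[c + 1] / Nc[c]             # adjusted count c*, always <= c in practice
    }
  })
}
```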
(The description of the process is inspired by N. Dobrinov.)
The shiny app is available here: https://gpol93.shinyapps.io/capstoneProject/