Building a Markov n-gram Language Model

Leonardo Pinheiro
12/12/2014

Introduction

This presentation goes over the creation of a Shiny app capable of predicting the next word a user will type.

The specifications are as follows:

  • Development of a Shiny app that can predict the next word to be typed.
  • The data source is a corpus of English sentences from Twitter, blogs, and news articles.
  • Prediction is based on Markov n-gram models.

Data Insight

Example of a source sentence and a corresponding 3-gram frequency entry.

twitter[1]
[1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
trigram[1,]
   X0  X1   X2 Frequency
1 way too long        10

Theoretical model

Markov language model: predicts the next word based on the last seen n-gram.

[Figure: markov]
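
As a compact statement of the model (standard n-gram notation, not taken from the original slides): the probability of the next word is assumed to depend only on the previous n-1 words, and is estimated from observed counts C:

P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1} \ldots w_i)}{C(w_{i-n+1} \ldots w_{i-1})}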

Katz back-off model: uses a lower-order n-gram if the n-gram in question is not observed in the data.
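
In its usual form (standard notation, not shown on the original slide), Katz back-off uses a discounted higher-order estimate when the n-gram was observed, and otherwise backs off to the lower-order model scaled by a back-off weight:

P_{\text{katz}}(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
d \, \dfrac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})} & \text{if } C(w_{i-n+1}^{i}) > 0 \\[1ex]
\alpha(w_{i-n+1}^{i-1}) \, P_{\text{katz}}(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise}
\end{cases}

Here d is a discount factor and \alpha is chosen so the probabilities sum to one.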

Data processing and modeling

  • N-gramization using the Python NLTK package.
  • N-grams are stored in frequency data frames used to estimate probabilities.
  • Data is stored in a SQLite database to reduce memory usage.
  • A function implements the Katz back-off model, checking frequencies for 4-grams, trigrams, bigrams, and unigrams, in that order.
  • The app returns the three most likely words given by the Markov model (a sketch of this pipeline follows the list).
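
The pipeline can be illustrated with a short Python sketch. The function and table names are illustrative, not the original app's code; the frequency-data-frame step is folded directly into the SQLite table, and the back-off is a simplified lookup in the spirit described above (no discounting). NLTK does the n-gramization, SQLite stores the counts, and a back-off loop checks 4-grams down to unigrams, returning the three most frequent continuations it finds.

import sqlite3
from collections import Counter

import nltk
from nltk.util import ngrams

def build_ngram_table(sentences, db_path="ngrams.db", max_n=4):
    """Tokenize sentences and store n-gram frequencies (n = 1..max_n) in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS ngram "
                 "(n INTEGER, context TEXT, word TEXT, freq INTEGER)")
    counts = Counter()
    for sentence in sentences:
        tokens = nltk.word_tokenize(sentence.lower())
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                # context = all but the last word, word = the continuation
                counts[(n, " ".join(gram[:-1]), gram[-1])] += 1
    conn.executemany("INSERT INTO ngram VALUES (?, ?, ?, ?)",
                     [(n, ctx, w, f) for (n, ctx, w), f in counts.items()])
    conn.commit()
    return conn

def predict_next(conn, text, top=3, max_n=4):
    """Back-off lookup: try the longest context first (3 words for 4-grams),
    then shorter ones, and return the `top` most frequent continuations
    found at the first n-gram order that has any match."""
    tokens = nltk.word_tokenize(text.lower())
    for n in range(max_n, 0, -1):
        context = " ".join(tokens[-(n - 1):]) if n > 1 else ""
        rows = conn.execute(
            "SELECT word FROM ngram WHERE n = ? AND context = ? "
            "ORDER BY freq DESC LIMIT ?", (n, context, top)).fetchall()
        if rows:
            return [w for (w,) in rows]
    return []

# Example usage (requires the NLTK 'punkt' tokenizer data):
# conn = build_ngram_table(["way too long", "way too long", "way too far"])
# print(predict_next(conn, "way too"))   # e.g. ['long', 'far']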

Conclusion and improvement considerations

  1. Use more data. The app was built using a sample of 550,000 sentences extracted from the given corpus.

  2. Use user-specific data. The vocabulary of a specific individual could have more predictive power than random text.

  3. Use more advanced models, e.g. linear interpolation (see the formula after this list).

  4. Check out the app and enjoy!
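
For reference on item 3 (standard formulation, not part of the original slides), linear interpolation mixes estimates from all n-gram orders instead of backing off; for the trigram case:

P_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_1 P(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1

The \lambda weights are typically tuned on held-out data.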