Building a Markov n-gram Language Model

Leonardo Pinheiro
12/12/2014

Introduction

This presentation goes over the creation of a Shiny app capable of predicting the next word a user will type.

The specifications are as follows:

  • Development of a Shiny app that can predict the next word to be typed.
  • The data source is a corpus of English sentences from Twitter, blogs, and news articles.
  • Prediction is based on Markov n-gram models.

Data Insight

Example of a source sentence and a corresponding 3-gram frequency entry.

twitter[1]
[1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
trigram[1,]
   X0  X1   X2 Frequency
1 way too long        10

Theoretical model

Markov language model: predicts the next word based on the last seen n-gram.

[Figure: markov]
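
As a compact statement of the model (standard n-gram notation, not taken from the original slides): the probability of the next word is assumed to depend only on the previous n-1 words, and is estimated from observed counts C:

P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1} \ldots w_i)}{C(w_{i-n+1} \ldots w_{i-1})}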

Katz back-off model: uses a lower-order n-gram if the n-gram in question is not observed in the data.
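
In its usual form (standard notation, not shown on the original slide), Katz back-off uses a discounted higher-order estimate when the n-gram was observed, and otherwise backs off to the lower-order model scaled by a back-off weight:

P_{\text{katz}}(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
d \, \dfrac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})} & \text{if } C(w_{i-n+1}^{i}) > 0 \\[1ex]
\alpha(w_{i-n+1}^{i-1}) \, P_{\text{katz}}(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise}
\end{cases}

Here d is a discount factor and \alpha is chosen so the probabilities sum to one.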

Data processing and modeling

  • N-gramization using the Python NLTK package.
  • N-grams are stored in frequency data frames used to estimate probabilities.
  • Data is stored in a SQLite database to reduce memory usage.
  • A function implements the Katz back-off model, checking frequencies for 4-grams, trigrams, bigrams, and unigrams, in that order.
  • The app returns the three most likely words given by the Markov model (a sketch of this pipeline follows the list).
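
The pipeline can be illustrated with a short Python sketch. The function and table names are illustrative, not the original app's code; the frequency-data-frame step is folded directly into the SQLite table, and the back-off is a simplified lookup in the spirit described above (no discounting). NLTK does the n-gramization, SQLite stores the counts, and a back-off loop checks 4-grams down to unigrams, returning the three most frequent continuations it finds.

import sqlite3
from collections import Counter

import nltk
from nltk.util import ngrams

def build_ngram_table(sentences, db_path="ngrams.db", max_n=4):
    """Tokenize sentences and store n-gram frequencies (n = 1..max_n) in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS ngram "
                 "(n INTEGER, context TEXT, word TEXT, freq INTEGER)")
    counts = Counter()
    for sentence in sentences:
        tokens = nltk.word_tokenize(sentence.lower())
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                # context = all but the last word, word = the continuation
                counts[(n, " ".join(gram[:-1]), gram[-1])] += 1
    conn.executemany("INSERT INTO ngram VALUES (?, ?, ?, ?)",
                     [(n, ctx, w, f) for (n, ctx, w), f in counts.items()])
    conn.commit()
    return conn

def predict_next(conn, text, top=3, max_n=4):
    """Back-off lookup: try the longest context first (3 words for 4-grams),
    then shorter ones, and return the `top` most frequent continuations
    found at the first n-gram order that has any match."""
    tokens = nltk.word_tokenize(text.lower())
    for n in range(max_n, 0, -1):
        context = " ".join(tokens[-(n - 1):]) if n > 1 else ""
        rows = conn.execute(
            "SELECT word FROM ngram WHERE n = ? AND context = ? "
            "ORDER BY freq DESC LIMIT ?", (n, context, top)).fetchall()
        if rows:
            return [w for (w,) in rows]
    return []

# Example usage (requires the NLTK 'punkt' tokenizer data):
# conn = build_ngram_table(["way too long", "way too long", "way too far"])
# print(predict_next(conn, "way too"))   # e.g. ['long', 'far']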

Conclusion and improvement considerations

  1. Use more data. The app was built using a sample of 550,000 sentences extracted from the given corpus.

  2. Use user-specific data. The vocabulary of a specific individual could have more predictive power than random text.

  3. Use more advanced models, e.g. linear interpolation (see the formula after this list).

  4. Check out the app and enjoy!
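
For reference on item 3 (standard formulation, not part of the original slides), linear interpolation mixes estimates from all n-gram orders instead of backing off; for the trigram case:

P_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_1 P(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1

The \lambda weights are typically tuned on held-out data.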