NGram Predictive Modelling

Siyang Ni

Overview

This project is the capstone project for the Data Science Specialization from Johns Hopkins University. In it, I demonstrate how to build a simple model that predicts the next word based on the previous one or more words.

NGram Algorithm

N-gram models are simple language models that assign probabilities to sequences of words, and they are a common approach to language modeling. The “n” in n-gram stands for the number of words in the sequence and can take any value: a 1-gram is a single word, a 2-gram is a two-word sequence, and a 3-gram is a three-word sequence. For example:

  • 2-gram: “This is” or “is a”

  • 3-gram: “is a great”

  • 4-gram: “She stood up slowly”
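As a quick illustration, n-grams can be extracted with a few lines of base R. This is a minimal sketch; the helper name ngrams() is mine for illustration, not part of the app’s code.

  # Extract all n-word sequences from a text (illustrative helper)
  ngrams <- function(text, n) {
    words <- strsplit(tolower(text), "\\s+")[[1]]
    if (length(words) < n) return(character(0))
    sapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "))
  }

  ngrams("She stood up slowly", 2)
  # [1] "she stood"  "stood up"   "up slowly"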

What I Have Done

  1. I sampled a small subset of the data (length = 200), for two reasons:
    1. This is a demonstration app; n-gram models are out of date in a production environment.
    2. A corpus length beyond 200 results in out-of-memory errors when running the app on the Shiny server free tier.
  2. I cleaned the corpus (both steps are sketched below).
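Concretely, the two steps above look roughly like the sketch below. The file name en_US.blogs.txt and the fixed seed are assumptions for illustration; the app’s actual code may differ.

  set.seed(42)                       # assumed seed, for reproducibility only
  raw    <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
  corpus <- sample(raw, 200)         # small sample; see the reasons above

  clean <- tolower(corpus)                   # lowercase everything
  clean <- gsub("[^a-z' ]", " ", clean)      # drop punctuation and digits
  clean <- gsub("\\s+", " ", trimws(clean))  # collapse repeated whitespace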

What I Have Done (Cont.)

  1. I tokenized the corpus and then created unigrams, bigrams, and trigrams.
  2. I built a simple Markov model in which the probability of the next word is determined entirely by the previous word(s). The Markov model has a backoff wrapper: the number of contextual words used is decided by the number of words in the user’s input (see the sketch after this list).
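Here is a minimal sketch of these two steps, reusing clean and the ngrams() helper from the earlier slides: frequency tables for bigrams and trigrams, plus a lookup that backs off from a trigram context to a bigram context and returns up to three candidate next words. The names are illustrative, not the app’s actual source.

  text     <- paste(clean, collapse = " ")
  bigrams  <- table(ngrams(text, 2))   # counts of two-word sequences
  trigrams <- table(ngrams(text, 3))   # counts of three-word sequences

  predict_next <- function(input, n_context = 2) {
    words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
    words <- tail(words, n_context)            # keep at most n_context words
    while (length(words) > 0) {
      tab    <- if (length(words) == 2) trigrams else bigrams
      prefix <- paste0("^", paste(words, collapse = " "), " ")
      hits   <- tab[grepl(prefix, names(tab))]
      if (length(hits) > 0) {
        top <- names(sort(hits, decreasing = TRUE))[seq_len(min(3, length(hits)))]
        return(vapply(strsplit(top, " "), tail, character(1), n = 1))
      }
      words <- words[-1]                       # back off to a shorter context
    }
    "No match found"                           # context never seen in training
  }

  predict_next("she stood", n_context = 2)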

How It Performs

The app runs well and is simple to navigate. You can toggle the number of contextual words and enter your own words or sentences on the left-hand side; the right-hand side gives you the three possible predictions and also tells you how many contextual words were actually used for the prediction.
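For context, a Shiny layout mirroring this description might look like the sketch below. All widget IDs and labels here are hypothetical, not the app’s actual source.

  library(shiny)

  ui <- fluidPage(
    sidebarLayout(
      sidebarPanel(
        sliderInput("n_context", "Contextual words", min = 1, max = 2, value = 2),
        textInput("phrase", "Enter a word or phrase")
      ),
      mainPanel(
        textOutput("context_used"),   # how many contextual words were used
        tableOutput("predictions")    # up to three predicted next words
      )
    )
  )

  server <- function(input, output) {
    output$predictions <- renderTable({
      data.frame(prediction = predict_next(input$phrase, input$n_context))
    })
    output$context_used <- renderText({
      paste("Contextual words requested:", input$n_context)
    })
  }

  shinyApp(ui, server)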

How It Performs (Cont.)

The app tells you when no match is found, i.e., when you enter words or phrases that are not in the training data, as shown below. Given the limitations of the platform (the Shiny free tier only accepts extremely small apps), we don’t have many training words. However, as long as the words or sentences you enter appear in the training data, the app does what it is designed to do.

How It Performs (Cont.)

[Screenshot: The App UI]

If You Love NLP

I hope this course and my little demonstration take you into the world of Natural Language Processing (NLP). NLP itself deserves a two-semester course, but this is a good point of departure.

The n-gram model is clever and extremely computationally cheap. It is one of the traditional language modelling algorithms that emphasize analytic visibility (meaning we can mathematically derive how it works) and computational efficiency. However, even if we dramatically increase the training set size and the number of grams, the n-gram model’s performance caps quickly.

If You Love NLP (Cont.)

Nowadays, predictive language models rely on deep learning. Some architectures I recommend learning are:

  • Recurrent Neural Networks (especially LSTMs)

  • Autoencoders

  • Transformers

Lastly, my personal advice is to use Python for anything NLP-related, because you’ll very likely need TensorFlow or PyTorch as you go on in your learning.