Method used
In general, based on the last two words \((w_{i-2}, w_{i-1})\) of the input text, the program predicts the next word \(w^{\prime}_i\) by finding the 3-gram \((w_{i-2}, w_{i-1}, w^{\prime}_i)\) with the highest probability among all candidate words \(w^{\prime}_i\). To compare the probabilities of all possible 3-grams, the model must also handle 3-grams that never appear in the training dataset and assign them reasonable, nonzero probabilities. The method adopted in the app is to build the prediction model on the Katz backoff 3-gram algorithm, which in turn uses Good-Turing discounting to compute the probabilities needed by the backoff step.
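As an illustration of the lookup step only (not the app's actual code), the sketch below scores candidate words from raw trigram counts and backs off with a fixed weight of 0.4 ("stupid backoff" style) instead of the full Katz/Good-Turing machinery described below. The names `corpus`, `score`, and `predict_next` are illustrative.

```python
from collections import Counter

# Toy training corpus; the real app builds these tables from a large dataset.
corpus = "i am sam sam i am i do not like green eggs and ham".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))

def score(prev2, prev1, w):
    """Back off from trigram to bigram to unigram relative frequencies."""
    if trigram_counts[(prev2, prev1, w)] > 0:
        return trigram_counts[(prev2, prev1, w)] / bigram_counts[(prev2, prev1)]
    if bigram_counts[(prev1, w)] > 0:
        return 0.4 * bigram_counts[(prev1, w)] / unigram_counts[prev1]
    return 0.4 * 0.4 * unigram_counts[w] / sum(unigram_counts.values())

def predict_next(prev2, prev1):
    """Return the candidate word with the highest backed-off score."""
    return max(unigram_counts, key=lambda w: score(prev2, prev1, w))

print(predict_next("sam", "i"))  # -> 'am'
```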
Katz backoff model
In the Katz backoff N-gram model, if the count, and hence the probability, of the N-gram we need is zero, we back off the estimation to the (N-1)-gram, i.e. we approximate the probability of the N-gram using the (N-1)-gram probability. This requires first discounting some probability mass from the seen N-grams and then redistributing it to the unseen N-grams through the lower-order model. The discounting algorithm used here is Good-Turing discounting.
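For concreteness, the standard two-level Katz estimate for a 3-gram (following Jurafsky and Martin's notation; \(P^{*}\) is the discounted probability and \(\alpha\) the backoff weight, not symbols taken from the app itself) is

\[
P_{\mathrm{katz}}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
P^{*}(w_i \mid w_{i-2}, w_{i-1}), & \text{if } C(w_{i-2}\,w_{i-1}\,w_i) > 0,\\
\alpha(w_{i-2}, w_{i-1})\,P_{\mathrm{katz}}(w_i \mid w_{i-1}), & \text{otherwise,}
\end{cases}
\]

where \(\alpha\) is chosen so that the backed-off probabilities of the unseen 3-grams exactly absorb the mass discounted from the seen ones.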
Good-Turing discounting
The intuition behind Good-Turing is to estimate the count of unseen N-grams from the count of N-grams that appear exactly once. More details can be found in Chapter 4 of Speech and Language Processing, 2nd ed., by Dan Jurafsky and James H. Martin.
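In outline: if \(N_c\) denotes the number of distinct N-grams observed exactly \(c\) times, Good-Turing replaces each raw count \(c\) with the adjusted count

\[
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_c},
\]

and the total probability mass reserved for unseen N-grams is \(N_1 / N\), where \(N\) is the total number of N-gram tokens observed. In other words, the mass given to unseen events equals the relative frequency of events seen exactly once.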