Next Word Prediction Process

Sateesh Nallamothu
Nov 11, 2017

Problem Description and source data

Natural Language Processing (NLP) relies heavily on machine learning algorithms and statistical learning methods. These algorithms and methods take a large set of 'features' as inputs, generated from a block of input text. To predict the 'next word', we'll use the dataset provided by the SwiftKey corporation. The following are the data characteristics and assumptions.

  • The dataset is a zip file containing blog posts, news articles and Twitter tweets. Here are some statistics about the data/corpus.

    • en_US.twitter.txt - Lines = 2360149, Words = 30373583
    • en_US.blogs.txt - Lines = 899289, Words = 3733413
    • en_US.news.txt - Lines = 77260, Words = 2643969
  • Since the data is too big to process on my laptop, I'm using a random 10% sample of the data from each file (a sampling sketch follows below).
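
As a rough illustration, the sampling step might look like the following Python sketch. The 10% rate and the file names come from the notes above; the output file names and the fixed random seed are my own illustrative choices, not necessarily what was actually used.

```python
import random

random.seed(42)        # arbitrary seed, only to make the sample reproducible
sample_rate = 0.10     # keep roughly 10% of the lines

for name in ["en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"]:
    with open(name, encoding="utf-8", errors="ignore") as src, \
         open("sample_" + name, "w", encoding="utf-8") as dst:
        for line in src:
            # Bernoulli sampling: each line is kept independently with
            # probability sample_rate, so the sample is ~10% of the file.
            if random.random() < sample_rate:
                dst.write(line)
```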

N-gram model using Markov Assumption

Bag-of-words and N-gram models are the most commonly used methods to predict the next word in a sentence.

  • The N-gram model predicts a word from the sequence of the (N-1) prior words. This model is widely used in NLP for word prediction.
  • The intuition behind the N-gram model comes from the chain rule of conditional probability: P(a,b,c,d) = P(a)P(b|a)P(c|a,b)P(d|a,b,c).
  • Computing all of these conditional probabilities and multiplying them together is a lot of work, so we simplify the model using the Markov assumption.
  • We assume the probability of the next word depends only on its previous word, its previous 2 words, or more generally its previous (N-1) words. This is called the Markov assumption, and it helps us predict the future word without looking too far into the past.
  • So P(w20|w1,w2,…,w18,w19) becomes P(w20|w18,w19), as sketched in the example below.
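
To make the Markov approximation concrete, here is a minimal Python sketch that estimates P(w3|w1,w2) from trigram and bigram counts. The toy sentence is purely illustrative; in the real model the tokens come from the sampled SwiftKey corpus.

```python
from collections import Counter

# Toy corpus; in practice the tokens come from the cleaned, sampled corpus.
tokens = "i want to eat i want to sleep i want to eat".split()

# Count bigrams and trigrams.
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_markov(w3, w1, w2):
    """MLE estimate of P(w3 | w1, w2) = count(w1,w2,w3) / count(w1,w2)."""
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

# P(eat | want, to): only the previous two words matter, not the whole history.
print(p_markov("eat", "want", "to"))  # 2/3
```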

Smoothing using the Kneser-Ney process

  • The N-gram probabilities are estimated using the Maximum Likelihood Estimate (MLE), which relies on counting and normalization, i.e. P(word2|word1) = count(word1,word2)/count(word1).
  • The N-gram model is a statistical model and it depends on the training corpus. The model generally becomes better as we increase N.
  • But not all words can be accommodated in the model. To handle words that appear in a test sentence but not in the training data, we use 'smoothing or discounting techniques'.
  • There is a wide variety of smoothing techniques, such as add-1 (Laplace), add-k, Stupid Backoff and Kneser-Ney smoothing. One of the most commonly used and best performing is Kneser-Ney smoothing, and we will be using it in our model (see the sketch below).
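
The post does not name the specific package used to compute Kneser-Ney smoothing, so the sketch below uses NLTK's `lm` module as one plausible choice; the toy sentences are illustrative only.

```python
# A sketch of Kneser-Ney smoothing with NLTK (assumed package, not confirmed
# by the post) for a trigram model.
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy tokenized sentences; in practice these come from the cleaned corpus.
sentences = [["i", "want", "to", "eat"],
             ["i", "want", "to", "sleep"]]

# Build padded 1/2/3-grams and the vocabulary for a trigram model.
train_ngrams, vocab = padded_everygram_pipeline(3, sentences)

model = KneserNeyInterpolated(order=3)
model.fit(train_ngrams, vocab)

# Smoothed probability of "eat" given the context ("want", "to");
# unseen continuations keep non-zero probability mass thanks to discounting.
print(model.score("eat", ["want", "to"]))
```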

Model building

  • Since words stored as text consume more space, I'm converting the words into numeric values (i.e. a word-to-index mapping) and counting the N-grams based on the numeric representation of the words.
  • As Python has a lot of NLP packages for creating N-grams, computing Kneser-Ney smoothing, etc., I'm relying on Python code to create the N-grams and the Kneser-Ney data. The Python output data will then be used in R for further processing.
  • After cleaning and tokenizing the corpus, I've created bigrams and Kneser-Ney smoothed tri-grams. We can predict the next word by following the steps below (a code sketch follows the list):
    • Take a text input of two words/tokens so that we can start our search in the 3-gram Kneser-Ney table. Search the 3-gram table for terms matching the given input text. If one or more matches are found, the algorithm outputs the top predictions for the next word given those two input terms.
    • If no match is found in the 3-gram table, the search continues in the bi-gram table using the last word of the input. If there is still no match, the prediction falls back to 4 random tokens from the one-gram table.
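
Here is a rough Python sketch of that backoff lookup, using the integer word indexes mentioned earlier. The table layout and helper names (`word2idx`, `predict_next`, etc.) are my own illustrative assumptions rather than the exact code used.

```python
import random
from collections import Counter

# Hypothetical index-encoded tables: words are stored as integers and the
# n-gram counts are keyed on those integers. They are filled during model
# building (in practice the trigram scores would be Kneser-Ney smoothed).
word2idx = {}            # word  -> integer index
idx2word = {}            # index -> word
unigrams = Counter()     # (w,)         -> count
bigrams = Counter()      # (w1, w2)     -> count
trigrams = Counter()     # (w1, w2, w3) -> count

def encode(word):
    """Map a word to its integer index, growing the vocabulary as needed."""
    if word not in word2idx:
        idx = len(word2idx)
        word2idx[word] = idx
        idx2word[idx] = word
    return word2idx[word]

def predict_next(text, top_n=4):
    """Trigram -> bigram -> unigram backoff, following the steps above."""
    tokens = [word2idx[w] for w in text.lower().split() if w in word2idx]

    # 1) Try the trigram table with the last two input tokens.
    if len(tokens) >= 2:
        w1, w2 = tokens[-2], tokens[-1]
        hits = Counter({k[2]: c for k, c in trigrams.items() if k[:2] == (w1, w2)})
        if hits:
            return [idx2word[i] for i, _ in hits.most_common(top_n)]

    # 2) Back off to the bigram table with the last input token.
    if tokens:
        w2 = tokens[-1]
        hits = Counter({k[1]: c for k, c in bigrams.items() if k[0] == w2})
        if hits:
            return [idx2word[i] for i, _ in hits.most_common(top_n)]

    # 3) Final fallback: a few random tokens from the one-gram table.
    choices = [idx2word[k[0]] for k in unigrams]
    return random.sample(choices, min(top_n, len(choices)))
```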