Capstone: Next Word Prediction

TZiegler
July 2016

Introduction

This presentation briefly describes a Shiny application that predicts the next word of a sentence using Machine Learning (ML). The project was carried out in cooperation with SwiftKey (http://swiftkey.com/en/) in the area of Natural Language Processing (NLP).

The Data Science/ML Process:

Obtain and understand the data
Data sampling, cleaning and processing
Applying Machine Learning algorithms
Building a Shiny Application for next word prediction

Major R packages used: “quanteda”, “stringi”, and “data.table”

Data sampling, cleaning & processing

Demo data (blogs, news and Twitter) were used as the word library for prediction. To limit memory use and keep prediction fast, subsets of the three data sets (about 10% each) were merged into one corpus. Data cleaning involved splitting the text into sentences, converting it to lower case, and removing punctuation and swear words.
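The cleaning steps can be illustrated with a minimal sketch. It is written in plain Python rather than the R packages used in the project, and the function name `clean_text`, the sentence-splitting rule and the placeholder word list are all assumptions made for illustration.

```python
import re

# Placeholder list (assumption); the word list used in the project is not given.
BLOCKED_WORDS = {"darn", "heck"}

def clean_text(raw, blocked=BLOCKED_WORDS):
    """Split raw text into sentences and return one token list per sentence."""
    # Naive sentence split on runs of ., ! or ? (assumption).
    sentences = re.split(r"[.!?]+", raw)
    cleaned = []
    for sentence in sentences:
        # Lower-case, then keep only runs of letters and apostrophes,
        # which drops punctuation and digits.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Remove blocked words.
        tokens = [t for t in tokens if t not in blocked]
        if tokens:
            cleaned.append(tokens)
    return cleaned
```

Keeping sentences as separate token lists means that n-grams built later never span a sentence boundary.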

N-grams are the basis of the word prediction application. Therefore, the next steps were:

Creating four sets of word combinations (n-grams): 4-words, 3-words, 2-words and 1-word
Calculating the cumulative frequencies of the four n-gram sets
Filtering out low-frequency n-grams (singletons) to reduce the library size for optimum performance
Saving the final library as a compressed R data file (.RData file)
Generating a separate 1-gram file for use in word completion
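The counting and filtering steps above could look roughly like the following sketch. It uses plain Python counters instead of "quanteda" and "data.table", stores raw counts rather than cumulative frequencies, and the names `count_ngrams`, `build_library` and `min_count` are hypothetical.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count every n-gram of length n across a list of token lists."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def build_library(sentences, max_n=4, min_count=2):
    """Return {n: {ngram: count}} for n = 1..max_n, dropping rare n-grams.

    min_count=2 removes singletons (n-grams seen only once).
    """
    library = {}
    for n in range(1, max_n + 1):
        counts = count_ngrams(sentences, n)
        library[n] = {g: c for g, c in counts.items() if c >= min_count}
    return library
```

For example, with the two sentences `["a", "b", "c"]` and `["a", "b", "d"]`, only the bigram `("a", "b")` and the unigrams `("a",)` and `("b",)` survive the singleton filter.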

The Prediction Model: Kneser-Ney Smoothing

The probability of occurrence of the next word in a sentence can be computed from the previous words. To predict the next word, the algorithm looks for all n-grams whose first (n-1) words match the last (n-1) words of the sentence. The predicted next word is the last word of the matching n-gram (n = 2..4) with the highest weighted frequency.
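The matching step can be sketched as a simple back-off lookup: try the longest context first and fall back to shorter ones. This is a simplification that picks the most frequent continuation at the longest matching order; it does not implement the weighted frequency used in the application. The function name `predict_next` and the table layout `{n: {ngram tuple: count}}` are assumptions.

```python
def predict_next(words, tables, max_n=4):
    """Return the most frequent continuation of `words`, or None.

    `tables` maps n to a dict of {n-gram tuple: count} (assumed layout).
    """
    for n in range(max_n, 1, -1):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough input words for this order
        # Collect last words of all n-grams whose first n-1 words match.
        candidates = {
            gram[-1]: count
            for gram, count in tables.get(n, {}).items()
            if gram[:-1] == context
        }
        if candidates:
            return max(candidates, key=candidates.get)
    return None
```

A linear scan over each table is fine for a toy example; a real library would index n-grams by their context to keep lookups fast.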

Two algorithms were tested: Naive Bayes and Kneser-Ney smoothing. In the end I used Kneser-Ney smoothing, a well-established algorithm for word prediction, because it gave better predictions.
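To give a feel for the idea, here is a minimal sketch of interpolated Kneser-Ney smoothing for bigrams only. The application works with n-grams up to length 4, so this is not its actual model; the discount of 0.75 is a commonly used default and, like the function name `kneser_ney_bigram`, an assumption.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Build P(word | prev) from {(prev, word): count}, all counts >= 1."""
    context_total = Counter()       # total count of bigrams starting with prev
    followers = defaultdict(set)    # distinct words seen after prev
    preceders = defaultdict(set)    # distinct words seen before word
    for (prev, word), count in bigram_counts.items():
        context_total[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    n_types = len(bigram_counts)    # number of distinct bigram types

    def prob(word, prev):
        # Continuation probability: in how many distinct contexts word appears.
        p_cont = len(preceders.get(word, ())) / n_types
        total = context_total.get(prev, 0)
        if total == 0:
            return p_cont  # unseen context: fall back to continuation prob.
        discounted = max(bigram_counts.get((prev, word), 0) - discount, 0) / total
        # Probability mass freed by discounting, redistributed via p_cont.
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * p_cont

    return prob
```

The continuation probability is what distinguishes this method from plain frequency counts: a word that follows many different words gets more of the redistributed mass than one that is frequent in only a single context.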

Detailed formulae applied in the algorithm can be found here:

Shiny Application and Instructions

The user enters a sequence of words in the text box
The input text is cleaned and preprocessed
The last (up to 4) words of the input text are extracted
Based on the n-gram tables, the most likely next word or word completion is predicted
The word can be inserted by clicking the corresponding button
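The extraction and word-completion steps above might be sketched as follows. How the application decides between next-word prediction and word completion is not described, so the prefix lookup against the 1-gram counts shown here is an assumption, as are the names `last_words` and `complete_word`.

```python
def last_words(text, k=4):
    """Return the last (up to k) lower-cased words of the input text."""
    return text.lower().split()[-k:]

def complete_word(prefix, unigram_counts):
    """Return the most frequent known word starting with `prefix`, or None.

    `unigram_counts` maps word -> count (assumed layout of the 1-gram file).
    """
    matches = {w: c for w, c in unigram_counts.items() if w.startswith(prefix)}
    return max(matches, key=matches.get) if matches else None
```

For example, `last_words("I would like to go HOME")` keeps only `like`, `to`, `go` and `home`, and a partially typed `th` would be completed to the most frequent 1-gram that starts with those letters.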
