Capstone: Next Word Prediction

author: TZiegler
date: July 2016
autosize: true

Introduction

This presentation briefly describes a Shiny application that predicts the next word of a sentence using Machine Learning (ML). The project was carried out in cooperation with SwiftKey (http://swiftkey.com/en/) in the area of Natural Language Processing (NLP).

The Data Science/ML Process:
- Obtain and understand the data
- Data sampling, cleaning and processing
- Applying Machine Learning algorithms
- Building a Shiny Application for next word prediction

Major R packages used: “quanteda”, “stringi”, and “data.table”

Data sampling, cleaning & processing

Demo data (blogs, news and Twitter) were used as the word library for prediction. To make good use of memory and keep prediction fast, subsets of the three data sets (about 10% each) were merged into one corpus. Data cleaning involved splitting the text into sentences, converting it to lower case, and removing punctuation and swear words.
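The sampling and cleaning steps can be sketched as follows. This is a minimal illustration in Python, not the project's R code; the function names, the fixed seed, and the tiny `PROFANITY` set are hypothetical stand-ins, and the regular expressions are only one plausible way to split sentences and strip punctuation.

```python
import random
import re

# Hypothetical placeholder list; the project's actual swear-word list is not given.
PROFANITY = {"darn", "heck"}

def sample_lines(lines, fraction=0.10, seed=42):
    """Return a random subset holding roughly `fraction` of the input lines."""
    if not lines:
        return []
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(lines), max(1, round(len(lines) * fraction)))
    return rng.sample(lines, k)

def clean_text(text):
    """Split text into sentences, lower-case it, drop punctuation and banned words.

    Returns a list of sentences, each a list of word tokens.
    """
    cleaned = []
    for sentence in re.split(r"[.!?]+", text):
        # Keep only lower-case letters, apostrophes and whitespace.
        sentence = re.sub(r"[^a-z'\s]", " ", sentence.lower())
        tokens = [t for t in sentence.split() if t not in PROFANITY]
        if tokens:
            cleaned.append(tokens)
    return cleaned
```

Splitting into sentences before tokenizing keeps later n-grams from spanning a sentence boundary.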

N-grams are the basis of the word prediction application. Therefore, the next steps were:
- Creating four sets of word combinations (n-grams): 4-words, 3-words, 2-words and 1-word
- Calculating the cumulative frequencies of the four n-gram sets
- Filtering out low-frequency n-grams (singletons) to reduce the library size for optimum performance
- Saving the final library file as an R-compressed file (.RData file)
- Generating a separate 1-gram file for use in word completion
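The counting and pruning steps above might look like the following sketch. It is written in Python for illustration rather than with the R packages the project used; `count_ngrams`, `prune_singletons` and `build_library` are hypothetical names, and a minimum count of 2 is an assumed reading of "filtering out singletons".

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count every run of n consecutive tokens within each sentence."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune_singletons(counts, min_count=2):
    """Drop n-grams seen fewer than `min_count` times to shrink the library."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

def build_library(sentences, max_n=4):
    """Build pruned n-gram tables for n = 1..max_n, keyed by n."""
    return {n: prune_singletons(count_ngrams(sentences, n))
            for n in range(1, max_n + 1)}
```

For example, `build_library([["i", "love", "you"], ["i", "love", "it"]])[2]` keeps only the bigram `("i", "love")`, since every other bigram occurs once.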

The Prediction Model: Kneser-Ney Smoothing

The probability of occurrence of the next word in a sentence can be computed from the previous words. To predict the next word, the algorithm looks for all n-grams whose first (n-1) words match the last (n-1) words of the sentence. The predicted next word is the last word of the matching n-gram (n = 2..4) with the highest weighted frequency.
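The lookup described above can be sketched as a simple back-off search. This Python illustration makes two assumptions that the slides do not confirm: it tries the longest context first, and it ranks candidates by raw count instead of a weighted frequency. The `library` layout (a dict mapping n to a table of n-gram tuples and counts) and the name `predict_next` are hypothetical.

```python
def predict_next(library, text, max_n=4):
    """Return the most frequent continuation of `text`, or None if nothing matches.

    `library` maps n to a dict of {n-gram tuple: count}. The search starts with
    the longest context (max_n - 1 words) and backs off to shorter ones.
    """
    tokens = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue  # sentence too short for this n-gram order
        context = tuple(tokens[-(n - 1):])
        # Collect the last word of every n-gram whose prefix equals the context.
        candidates = {gram[-1]: count
                      for gram, count in library.get(n, {}).items()
                      if gram[:-1] == context}
        if candidates:
            return max(candidates, key=candidates.get)
    return None
```

A linear scan over each table keeps the sketch short; a real application would index the tables by context so each lookup is constant time.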

Two algorithms were tested: Naive Bayes and Kneser-Ney smoothing. In the end I used Kneser-Ney smoothing, a state-of-the-art algorithm for word prediction, because it gave better predictions.
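To give a feel for how Kneser-Ney smoothing works, here is a minimal sketch of the interpolated form for bigrams only. It is a Python illustration, not the project's implementation, which used n-grams up to order 4. The fixed discount of 0.75 and the name `kneser_ney_bigram` are assumptions. The probability combines a discounted bigram estimate with a continuation probability that measures how many distinct words precede the candidate.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Return prob(w, v) = P_KN(w | v) built from {(v, w): count} with counts >= 1."""
    if not bigram_counts:
        raise ValueError("bigram_counts must not be empty")
    context_totals = Counter()       # c(v): total count of bigrams starting with v
    followers = defaultdict(set)     # distinct words seen after v
    predecessors = defaultdict(set)  # distinct words seen before w
    for (v, w), count in bigram_counts.items():
        context_totals[v] += count
        followers[v].add(w)
        predecessors[w].add(v)
    n_bigram_types = len(bigram_counts)

    def prob(w, v):
        # Continuation probability: share of bigram types that end in w.
        p_cont = len(predecessors.get(w, ())) / n_bigram_types
        c_v = context_totals[v]
        if c_v == 0:
            return p_cont  # unseen context: rely on the continuation estimate
        discounted = max(bigram_counts.get((v, w), 0) - discount, 0) / c_v
        # Mass removed by discounting is redistributed via p_cont.
        backoff_weight = discount * len(followers[v]) / c_v
        return discounted + backoff_weight * p_cont

    return prob
```

Because the discounted mass is handed back through the continuation term, the probabilities for a seen context sum to 1 over all words that occur as a second element of some bigram.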

Detailed formulae applied in the algorithm can be found here:

Shiny Application and Instructions

