Data Science Capstone Project - Predict Next Word using N-gram Lookup Tables

Lynn Huang
September 13, 2017

Introduction

The objective of the Data Science Capstone project is to build a data product that predicts the next word when a smartphone user enters a text string (e.g. an incomplete phrase or an incomplete sentence). SwiftKey is an example of a commercial product with this feature. The data product runs on shinyapps.io, a hosting environment for Shiny web applications. It is by no means a polished product, but rather a simple application that shows my efforts to learn and apply data science knowledge to a practical problem in Natural Language Processing (NLP).

Data Exploration and Preprocessing

The data provided for the Capstone project is a large English-language corpus consisting of three files (blogs, news, and tweets), with more than four million lines and more than 100 million words in total. Because of the machine's resource limitations, a 10% sample of the corpus was used in this project.

To improve prediction, the following data cleaning steps (transformations) are applied to the text:

conversion to lower case
removal of non-English characters
removal of numbers, punctuation, and stop words
removal of profanity
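
The cleaning steps listed above can be sketched in a few lines. The project itself is written in R; the Python below is only an illustration, and the function name `clean_text`, the tiny `STOP_WORDS` and `PROFANITY` sets, and the use of "non-ASCII" as a rough stand-in for "non-English" are all assumptions, not the lists or rules actually used.

```python
import re

# Hypothetical, tiny word lists for illustration only; the stop-word and
# profanity lists actually used in the project are not given in this report.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and"}
PROFANITY = {"darn"}

def clean_text(text):
    """Apply the four cleaning steps and return the remaining tokens."""
    text = text.lower()                        # conversion to lower case
    text = re.sub(r"[^\x00-\x7f]", " ", text)  # drop non-ASCII characters (assumed proxy for non-English)
    text = re.sub(r"[^a-z\s]", " ", text)      # drop numbers and punctuation
    tokens = text.split()
    # drop stop words and profanity
    return [t for t in tokens if t not in STOP_WORDS and t not in PROFANITY]

print(clean_text("The weather is NICE today, 100%!"))
```

With these assumed word lists, the call above leaves the tokens `weather`, `nice`, and `today`.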

Creation of N-gram document feature matrix (dfm)

According to Wikipedia, an n-gram (e.g. unigram, bigram, trigram, quadgram) is a contiguous sequence of n items from a given sequence of text or speech.

Before the n-gram document feature matrix (dfm), which holds the frequencies of the n-grams in the documents, was generated, the corpus was first tokenized as an intermediate step, with the data cleaning described above applied at the same time. This intermediate step was observed to speed up n-gram generation dramatically. Matrices of bigrams, trigrams, and quadgrams with frequency > 1 (more than one occurrence) were created and saved to disk for later use. These n-grams (1-4 grams) become the system dictionary for next word prediction.
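
As a rough illustration of this step, the sketch below counts n-grams over already tokenized text and keeps only those seen more than once. It is plain Python rather than the R dfm workflow used in the project, and the names `count_ngrams`, `prune`, and the sample `docs` are hypothetical.

```python
from collections import Counter

def count_ngrams(token_lists, n):
    """Count every contiguous n-word sequence across tokenized documents."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_count=2):
    """Keep only n-grams with frequency > 1 (at least min_count occurrences)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Made-up, already cleaned and tokenized documents.
docs = [
    ["happy", "new", "year"],
    ["happy", "new", "day"],
    ["brand", "new", "year"],
]
bigrams = prune(count_ngrams(docs, 2))
print(bigrams)
```

Tokenizing once and reusing the token lists for every value of n mirrors the intermediate step described above: the cleaning work is not repeated for each n-gram size.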

Algorithm of Predicting Next Word

A simple back-off model is used to predict the next word.

If the text entered has three or more words after data cleaning, the model tries to match the last three words in the quadgram lookup table. If there is a match, the fourth word of the matching 4-gram phrase with the highest frequency is returned as the predicted next word. If there is no match, the model backs off to the (n-1)-gram, the trigram lookup table, and tries to match the last two words. If a match is found, the third word of the matching 3-gram phrase with the highest frequency is returned. If there is no such match, the model backs off to the bigram lookup table and follows the same procedure with the last word. If no match is found there either, the system returns the message “can not predict next word”. When the text has two words after data cleaning, the model starts with the trigram table and backs off to the bigram table if necessary.
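
The back-off procedure described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's R code: the function name `predict_next_word`, the nested-dictionary layout of `tables`, and the sample frequencies are all hypothetical.

```python
NO_MATCH = "can not predict next word"

def predict_next_word(tokens, tables):
    """Back off from quadgram to trigram to bigram lookup.

    tokens: the cleaned input words.
    tables: maps n -> {context tuple of n-1 words: {next word: frequency}}.
    """
    for n in (4, 3, 2):
        k = n - 1                      # number of context words needed
        if len(tokens) < k:
            continue                   # input too short for this table
        context = tuple(tokens[-k:])
        candidates = tables.get(n, {}).get(context)
        if candidates:
            # return the continuation with the highest frequency
            return max(candidates, key=candidates.get)
    return NO_MATCH

# Made-up lookup tables for illustration.
tables = {
    4: {("happy", "new", "year"): {"everyone": 3, "friends": 2}},
    3: {("new", "year"): {"eve": 5, "day": 4}},
    2: {("year",): {"old": 9}, ("happy",): {"birthday": 7}},
}

print(predict_next_word(["wish", "happy", "new", "year"], tables))  # quadgram match
print(predict_next_word(["brand", "new", "year"], tables))          # backs off to trigram
print(predict_next_word(["zebra"], tables))                         # no match at any level
```

In this sketch the first call is answered from the quadgram table, the second falls back to the trigram table because ("brand", "new", "year") is not a known context, and the third returns the no-match message.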

The text entered is transformed with the same cleaning methods as the corpus, such as conversion to lower case and removal of numbers, punctuation, and stop words. For the matching algorithm to work, each n-gram matrix is transformed into a data table in which the n words of each n-gram are split into the first n columns, with the frequency as the last column (column n+1).
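
To make the table layout concrete, the sketch below splits each n-gram into n word columns plus a frequency column and then matches on the leading columns. It is an illustrative Python version, not the project's R data table; the names `to_rows` and `lookup`, the "_" separator between words, and the sample counts are assumptions.

```python
def to_rows(ngram_counts, sep="_"):
    """Turn {"w1_w2_w3": freq} into rows (w1, w2, w3, freq)."""
    return [tuple(phrase.split(sep)) + (freq,)
            for phrase, freq in ngram_counts.items()]

def lookup(rows, context):
    """Return the word following `context` in the most frequent matching row.

    `context` must hold n-1 words for a table of n-grams.
    Returns None when no row starts with the context.
    """
    k = len(context)
    matches = [row for row in rows if row[:k] == tuple(context)]
    if not matches:
        return None
    best = max(matches, key=lambda row: row[-1])  # last column is the frequency
    return best[k]                                # column n holds the next word

# Made-up trigram frequencies for illustration.
trigram_counts = {"new_year_eve": 5, "new_year_day": 4, "happy_new_year": 6}
rows = to_rows(trigram_counts)
print(rows[0])
print(lookup(rows, ["new", "year"]))
```

Keeping each word in its own column means a match is a simple equality test on the first n-1 columns, and the prediction is read directly from column n of the row with the largest frequency.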

Shiny App Demo

The Shiny App is running at

https://mlynnhuang.shinyapps.io/capstone-final/

The user can enter a text string in the text box on the left panel.

On the right panel (main panel), underneath the header “WORD / TEXT / SENTENCE ENTERED:”, the text entered by the user (before data cleaning) is displayed in red.

Underneath the header “SEARCHING N-GRAMS TO SHOW NEXT WORD: ”, if there is a match, a message naming the specific n-gram table used is displayed in red, e.g., “Trying to Predict Next Word Using Bigram”. If there is no match, the message “can not predict next word” is displayed.

The Shiny ui.R and server.R files can be found at the link below

https://github.com/mlynnhuang/Data-Science-Capstone-Project