Word Prediction

Kishore Mamidi
June 16, 2017

An application to predict the next word

Introduction

The goal of this application is to build a model, using natural language processing tools, that predicts the most likely next words given the preceding words in a sentence.

There are many uses for this application, including:

  • predictive keyboards in smartphones
  • autocomplete suggestions in search engines
  • improved translation accuracy

This application was developed as part of the capstone project for the Data Science Specialization offered by Johns Hopkins University via Coursera. To learn more about the project, visit the course page.

Data Preparation

This app uses data from sample blogs, news articles, and tweets that were downloaded from the course repository.

The following cleaning operations were performed on the raw data:

  • Remove non-ASCII characters
  • Convert all text to lowercase
  • Remove URLs, numbers, and leading and trailing whitespace
  • Split text into sentences
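
The cleaning steps above can be sketched in a few lines. This is a minimal illustration in Python, not the app's actual code (which is not shown here); the function name `clean_text` and the regular expressions are assumptions made for the example.

```python
import re

def clean_text(raw):
    """Return a list of cleaned sentences (hypothetical helper)."""
    # Remove non-ASCII characters.
    text = raw.encode("ascii", errors="ignore").decode("ascii")
    # Convert all text to lowercase.
    text = text.lower()
    # Remove URLs first, because they contain periods that would
    # otherwise be mistaken for sentence boundaries.
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    # Remove numbers.
    text = re.sub(r"\d+", " ", text)
    # Split into sentences on ., ! and ? (a simplifying assumption).
    sentences = re.split(r"[.!?]+", text)
    # Collapse inner whitespace and trim leading/trailing whitespace.
    cleaned = [" ".join(s.split()) for s in sentences]
    # Drop sentences that ended up empty.
    return [s for s in cleaned if s]
```

For example, `clean_text("Check https://example.com NOW! I saw 3 dogs.")` yields two sentences with the URL and the number removed.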

Once the data was cleaned, n-gram frequency tables (n = 1 to 5) were generated with the tidytext package. Because term frequencies follow Zipf’s law, a large share of distinct n-grams occur only once; these single-occurrence n-grams were pruned to reduce data size and improve performance.
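
The app builds its tables with tidytext in R. As a rough Python analogue, the sketch below counts n-grams per sentence and drops those seen only once. The names `ngram_counts` and `prune_singletons` are hypothetical, and pruning every order (including unigrams) is an assumption.

```python
from collections import Counter

def ngram_counts(sentences, max_n=5):
    """Map each n in 1..max_n to a Counter of n-gram tuples."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            # Slide a window of width n over the sentence; n-grams
            # never cross sentence boundaries.
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables

def prune_singletons(tables):
    """Drop n-grams that occur only once (assumed for all orders)."""
    return {
        n: Counter({gram: c for gram, c in table.items() if c > 1})
        for n, table in tables.items()
    }
```

On a corpus with a Zipf-like distribution, the pruned tables are much smaller than the raw ones, at the cost of losing rare continuations.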

Prediction Algorithm

This application uses a 5-gram probabilistic model and applies the Stupid Backoff algorithm to rank next-word candidates.

The Stupid Backoff algorithm can be summarized as follows:

[Figure: Stupid Backoff]
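
The original figure is not recoverable here. For reference, the standard formulation of Stupid Backoff (Brants et al., 2007) scores a candidate word by its relative frequency at the highest n-gram order where it has been observed, multiplying by a fixed factor for each backoff step. The notation is an assumption for this sketch: f(·) is an n-gram count, N is the total number of tokens, and α is the backoff factor (0.4 in the original paper; the value used by this app is not stated).

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})}
    & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
  \alpha \, S(w_i \mid w_{i-k+2}^{i-1})
    & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```

The scores S are not normalized probabilities, which is what makes the method cheap to compute on large tables.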

The Stupid Backoff implementation in this app starts with up to the last four words typed and tries to find 5-grams that begin with those four words. If fewer than the maximum number of predictions are found, the algorithm proceeds to match the last three words against the 4-gram table, and so on, until it has found the requested number of results to return.
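
The lookup described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the app's code: `predict_next` is a hypothetical name, `tables` is assumed to map each order n to a dictionary of n-gram tuples and their counts, and backing off all the way to unigrams is assumed from "and so on". Within one order, ranking by raw count gives the same order as ranking by the Stupid Backoff score, because all candidates share the same context count.

```python
from collections import Counter

def predict_next(text, tables, max_predictions=5, max_n=5):
    """Return up to max_predictions next-word candidates."""
    tokens = text.lower().split()
    results = []
    # Start with the longest usable context (max_n - 1 words) and
    # back off one word at a time, down to the empty context.
    for k in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        table = tables.get(k + 1, {})
        matches = Counter()
        # Linear scan for clarity; a real app would index by context.
        for gram, count in table.items():
            if gram[:-1] == context and gram[-1] not in results:
                matches[gram[-1]] = count
        # Higher-order matches are kept ahead of lower-order ones.
        for word, _ in matches.most_common():
            results.append(word)
            if len(results) == max_predictions:
                return results
    return results
```

A call such as `predict_next("I love", tables, max_predictions=3)` first looks for 3-grams starting with "i love", then 2-grams starting with "love", then falls back to the most frequent single words.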

Using the Application

The word prediction application can be accessed here. To use the app:

  1. Wait for the message 'App is ready to use' to appear
  2. Select the maximum predictions you wish to see (default is 5)
  3. Enter sample text in the text box and click the 'Get Next Words' button
  4. The predictions will be shown in the right pane

In a future iteration, the app could be further optimized by:

  • using Kneser-Ney smoothing instead of Stupid Backoff
  • precomputing all necessary scores from the n-gram frequencies
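
As one possible reading of the precomputation idea, the sketch below turns raw counts into a lookup table from each context to its top-scoring next words, so that prediction becomes a dictionary lookup. It is an assumption-laden illustration: `precompute_scores` is a hypothetical name, scores are plain relative frequencies, and any backoff factor would still be applied at lookup time.

```python
from collections import defaultdict

def precompute_scores(tables, top_k=5):
    """Map each context tuple to its top_k (word, score) pairs."""
    # Total token count, used as the denominator for unigrams.
    total = sum(tables.get(1, {}).values())
    lookup = defaultdict(list)
    for n, table in tables.items():
        for gram, count in table.items():
            context, word = gram[:-1], gram[-1]
            if n == 1:
                denom = total
            else:
                # Count of the context in the next-lower-order table.
                denom = tables.get(n - 1, {}).get(context, 0)
            if denom > 0:
                lookup[context].append((word, count / denom))
    # Keep only the best top_k candidates for each context.
    return {
        ctx: sorted(cands, key=lambda pair: -pair[1])[:top_k]
        for ctx, cands in lookup.items()
    }
```

Storing only the top candidates per context trades a one-time preprocessing cost for smaller tables and faster responses in the app.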