Next Word Prediction Tool

An R Shiny application that predicts the next word

JiaHsuan Lo
Coursera Data Science Student

Introduction

Automatic word completion and next-word prediction are common features of user input devices; they help users enter text more easily, quickly, and accurately.

This application is designed to give users a convenient and efficient typing tool. The underlying algorithm is based on an N-gram language model with Kneser-Ney smoothing. Some background information can be found at the following links:

Data used to Build the Language Model

The data come from a corpus called HC Corpora (www.corpora.heliohost.org); the data files were obtained from the Coursera website.
The text sources include blogs, Twitter, and news. Only the US English files were used. Line and word counts are summarized in the following table:

File Name	Number of Lines	Number of Words
en_US.blogs.txt	899288	38315977
en_US.twitter.txt	167155	2191565
en_US.news.txt	1010242	35627434

To keep the app small and fast on general PCs and mobile devices, the following sampling steps were used when building the model:
- 6% of the text from each file was drawn by stratified random sampling. Then 1- to 4-gram frequency tables were created.
- From the 1-gram data, the 5000 most common words were identified by frequency.
- Only the n-gram data (n = 1...4) that contain the 5000 most common words were preserved.
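
The sampling steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code used to build the app: the function names, the whitespace tokenizer, and the lower-casing are assumptions. It also assumes that an n-gram is kept only when every one of its words is in the top-word vocabulary, which is one possible reading of the last step.

```python
import random
from collections import Counter

def sample_lines(lines, fraction, seed=0):
    """Keep roughly `fraction` of the lines, chosen at random."""
    rng = random.Random(seed)
    k = round(len(lines) * fraction)
    return rng.sample(lines, k)

def ngram_counts(lines, n):
    """Count n-grams over lower-cased, whitespace-tokenized lines (assumed tokenizer)."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def build_tables(sources, fraction=0.06, vocab_size=5000, max_n=4, seed=0):
    """`sources` maps a (hypothetical) source name to its list of text lines."""
    # Stratified sampling: draw the same fraction from each source.
    sampled = []
    for lines in sources.values():
        sampled.extend(sample_lines(lines, fraction, seed))

    # Frequency tables for 1- to max_n-grams.
    tables = {n: ngram_counts(sampled, n) for n in range(1, max_n + 1)}

    # Vocabulary: the most frequent single words.
    vocab = {gram[0] for gram, _ in tables[1].most_common(vocab_size)}

    # Assumption: keep an n-gram only if all of its words are in the vocabulary.
    for n in tables:
        tables[n] = Counter(
            {g: c for g, c in tables[n].items() if all(w in vocab for w in g)}
        )
    return tables, vocab
```

Pruning the tables to a fixed vocabulary is what keeps the model small: the number of stored n-grams is bounded by combinations of the retained words rather than by everything seen in the corpus.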

Algorithm Behind the Scenes

Automatic word completion:
- A 1-gram model was used to predict the most probable word.
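
As an illustration of 1-gram word completion, the sketch below ranks vocabulary words that start with the typed prefix by their frequency. The function name `complete_word` and the tie-breaking rule (alphabetical) are assumptions made for this example; the app's actual implementation may differ.

```python
from collections import Counter

def complete_word(prefix, unigram_counts, k=3):
    """Return up to k words starting with `prefix`, most frequent first."""
    prefix = prefix.lower()
    matches = [(w, c) for w, c in unigram_counts.items() if w.startswith(prefix)]
    # Sort by descending count; break ties alphabetically for a stable result.
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in matches[:k]]
```

For example, with counts `{"the": 10, "then": 6, "they": 4}`, the prefix `"th"` yields `"the"` first, since it is the most frequent match.
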
Next word prediction (a mixed smoothing algorithm):
- The recursive Kneser-Ney formula was applied to the 1- to 4-gram language models to calculate the probabilities of the next-word candidates.
- 2- and 3-gram models were also used to estimate the probabilities of the next-word candidates.
- An interpolation scheme was used to combine these estimates into the final probabilities.
- The most probable word was then returned as the prediction.
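
To make the smoothing idea concrete, here is a simplified sketch of interpolated Kneser-Ney restricted to bigrams. It is not the app's mixed 1- to 4-gram scheme: the class name `BigramKneserNey`, the fixed discount of 0.75, and the fallback to the continuation probability for unseen contexts are all assumptions chosen for illustration. The recursive formula applies the same discount-and-back-off step at each n-gram order.

```python
from collections import Counter, defaultdict

class BigramKneserNey:
    """Interpolated Kneser-Ney for bigrams (illustrative, fixed discount)."""

    def __init__(self, bigram_counts, discount=0.75):
        self.d = discount
        self.bigrams = bigram_counts
        self.context_total = Counter()     # c(v) = sum over w of c(v, w)
        self.followers = defaultdict(set)  # distinct words seen after v
        self.preceders = defaultdict(set)  # distinct words seen before w
        for (v, w), c in bigram_counts.items():
            self.context_total[v] += c
            self.followers[v].add(w)
            self.preceders[w].add(v)
        self.n_bigram_types = len(bigram_counts)

    def p_continuation(self, w):
        # Share of distinct bigram types that end in w.
        return len(self.preceders.get(w, ())) / self.n_bigram_types

    def prob(self, w, v):
        """P(w | v): discounted bigram estimate plus weighted back-off."""
        total = self.context_total.get(v, 0)
        if total == 0:
            # Unseen context: fall back to the continuation probability.
            return self.p_continuation(w)
        discounted = max(self.bigrams.get((v, w), 0) - self.d, 0) / total
        backoff_weight = self.d * len(self.followers[v]) / total
        return discounted + backoff_weight * self.p_continuation(w)

    def predict(self, v, k=1):
        """Return the k most probable next words after v."""
        ranked = sorted(self.preceders, key=lambda w: (-self.prob(w, v), w))
        return ranked[:k]
```

The back-off weight is exactly the probability mass removed by discounting, so the probabilities over all candidate words still sum to 1 for a seen context. The continuation probability favors words that follow many different contexts, which is what distinguishes Kneser-Ney from backing off to plain word frequency.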

Application Screenshot

The application can be accessed at https://jiahsuanlo.shinyapps.io/NextWord/