Next Word Text Prediction

Michael J. Pfammatter
2015-08-23

Data Science Capstone Final Project

Johns Hopkins University
Data Science Coursera Specialization Track

Introduction

Next word text prediction is important because it can:

save time
make typing easier
help with correcting spelling
help to correct grammar

This matters more and more as mobile devices get smaller and rely solely on on-screen keyboards, which can make entering text cumbersome.

This proof-of-concept app uses the Kneser-Ney smoothing algorithm to predict the three words the user is most likely to need next, and it provides an easy way to add the correct word to their text.

Prediction Algorithm

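Kneser-Ney smoothing is a smoothing method for n-gram language models that is used here to predict the most likely next word. It builds on absolute discounting: a fixed discount \(\delta\) is subtracted from each observed count, and the probability mass this frees up, weighted by \(\alpha\), is given to a lower-order estimate.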

\[ P_{abs}(w_{i} \mid w_{i-1}) = \frac{\max(c(w_{i-1}w_{i}) - \delta, 0)}{\sum_{w'} c(w_{i-1}w')} + \alpha \, P_{abs}(w_{i}) \]

The common example of this is “San Francisco”. If we predicted by word count alone, “Francisco” might get a high probability because it appears often in writing. Kneser-Ney smoothing recognizes that the word is frequent only when it follows “San”, and therefore lowers its probability after other words.
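The “San Francisco” effect can be reproduced on a toy corpus. The following is a minimal sketch in Python, not the app's own R code. It assumes the standard interpolated Kneser-Ney bigram form, in which the lower-order term is a continuation probability (the share of distinct bigram types that end in the word). The function name `kneser_ney_bigram`, the discount of 0.75, and the toy corpus are all illustrative assumptions.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Build an interpolated Kneser-Ney bigram estimator (illustrative sketch)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()        # how often each word occurs as a context
    followers = defaultdict(set)     # distinct words seen after each context
    preceders = defaultdict(set)     # distinct words seen before each word
    for (prev, word), n in bigrams.items():
        context_count[prev] += n
        followers[prev].add(word)
        preceders[word].add(prev)
    distinct_bigrams = len(bigrams)

    def prob(word, prev):
        # Continuation probability: in how many distinct contexts does `word` appear?
        p_cont = len(preceders.get(word, ())) / distinct_bigrams
        total = context_count[prev]
        if total == 0:
            # Unseen context: fall back to the continuation probability alone.
            return p_cont
        discounted = max(bigrams[(prev, word)] - discount, 0) / total
        # Mass removed by discounting, redistributed via p_cont.
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * p_cont

    return prob

# Toy corpus (assumed): "francisco" occurs 4 times but only after "san";
# "tea" occurs 3 times, after three different words.
tokens = ("san francisco san francisco san francisco san francisco "
          "green tea black tea iced tea").split()
prob = kneser_ney_bigram(tokens)

print(prob("francisco", "san"))   # high: the bigram was observed often
print(prob("francisco", "hot"))   # low: "francisco" follows only one distinct word
print(prob("tea", "hot"))         # higher, despite "tea" being the rarer word
```

Although “francisco” has the larger raw count, it scores below “tea” after a context that has not been seen with either word, which is the behavior described above.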

Proof-of-concept App

Visit the “What's up next?” app at https://gnomesoup.shinyapps.io/shiny

The app searches a SQLite database of words and their frequency counts, built from a corpus of texts collected from blogs, Twitter, and news articles. These frequency counts are used to calculate the Kneser-Ney smoothing prediction. The app first looks for a match among tetra-grams (sets of four words) and returns the highest-scoring words. If no tetra-gram can be found, we “back off” to a tri-gram (set of three words) search. Finally, if no tri-gram exists, we search for the best match among bi-grams (sets of two words).
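The back-off search described above can be sketched as follows. This is a minimal Python illustration, not the app's R code. The table `ngrams`, its columns (`context`, `word`, `score`), the function `suggest`, and the sample rows are all hypothetical. The `score` column stands in for a Kneser-Ney score assumed to be computed ahead of time from the frequency counts.

```python
import sqlite3

def suggest(conn, text, k=3):
    """Return up to k next-word candidates, backing off from 4-grams to 2-grams."""
    words = text.lower().split()
    # A context of 3 words corresponds to a tetra-gram, 2 to a tri-gram, 1 to a bi-gram.
    for n in (3, 2, 1):
        if len(words) < n:
            continue
        context = " ".join(words[-n:])
        rows = conn.execute(
            "SELECT word FROM ngrams WHERE context = ? "
            "ORDER BY score DESC, word LIMIT ?",
            (context, k),
        ).fetchall()
        if rows:
            return [row[0] for row in rows]
    return []

# Hypothetical in-memory database with made-up scores, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ngrams (context TEXT, word TEXT, score REAL)")
conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?)", [
    ("thanks for the", "follow", 0.4),
    ("thanks for the", "update", 0.3),
    ("for the", "first", 0.2),
    ("the", "same", 0.1),
])

print(suggest(conn, "Thanks for the"))    # matched as a tetra-gram context
print(suggest(conn, "waiting for the"))   # backs off to the tri-gram context
print(suggest(conn, "under the"))         # backs off to the bi-gram context
```

Trying the longest context first means a more specific match always wins, and the shorter contexts are consulted only when the longer ones return nothing.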

"What's up next?" App Instructions

As you type in the open text box, the app will provide suggestions for the next word. If a suggested word is the one you are looking for, click it and watch the word get added to your text.

Once you are finished typing, click “Select all text” and press Ctrl-C (Windows) or Cmd-C (Mac) to copy the text to the clipboard. If you prefer, click “Save as text file” to download a “.txt” version of your words.

Go ahead! Start using the app

Resources

Here is a list of resources used to create this app.

Corpus data was collected from the HC Corpora Collection

Natural language Processing information was gathered from

The app was written in the R language using the RStudio environment and the dplyr, knitr, readr, shiny, shinyjs and stringr packages.