Next Word Prediction

August 2015

Data Science Capstone

Johns Hopkins Bloomberg School of Public Health

Final Project Presentation

What is Next Word Prediction (NWP) Software?

The aim of this kind of software is to predict the next word while the user is entering text. It is frequently used on mobile devices, where the goal is to improve typing speed. Prediction is achieved by analyzing large sets of real text, for instance blogs, web pages, Twitter, and a wide variety of social network text. For this project the text corpus is

HC Corpora

This data contains text from blogs, news and Twitter.

Text data is usually processed with text mining and natural language processing techniques.

Initial Data Processing

  • The raw data contains many unnecessary characters that must be removed before it is fit for statistical processing. Cleaning the corpora includes removing numbers, punctuation, stopwords (words that are filtered out before or after natural language processing), and profanity based on a language dictionary.
  • The cleaned data is grouped by the frequency of individual words, pairs of words, and so on. This is called N-gram tokenization.
  • Frequency tables are optimized to save space and for subsequent use.
  • Frequency tables can be used for direct searching or for model training.
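
The cleaning and tokenization steps above can be sketched as follows. This is a minimal Python illustration, not the project's actual code: the names `clean` and `ngram_counts` are hypothetical, and the tiny `STOPWORDS` and `PROFANITY` sets are placeholders standing in for real dictionaries.

```python
import re
from collections import Counter

# Hypothetical, tiny stand-ins for real stopword and profanity dictionaries.
STOPWORDS = {"the", "a", "of"}
PROFANITY = {"darn"}

def clean(text):
    """Lowercase, drop numbers and punctuation, then filter unwanted words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replaces digits and punctuation with spaces
    return [w for w in text.split() if w not in STOPWORDS and w not in PROFANITY]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (the N-gram frequency table)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = clean("The cat sat, the cat ran: 2 cats!")
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
```

Here `tokens` holds the cleaned words in order, and each table maps a tuple of words to how often it occurs. Note that this simple regular expression also splits contractions such as "don't"; a real cleaner would need to decide how to handle them.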

Algorithm of NWP

There are several algorithms for NWP. Due to shinny.io server space and resource restrictions, we use a simple backoff algorithm:

As the user types, we take the last N-1 words entered (N being the order of the N-gram table) and search the N-gram table for the match with the highest frequency. If there is no match, the algorithm searches the (N-1)-gram table, then the (N-2)-gram table, and so on recursively until it gets a result. If no result is found in any N-gram table, the most "popular" word from the 1-gram table is returned. This implementation uses 4-, 3-, 2- and 1-grams built from samples of the original data. This reduction was made so that the app fits within a free shinny.io account.
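
The backoff search described above can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the app's implementation: `build_tables` and `predict` are hypothetical names, and the tables here are plain dictionaries built from a toy sentence rather than the sampled corpora.

```python
from collections import Counter

def build_tables(tokens, max_n=4):
    """Map each order n to {context tuple: Counter of the words that follow it}."""
    tables = {}
    for n in range(1, max_n + 1):
        table = {}
        for i in range(len(tokens) - n + 1):
            context, word = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
            table.setdefault(context, Counter())[word] += 1
        tables[n] = table
    return tables

def predict(tables, typed, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    for n in range(max_n, 0, -1):
        context = tuple(typed[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # not enough typed words for this order
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the tables are empty

corpus = "i love new york i love new shoes i love new york".split()
tables = build_tables(corpus)
print(predict(tables, ["i", "love", "new"]))    # found in the 4-gram table
print(predict(tables, ["we", "adore", "new"]))  # backs off to the 2-gram table
```

With an empty context, the 1-gram table reduces to overall word frequencies, so the final fallback returns the most frequent word, as in the description above.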

The program can be improved by using training methods to capture relationships between non-consecutive words, and by using the full corpora to build the N-grams.

The App

Using the app is simple: just type your text in the text box (the same cleaning process used for the corpora will be applied). The cleaned string used for the search is displayed along with the predicted next word.

App

The app can be run here

I hope you like it!