The product is developed using an English corpus (a large set of text) that contains news articles, blogs, and tweets.
The combined corpus, before cleaning, contains 4.3 million lines and 102.4 million words (big data territory).
There are about 1 million words in the English language as of Jan 1, 2014*.
Everyday conversation requires roughly 2,500-3,000 actively used words to communicate effectively (with the passive vocabulary of words not used often, the total is about 15k-20k).
Statistical Modeling
Naturally spoken English words are highly correlated, which means the probability of the next word in a sentence depends on the preceding words.
Because English words are highly correlated, the probability of seeing any word w given the previous words w1, w2, ..., wn-1 can be estimated with the Markov assumption [Martin Christian Körner].
As an example, given the sentence “the quick brown fox”, the probability of “fox” following “the quick brown” can be estimated as follows.
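A minimal illustration of the maximum-likelihood estimate in R (the counts below are made up for illustration, not taken from the actual corpus):

# P(fox | the quick brown) = count("the quick brown fox") / count("the quick brown")
count_quick_brown_fox <- 12   # hypothetical count of "the quick brown fox"
count_quick_brown     <- 20   # hypothetical count of "the quick brown"
p_fox <- count_quick_brown_fox / count_quick_brown
p_fox                         # 0.6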
Maximum Likelihood (ML) and Smoothing for Missing Tokens
n-grams (word tokens) and their respective frequencies of occurrence were tabulated from the sampled data (those seen in the corpus). The Good-Turing discount smoothing technique was intended to fill in values for unseen tokens; however, due to the limitations of my home PC, it was left out (the permutations of unseen unigram words alone were too much data to process on a home PC).
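As an illustration, a minimal sketch of how the n-gram frequency tables might be tabulated with the stringi package, assuming sample_lines holds the 1% corpus sample (function and object names here are illustrative, not the actual production code):

library(stringi)
# Split the sampled lines into lowercase word tokens and count n-grams.
make_ngrams <- function(lines, n) {
  words <- unlist(stri_split_regex(stri_trans_tolower(lines), "\\s+"))
  words <- words[words != ""]
  if (length(words) < n) return(table(character(0)))
  grams <- sapply(seq_len(length(words) - n + 1),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "))
  sort(table(grams), decreasing = TRUE)
}
unigrams <- make_ngrams(sample_lines, 1)
bigrams  <- make_ngrams(sample_lines, 2)
trigrams <- make_ngrams(sample_lines, 3)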
After stacking the n-gram tokens, a back-off technique was applied, prioritizing the trigram, then the bigram, and finally the unigram. (A lambda value was selected after a few trials for the best outcome.)
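A minimal sketch of the back-off idea, assuming the trigram/bigram/unigram tables above and a lambda penalty chosen by trial (an illustration of the technique, not the exact application code):

# Look for trigrams that start with the last two words typed; if none are
# found, back off to bigrams (discounted by lambda), then to unigrams.
backoff_scores <- function(w1, w2, lambda = 0.4) {
  tri_prefix <- paste(w1, w2, "")
  tri <- trigrams[substring(names(trigrams), 1, nchar(tri_prefix)) == tri_prefix]
  if (length(tri) > 0) return(tri)
  bi_prefix <- paste(w2, "")
  bi <- bigrams[substring(names(bigrams), 1, nchar(bi_prefix)) == bi_prefix]
  if (length(bi) > 0) return(lambda * bi)
  lambda^2 * unigrams
}
# head(backoff_scores("i", "love"), 3) gives the three highest-scoring continuations.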
Data Cleaning and preparation steps
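The cleaning steps themselves are shown in the diagram on slide 7; a minimal sketch of this kind of cleaning with the tm package might look like the following (the transformations listed are typical ones and are assumed, not copied from the actual code):

library(tm)
# Build a corpus from the sampled lines and apply standard cleaning steps.
clean_corpus <- VCorpus(VectorSource(sample_lines))
clean_corpus <- tm_map(clean_corpus, content_transformer(tolower))
clean_corpus <- tm_map(clean_corpus, removeNumbers)
clean_corpus <- tm_map(clean_corpus, removePunctuation)
clean_corpus <- tm_map(clean_corpus, stripWhitespace)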
How the Application Works
The data is indexed and listed in the order in which words are most frequently used. The application returns the most frequent use of a word in a sentence in naturally spoken English.
For example, if there are 10 sentences that start with 'I love ', the three results returned will be the words that most frequently follow 'I love ', for example 'I love you', 'I love him', or 'I love her' rather than 'I love cars' or 'I love cows'.
The application goes one step further and returns predictions based on the position of the word in the sentence, as it is most frequently used.
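A minimal sketch of this lookup, assuming the n-gram counts have been reshaped into a data frame with columns prefix, next_word, and freq (these column and object names are illustrative assumptions):

library(dplyr)
# Return the k most frequent words observed after the given phrase.
top_predictions <- function(ngram_df, phrase, k = 3) {
  ngram_df %>%
    filter(prefix == phrase) %>%
    arrange(desc(freq)) %>%
    head(k)
}
# Example: top_predictions(ngram_df, "i love") might rank
# "you", "him", and "her" ahead of "cars" or "cows".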
Application Features
Application Usage
High-Level End-to-End View
The app is deployed on shinyapps.io and can be found here.
Libraries used to develop the product
library(tm)
library(openNLP)
library(stringi)
library(dplyr)
library(stringr)
library(qdap)
library(shiny)
library(shinythemes)
R version 3.2.3 (2015-12-10)
UNIX command-line functions with regular expressions were used to help with some of the cleaning.
Challenges developing the product
The combined corpus, before cleaning, contains 4.3 million lines and 102.4 million words. (This is big data territory.) Because the model was built on a home computer with limited processing power and memory, a very small sample (1%) was used, and a conditional-probability statistical model was applied to predict the next word.
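A minimal sketch of that sampling step (the file name is illustrative; the actual corpus files and seed may differ):

set.seed(1234)                       # for a reproducible sample
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
keep  <- rbinom(length(lines), size = 1, prob = 0.01) == 1
sample_lines <- lines[keep]          # roughly 1% of the raw lines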
Some of the packages were too slow at creating n-grams and others were too slow at cleaning the data; for these steps, UNIX command-line tools were used.
Even with 1% of the corpus, the trigram/bigram Good-Turing smoothing for unseen data exploded beyond the means of a personal computer to process.
Image Credit
Conditional Probability with Markov Assumption formula image (slide 5)
Martin Christian Körner: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction.
Maximum Likelihood and Example (slide 6)
Martin Christian Körner: Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction.
Word Cloud (slide 2) - Abiyu Giday
High Level (slide 3 & 9) - Abiyu Giday
Data Cleaning Steps (slide 7) - Abiyu Giday
Application features (slide 9-11) - Abiyu Giday
References
John Fry - Lecture slides - Bigrams and Trigrams, Boise State University.
Martin Christian Körner - Implementation of Modified Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction. Universität Koblenz-Landau.
Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, and Wouter van Atteveldt - RTextTools: A Supervised Learning Package for Text Classification.