Next Word Prediction App

Data Science Specialization Capstone Project

Johns Hopkins Bloomberg School of Public Health




Author: Abiyu Giday
Date: April 22, 2016
http://abiyu.github.io

Agenda

  • Motivation
  • Background
  • Statistical Modeling
  • Maximum Likelihood & Smoothing
  • Data Cleaning Steps
  • Application Features
  • Application Usage
  • Image Credits
  • References

Motivation


Develop a data product that predicts the most likely next word in an English sentence, based on the preceding words.


Background

  • The product is developed from an English corpus (a large collection of text) that contains news articles, blogs, and tweets.
  • The combined corpus, before cleaning, contains 4.3 million lines and 102.4 million words (big data territory).
  • There are about 1 million words in the English language as of January 1, 2014.
  • Everyday conversation requires roughly 2,500-3,000 actively used words to communicate effectively (with less frequently used passive vocabulary, the total is about 15,000-20,000 words).
  • Words in naturally spoken English are highly correlated: the probability of the next word in a sentence depends on the preceding words.

Statistical Modeling

  • Because English words are highly correlated, the probability of seeing any word w_n given the previous words w_1, w_2, ..., w_(n-1) can be estimated under the Markov assumption [Martin Christian Körner].
  • As an example, given the sentence "the quick brown fox", the probability of "fox" following "the quick brown" can be estimated as sketched below.
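
A minimal sketch of that estimate in R (the counts below are made-up illustrations, not values from the app's corpus): under a trigram Markov assumption the probability of "fox" depends only on the two preceding words, so it reduces to the count of "quick brown fox" divided by the count of "quick brown".

    # Hypothetical n-gram counts observed in a corpus (illustrative values only)
    count_trigram <- c("quick brown fox" = 12)
    count_bigram  <- c("quick brown"     = 40)

    # Markov assumption: P(fox | the quick brown) ~ P(fox | quick brown)
    # Maximum-likelihood estimate: C("quick brown fox") / C("quick brown")
    p_fox <- unname(count_trigram["quick brown fox"] / count_bigram["quick brown"])
    p_fox  # 0.3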

Maximum Likelihood (ML) and Smoothing for Missing Tokens

  • n-grams (word tokens) and their frequencies of occurrence were tabulated from the sampled data (tokens actually seen in the corpus). The Good-Turing discount smoothing technique was applied to estimate values for unseen tokens; however, due to the limitations of a home PC, parts of this step were left out (the permutations of unseen unigram combinations were simply too much data to process on a home PC).
  • After stacking the n-gram tokens, a backoff technique was applied that prefers the trigram, then the bigram, and finally the unigram (a lambda value was selected after a few trials for the best outcome); see the backoff sketch below.
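
A minimal backoff sketch in R, assuming simple lookup tables that map an observed prefix to its most frequent next word (the table names, structure, and toy entries are illustrative, not the app's actual objects):

    # Prefer the trigram table, back off to the bigram table, and finally
    # fall back to the single most frequent unigram.
    predict_next <- function(prefix, trigrams, bigrams, unigrams) {
      words   <- tail(strsplit(tolower(prefix), "\\s+")[[1]], 2)
      tri_key <- paste(words, collapse = " ")   # last two words of the phrase
      bi_key  <- tail(words, 1)                 # last word only

      if (!is.null(trigrams[[tri_key]])) return(trigrams[[tri_key]])
      if (!is.null(bigrams[[bi_key]]))   return(bigrams[[bi_key]])
      unigrams[[1]]
    }

    # Toy tables for illustration
    trigrams <- list("quick brown" = "fox")
    bigrams  <- list("brown"       = "fox")
    unigrams <- list("the")
    predict_next("the quick brown", trigrams, bigrams, unigrams)  # "fox"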

Data Cleaning and Preparation Steps

How the Application Works

  • The n-gram data are indexed and listed in the order in which they are most frequently used. The application returns the words that most frequently follow the given phrase in naturally spoken English.
  • For example, if there are 10 sentences that start with 'I love', the 3 completions returned are the ones that most often follow 'I love', such as 'I love you', 'I love him', or 'I love her', rather than 'I love cars' or 'I love cows'.
  • The application goes one step further and ranks the results by how frequently the word is used in that position in a sentence (a minimal lookup sketch follows this list).
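
For illustration, a small R sketch of that lookup using dplyr (the data frame, column names, and counts are assumptions, not the app's actual tables): rank the continuations observed after a prefix by corpus frequency and return the top 3.

    library(dplyr)

    # Hypothetical frequency table of observed continuations
    completions <- data.frame(
      prefix    = "i love",
      next_word = c("you", "him", "her", "cars", "cows"),
      freq      = c(120, 45, 44, 7, 2),
      stringsAsFactors = FALSE
    )

    top3 <- completions %>%
      filter(prefix == "i love") %>%
      arrange(desc(freq)) %>%
      head(3)
    top3$next_word  # "you" "him" "her"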

Application Features

Application Usage

High-Level End-to-End View

  • The app is deployed on shinyapps.io and can be found here.
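
A minimal Shiny skeleton of the kind of text-in / prediction-out pairing the app uses (the layout, input ids, and theme are assumptions; predict_next and its tables refer to the backoff sketch earlier in this deck):

    library(shiny)
    library(shinythemes)

    ui <- fluidPage(
      theme = shinytheme("flatly"),
      titlePanel("Next Word Prediction"),
      textInput("phrase", "Type a phrase:", value = ""),
      verbatimTextOutput("prediction")
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        if (nchar(trimws(input$phrase)) == 0) return("")
        # predict_next(), trigrams, bigrams, unigrams: see the backoff sketch above
        predict_next(input$phrase, trigrams, bigrams, unigrams)
      })
    }

    shinyApp(ui = ui, server = server)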

Libraries used to develop the product

  • library(tm)
  • library(openNLP)
  • library(stringi)
  • library(dplyr)
  • library(stringr)
  • library(qdap)
  • library(shiny)
  • library(shinythemes)
  • R version 3.2.3 (2015-12-10)
  • Unix command-line tools with regular expressions were used to help with some of the cleaning.

Challenges developing the product

  • The combined corpus, before cleaning, contains 4.3 million lines and 102.4 million words (big data territory). Because the model was built on a home computer with limited processing power and memory, a very small sample (1%) was used, and a conditional-probability statistical model was applied to predict the next word (see the sampling sketch after this list).
  • Some packages were too slow to create n-grams and others were too slow to clean the data; Unix command-line tools were used instead for those steps.
  • Even with 1% of the corpus, the trigram/bigram Good-Turing smoothing for unseen data explodes beyond what a personal computer can process.
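
A hedged sketch of the 1% sampling step in R (the file name is a placeholder for one of the corpus files, not necessarily the exact path used): keep each line with probability 0.01 and write the sample out for modeling.

    set.seed(1234)

    # Read one corpus file (placeholder name) and keep roughly 1% of its lines
    lines   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
    keep    <- rbinom(length(lines), size = 1, prob = 0.01) == 1
    sampled <- lines[keep]

    writeLines(sampled, "en_US.blogs.sample.txt")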

Image Credits

  • Conditional Probability with Markov Assumption formula image (slide 5) - Martin Christian Körner: Implementation of Modified Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction.
  • Maximum Likelihood and Example images (slide 6) - Martin Christian Körner: Implementation of Modified Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction.
  • Word Cloud (slide 2) - Abiyu Giday
  • High Level (slide 3 & 9) - Abiyu Giday
  • Data Cleaning Steps (slide 7) - Abiyu Giday
  • Application features (slide 9-11) - Abiyu Giday

References

  • John Fry - Lecture slides - Bigrams and Trigrams, Boise State University.
  • Martin Christian Körner - Implementation of Modified Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction. Universität Koblenz-Landau.
  • Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, and Wouter van Atteveldt - RTextTools: A Supervised Learning Package for Text Classification.
  • William A. Gale and Geoffrey Sampson - Good-Turing Frequency Estimation Without Tears. http://www.grsampson.net/AGtf1.html
  • Kurt Hornik and Duncan Murdoch - Watch Your Spelling!
  • Kurt Hornik - RWeka Odds and Ends.
  • Dept. of Linguistics, 2009 - Smoothing - Indiana University.
  • Hadley Wickham - stringr Vignette - Character String Processing Facilities.