Capstone Presentation

JP Dunlap
June 9, 2018

Word Prediction Shiny App

Designed to read user-entered text and predict the next word with an n-gram machine learning algorithm, using a large corpus of text as its source.

Design goals include accuracy and speed.

Background

As the head of the claims analysis division, you have tasked us with finding ways to increase the efficiency of the RPA exception process, especially for incomplete open-ended responses. The approach we agreed on was to use machine learning to predict the missing or illegible word from the previous three to five words that are available to us. This is our initial proof-of-concept prototype.

The approach taken to predict the next or missing word uses something called the “Stupid Backoff Algorithm.” Don't be fooled by its name. It was originally described in 2007 by Brants, Popat, Xu, Och, and Dean, all scientists at Google, in an article titled “Large Language Models in Machine Translation.” It is one of several approaches to using n-gram language models for word prediction. In the simplest sense, an n-gram language model uses the last n − 1 words to predict the next word. It turns out that the Stupid Backoff approach is cheaper to compute than many alternatives and, on large corpora, comes close to the accuracy of more sophisticated smoothing methods.
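For reference, the scoring rule from the Brants et al. article can be written as below. This is a sketch of the published formula, not a description of our app's internals: f(·) is the number of times a word sequence appears in the corpus, N is the total number of words in the corpus, and α is a fixed backoff factor (the article used 0.4; we assume the same value here).

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
\alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```

In words: if the full phrase has been seen, score the word by its relative frequency. Otherwise, drop the earliest word of the context, multiply by α, and try again, ending at single-word frequencies. The results are scores rather than true probabilities, because they are not renormalized to sum to one.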

The Process

To create an application that simulates the process we will use to augment our RPA exception process, we chose to build a Shiny app written in R. Shiny can produce a relatively small, portable app that runs on multiple platforms.

The first step of the process was collecting the data used to build the model. In this case we used a corpus of over 3 million words. The corpus was preprocessed to eliminate non-standard characters, non-English words, and other irregular words. The text was then split into 1-grams, 2-grams, 3-grams, and 4-grams, each with a score calculated using the Stupid Backoff Algorithm (strictly speaking these are relative-frequency scores rather than normalized probabilities). This process is computationally intensive, but fortunately it only needs to be done once. Once the n-grams are calculated, a relatively small database is built.
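The counting step described above can be sketched in a few lines. This is an illustrative Python sketch, not the R code used for the app: the function names `tokenize` and `count_ngrams` and the three-sentence corpus are hypothetical, and the cleaning here only strips non-letter characters (it does not attempt to detect non-English words).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def count_ngrams(sentences, max_n=4):
    """Return {n: Counter of n-word tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(1, max_n + 1):
            # Slide a window of n words across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical three-sentence corpus, for illustration only.
corpus = [
    "Once again the best team won.",
    "The best day of the year!",
    "Once again the rain came.",
]
counts = count_ngrams(corpus)
print(counts[2][("the", "best")])  # how often "the best" occurs
```

Counting within each sentence separately keeps n-grams from spanning sentence boundaries. The resulting tables are what would be stored in the n-gram database.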

The n-gram database and two very small pieces of R code make up the Shiny app, which is uploaded to a public app server for our use in prototyping.

The App

Word Prediction App

The app itself is quite simple: the user enters a phrase of as many words as they wish, and the app immediately begins predicting the four words with the highest probability of coming next.
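The prediction step can be sketched as follows. Again this is illustrative Python rather than the app's R code; `build_counts`, `predict`, and the toy token list are hypothetical, and the backoff factor of 0.4 is the value reported by Brants et al., assumed here. The sketch keeps only the last three words of the phrase (the most a 4-gram model can use), scores candidates from the longest context first, and applies the backoff factor each time the context is shortened.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor reported by Brants et al. (2007); assumed here

def build_counts(tokens, max_n=4):
    """Count every n-gram (as a tuple of words) for n = 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(counts, phrase, k=4, max_n=4):
    """Return up to k (word, score) pairs, highest Stupid Backoff score first."""
    words = tuple(phrase.lower().split())
    context = words[-(max_n - 1):] if max_n > 1 else ()
    total = sum(c for gram, c in counts.items() if len(gram) == 1)
    scores = {}
    penalty = 1.0
    while True:
        denom = counts.get(context, 0) if context else total
        if denom:
            # A real app would index n-grams by prefix instead of scanning.
            for gram, c in counts.items():
                if len(gram) == len(context) + 1 and gram[:-1] == context:
                    # Keep the score from the longest context that saw the word.
                    scores.setdefault(gram[-1], penalty * c / denom)
        if not context:
            break
        context = context[1:]  # back off: drop the earliest context word
        penalty *= ALPHA
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]

# Hypothetical toy corpus, already tokenized.
tokens = ("once again the best team won the best day "
          "once again the rain came").split()
counts = build_counts(tokens)
for word, score in predict(counts, "once again the"):
    print(word, round(score, 4))
```

Ties are broken alphabetically so the output is deterministic. When the context has never been seen, the loop falls through to single-word frequencies, so the function still returns suggestions.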

In our application, we would normally feed it the immediately preceding three or four words in an effort to predict the missing or illegible next word.

Test and Recommendation

For the proof of concept, we ran five tests pulled directly from the RPA exceptions log. The app suggested the next word in each case.

  1. (original text) after he said my (predicted) name
  2. (original text) once again the (predicted) best
  3. (original text) the last thing I (predicted) want
  4. (original text) once again the (predicted) best
  5. (original text) she went into labor at (predicted) home

In each of these cases, the Stupid Backoff Algorithm predicted a reasonable next word.

Our recommendation is to implement this approach as a bolt-on to the RPA exception solution engine and monitor it for 90 days using the existing process. If the accuracy proves reasonable, we can then end the monitoring and reassign those individuals to more value-added work.