Shiny Text Prediction

6/18/2017

Problem

Create a lightweight text prediction application that anticipates the next word in a phrase.
Given three large (~200 MB) text files scraped from blogs, news and Twitter, construct a codebase for cleaning and crunching the text and showing the best guesses for the next word.
Use n-grams and a Katz back-off model to estimate the next word based on observed frequencies.

Solution

N-grams

Deconstruct the text documents into n-grams, 2 to 6 words in length.
Key the last word of each n-gram as next_word, and tally how often it follows the remaining words, gram, taken as a group.
Sort and filter the tallies to keep the top entries required to cover >50% of all cases observed.
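
The tallying and filtering steps above can be sketched in base R. This is a minimal illustration, not the code in ngram_algorithim.R: the function names tokenize(), tally_ngrams() and keep_top_coverage() are hypothetical, and the simple regex tokenizer is an assumed stand-in for the tidytext cleaning step.

```r
# Split one string into lowercase word tokens (crude stand-in for real cleaning).
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Tally (gram, next_word) pairs for n-grams of n words (n >= 2).
# The first n - 1 words form `gram`; the last word is `next_word`.
tally_ngrams <- function(texts, n) {
  pairs <- lapply(texts, function(text) {
    words <- tokenize(text)
    if (length(words) < n) return(NULL)
    starts <- seq_len(length(words) - n + 1)
    data.frame(
      gram = vapply(starts, function(i) {
        paste(words[i:(i + n - 2)], collapse = " ")
      }, character(1)),
      next_word = words[starts + n - 1],
      freq = 1L,
      stringsAsFactors = FALSE
    )
  })
  pairs <- do.call(rbind, Filter(Negate(is.null), pairs))
  if (is.null(pairs)) {
    return(data.frame(gram = character(0), next_word = character(0),
                      freq = integer(0), stringsAsFactors = FALSE))
  }
  # Sum the counts per (gram, next_word) pair, most frequent first.
  tally <- aggregate(freq ~ gram + next_word, data = pairs, FUN = sum)
  tally[order(-tally$freq, tally$gram, tally$next_word), ]
}

# Keep the most frequent rows needed to cover more than `coverage` of all cases.
keep_top_coverage <- function(tally, coverage = 0.5) {
  if (nrow(tally) == 0) return(tally)
  tally <- tally[order(-tally$freq), ]
  cum_share <- cumsum(tally$freq) / sum(tally$freq)
  n_keep <- which(cum_share > coverage)[1]
  if (is.na(n_keep)) n_keep <- nrow(tally)
  tally[seq_len(n_keep), ]
}

texts <- c("the cat sat on the mat", "the cat ran", "the cat sat down")
bigrams <- tally_ngrams(texts, 2)
top_bigrams <- keep_top_coverage(bigrams)
```

On real data the same tally would be repeated for each n-gram length, with the filtered tables stored together as the lookup.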

Back-off

Use a Katz back-off model to work backwards from the most specific prediction gram (i.e. 5 words) to the least specific (i.e. 1 word).
For every gram that doesn't produce a match in our lookup, remove the first word from the gram and search again (e.g. "sometimes there isn't a match" -> "there isn't a match").
If we get all the way back to a single word and still can't produce a guess from our lookup, use the most common words in the lookup.
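
The search loop described above might look roughly like the sketch below. It only illustrates the drop-the-first-word search and is not the actual best_guessN(): the function name backoff_guess(), the shape of the lookup (a named list mapping each gram to its next words, most frequent first) and the sample entries are all assumptions made for the example.

```r
# Hypothetical lookup: names are grams, values are next words, most frequent first.
lookup <- list(
  "there isn't a" = c("match", "way"),
  "isn't a"       = c("problem", "match"),
  "a"             = c("lot", "few")
)

# Assumed fallback: the most common words overall, used when nothing matches.
common_words <- c("the", "to", "and")

backoff_guess <- function(input, lookup, common_words, max_words = 5) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Start with the most specific gram: at most the last `max_words` words.
  words <- tail(words, max_words)
  while (length(words) > 0) {
    gram <- paste(words, collapse = " ")
    guess <- lookup[[gram]]
    if (!is.null(guess)) return(guess)
    # No match: drop the first word and search again with a shorter gram.
    words <- words[-1]
  }
  # Even the single-word gram failed, so fall back to the common words.
  common_words
}

guess_specific <- backoff_guess("sometimes there isn't a", lookup, common_words)
guess_backed_off <- backoff_guess("what a", lookup, common_words)
guess_fallback <- backoff_guess("xyzzy", lookup, common_words)
```

Here "sometimes there isn't a" misses as a four-word gram and matches after one word is dropped, "what a" backs off to the single word "a", and "xyzzy" falls through to the common words.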

Code

ngram_algorithim.R

Reads in the raw data, cleans it using library(tidytext) and constructs a large list object, ngrams, holding the 1-5 word grams and the tallies of their most frequent next words. Also includes best_guessN(), the function that performs the back-off approximation for the user input.

ui.R

Simple sidebar layout, themed with "United".

server.R

Reads in the stored lookup list from ngram_algorithim.R via RDS. Then uses reactive objects to respond to user input and trigger best_guessN() to calculate the most probable next words. The output plot is in turn triggered to build with library(ggplot2) using geom_col() + geom_label().

Shiny

The Shiny server waits until the user starts typing before doing any computing. All calculations are reactive, so no button presses are required (although it takes a second to complete all of the output).
The sidebar panel holds the raw text input box and displays the current parsed input text being used for lookup.
The main panel shows the most popular next words and their relative percentage of occurrences based on the current parsed input text.
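
The reactive flow described above can be sketched as a stripped-down app. This is an assumed outline, not the deployed ui.R and server.R: the input and output ids ("phrase", "parsed", "guesses") are made up, guess_next() is a placeholder returning fixed values in place of the real lookup, and the "United" theme is left out.

```r
library(shiny)
library(ggplot2)

# Placeholder predictor (hypothetical): stands in for the back-off lookup and
# always returns the same words with made-up percentages.
guess_next <- function(text) {
  data.frame(next_word = c("the", "to", "and"),
             percent = c(50, 30, 20),
             stringsAsFactors = FALSE)
}

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Type a phrase"),
      textOutput("parsed")    # shows the parsed text used for the lookup
    ),
    mainPanel(plotOutput("guesses"))
  )
)

server <- function(input, output) {
  # Parsed input: lowercased and trimmed; updates on every keystroke.
  parsed <- reactive(trimws(tolower(input$phrase)))

  # req() holds the computation back until the user has typed something.
  guesses <- reactive({
    req(nzchar(parsed()))
    guess_next(parsed())
  })

  output$parsed <- renderText(parsed())
  output$guesses <- renderPlot({
    ggplot(guesses(), aes(x = reorder(next_word, percent), y = percent)) +
      geom_col() +
      geom_label(aes(label = paste0(percent, "%"))) +
      coord_flip()
  })
}

# shinyApp(ui, server)  # uncomment to launch the app locally
```

Because the plot depends on guesses(), which depends on parsed(), typing in the text box is enough to refresh everything without a button.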

https://nathanday.shinyapps.io/TextPred_Shiny/