6/18/2017

Problem

  • Create a lightweight text prediction application that anticipates the next word in a phrase.

  • Given three large (~200 MB) txt files from blog, news and Twitter scrapes, construct a codebase for cleaning, crunching and showing the best guesses for the next word.

  • Utilize n-grams and a Katz back-off model to estimate the next word based on observed frequencies.

Solution

N-grams

  • Deconstruct the text documents into n-grams, from 2 to 6 words in length.
  • Key the last word of each n-gram as next_word and group the remaining leading words as gram, then tally how often each next_word follows each gram.
  • Sort and filter the tallies to keep the top entries required to cover >50% of all cases observed.
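
The steps above can be sketched in base R. This is only an illustration under assumptions: the project itself uses library(tidytext), the helper names tokenize(), build_ngrams() and keep_top_coverage() are hypothetical, and the 50% coverage cut is applied here across the whole tally (it could equally be applied per gram).

```r
# Minimal base-R sketch; helper names are hypothetical, not from the project.

# Split text into lowercase word tokens, keeping apostrophes.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Slide an n-word window over the tokens. The first n - 1 words become
# gram, the last word becomes next_word, and identical pairs are tallied.
build_ngrams <- function(words, n) {
  if (length(words) < n) {
    return(data.frame(gram = character(0), next_word = character(0),
                      freq = integer(0), stringsAsFactors = FALSE))
  }
  starts <- seq_len(length(words) - n + 1)
  gram <- vapply(starts,
                 function(i) paste(words[i:(i + n - 2)], collapse = " "),
                 character(1))
  pairs <- data.frame(gram = gram, next_word = words[starts + n - 1],
                      freq = 1L, stringsAsFactors = FALSE)
  tally <- aggregate(freq ~ gram + next_word, data = pairs, FUN = sum)
  tally[order(-tally$freq), ]
}

# Keep the most frequent rows until their cumulative share exceeds coverage.
# Assumption: the share is computed over the whole tally, not per gram.
keep_top_coverage <- function(tally, coverage = 0.5) {
  if (nrow(tally) == 0) return(tally)
  tally <- tally[order(-tally$freq), ]
  cum_share <- cumsum(tally$freq) / sum(tally$freq)
  n_keep <- which(cum_share > coverage)[1]
  tally[seq_len(n_keep), ]
}

words   <- tokenize("The cat sat on the mat and the cat ran")
bigrams <- build_ngrams(words, n = 2)
top     <- keep_top_coverage(bigrams, coverage = 0.5)
```

Calling build_ngrams() once for each n from 2 to 6 and storing the filtered results in a list would give a lookup structure of the kind described above.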

Back-off

  • Use a Katz back-off model to work backwards from the most specific prediction gram (5 words) to the least specific (1 word).
  • For every gram that doesn't produce a match in our lookup, remove the first word from the gram and search again (e.g. "sometimes there isn't a match" -> "there isn't a match").
  • If we get all the way back to a single word and still can't produce a guess from our lookup, use the most common words in the lookup.
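
A rough sketch of this lookup loop in base R follows. The lookup list, the most_common fallback vector and the function name backoff_guess() are hypothetical stand-ins, not the project's ngrams object or best_guessN(). Note that this simple version returns the first match it finds and does no probability discounting, so it illustrates only the back-off search order rather than a full Katz model.

```r
# Hypothetical lookup: a named list keyed by gram, each entry holding
# next words ordered from most to least frequent.
lookup <- list(
  "there isn't a match" = c("for", "in"),
  "a match"             = c("made", "for"),
  "match"               = c("the", "made")
)

# Hypothetical fallback: the most common words overall.
most_common <- c("the", "to", "and")

backoff_guess <- function(words, lookup, fallback, max_len = 5) {
  # Start from the most specific gram: the last max_len words.
  words <- tail(words, max_len)
  while (length(words) > 0) {
    key <- paste(words, collapse = " ")
    hit <- lookup[[key]]
    if (!is.null(hit)) return(hit)
    # No match: drop the first word and search again.
    words <- words[-1]
  }
  # Nothing matched, even for a single word.
  fallback
}

guess <- backoff_guess(c("sometimes", "there", "isn't", "a", "match"),
                       lookup, most_common)
```

Here the five-word gram has no entry, so the function drops "sometimes" and returns the guesses stored under "there isn't a match".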

Code

ngram_algorithim.R

Reads in the raw data, cleans it using library(tidytext) and constructs a large list object, ngrams, holding the 1-5 word grams and the tallies of their most frequent next words. Also includes best_guessN(), the function that performs the back-off approximation for the user input.

ui.R

Simple sidebar layout, themed with "United".

server.R

Uses RDS to read in the stored lookup list from ngram_algorithim.R, then uses reactive objects to respond to user input and trigger best_guessN() to calculate the most probable next words. The output plot is in turn rebuilt with library(ggplot2) using geom_col() + geom_label().

Shiny

  • The Shiny server does no computing until the user starts typing. All calculations are reactive, so no button presses are required (although it takes a second to complete all of the output).

  • The sidebar panel holds the raw text input box and displays the current parsed input text being used for lookup.

  • The main panel shows the most popular next words and their relative percentage of occurrences based on the current parsed input text.
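
The two values shown in the panels could be derived roughly as below. This is a base-R sketch under assumptions: parse_input() and relative_pct() are hypothetical names, the cleaning rules (lowercasing, stripping everything but letters and apostrophes) are guesses rather than the app's actual rules, and the frequencies are made-up illustration values.

```r
# Reduce raw input to the last max_len cleaned words, the part that
# would be used for the lookup. Cleaning rules here are an assumption.
parse_input <- function(text, max_len = 5) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  tail(words, max_len)
}

# Convert raw next-word tallies into relative percentages.
relative_pct <- function(freq) {
  round(100 * freq / sum(freq), 1)
}

parsed <- parse_input("Well, sometimes there ISN'T a match!")
# Illustration-only tallies for three candidate next words.
pct <- relative_pct(c(the = 6, a = 3, one = 1))
```

In a Shiny app these calls would typically sit inside reactive expressions so that both values update as the user types.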

https://nathanday.shinyapps.io/TextPred_Shiny/