What's the Next Word????

Julie GW
April 20, 2015

Task: Create a Shiny app to predict the next word given a phrase or partial sentence

Goals & Considerations

  • Balance Speed and Accuracy

    • It was suggested that the app might need to be suitable for a mobile platform (small in data and screen complexity)
    • Accuracy can be measured in multiple ways, and it is ultimately in the eye of the user. The “right” answer from the data set may not be what the user is “thinking”, so is it still right? The app needs to respond with a reasonable output and be robust to all inputs, and the algorithm needs to be configured to give the best possible answer with the data available.
  • Intuitive, Inviting & Upbeat

    • Colorful (warm bright color) with an image
    • Simple in layout and easy to use
  • Family Friendly (no profanity allowed!)

Special Features

  • Markov Models (n-grams) are the basis for this algorithm

    • Efficiency is improved by ordering the n-grams from most to least probable, stripping out everything but the list of terms, and using a search that returns the first match.
    • Accuracy is addressed by nesting the n-grams so that the highest-order n-gram is searched first. Back-off is used to fall through to lower-order n-grams when no match is found.
  • Smoothing techniques balanced infrequent terms against data size

    • Accuracy: Replacing every term with a frequency of 1 with “unk” can give too high a proportion of “unk” in a trimmed data set. So 1000 of the frequency-1 terms were replaced with “unk” before the n-grams were built, and the rest were deleted from the data sets in their n-gram state.
    • Robustness: The algorithm was made “open vocabulary” by creating a dictionary of only the terms in the n-grams (now including “unk”) to check input words against. Input words not in the list were changed to “unk”.
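
The data preparation described above can be sketched roughly as follows. This is a minimal illustration on a toy corpus, not the app's actual preprocessing code: the names `corpus`, `max_unk`, `make_ngrams` and `build_ngrams` are hypothetical, the cap of 2 stands in for the 1000 mentioned above, and "deleted in their n-gram state" is interpreted here as dropping any n-gram that contains one of the remaining frequency-1 terms.

```r
# Toy corpus (assumption); the real app used a much larger cleaned data set.
corpus <- c("the cat sat on the mat",
            "the cat ate the fish",
            "a dog sat on the rug")
max_unk <- 2  # hypothetical cap; the slide mentions 1000 for the real data

tokens <- strsplit(corpus, " ", fixed = TRUE)

# Find terms that occur once; map a capped number of them to "unk".
term_freq <- table(unlist(tokens))
rare      <- names(term_freq)[term_freq == 1]
to_unk    <- head(rare, max_unk)
dropped   <- setdiff(rare, to_unk)
tokens    <- lapply(tokens, function(w) ifelse(w %in% to_unk, "unk", w))

# Build all n-grams of order n from one tokenized line.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Drop n-grams containing the remaining rare terms, order the rest from
# most to least frequent, and keep only the text (counts are stripped).
build_ngrams <- function(tokens, n, dropped) {
  grams <- unlist(lapply(tokens, make_ngrams, n = n))
  keep  <- !vapply(strsplit(grams, " ", fixed = TRUE),
                   function(w) any(w %in% dropped), logical(1))
  counts <- sort(table(grams[keep]), decreasing = TRUE)
  names(counts)
}

bigrams <- build_ngrams(tokens, 2, dropped)

# Vocabulary: only the terms that survive in the n-grams, including "unk".
vocab <- unique(unlist(strsplit(bigrams, " ", fixed = TRUE)))
```

Because the list is already ordered by frequency, a lookup only needs the first match, and the counts themselves do not have to be shipped with the app.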

Algorithm

The algorithm embedded in server.R has 3 parts:

  1. Load preprocessed, cleaned, ordered and otherwise minimized data, including a vocabulary list.

  2. The prediction algorithm as a function

    • Stores a copy of the input, then preprocesses the input and grabs the last 3 words
    • Checks against vocabulary list, adds “unk” as needed
    • Uses nested regular expressions to find best match
    • Pulls out the last word from the n-gram and pastes it on the end of the original sentence
  3. Server interaction, which reads the input, renders the output, and makes both reactive
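
Part 2 might look something like the sketch below. It is an illustration under stated assumptions rather than the code in server.R: the n-gram lists, `vocab`, and the function name `predict_next` are hypothetical, the lists are assumed to be already ordered from most to least frequent, and two choices are not stated in the steps above (skipping matches that would predict "unk", and falling back to the most frequent unigram when nothing matches).

```r
# Hypothetical pre-ordered n-gram lists (most to least frequent) and vocabulary.
quadgrams <- c("sat on the mat", "cat sat on the")
trigrams  <- c("on the mat", "sat on the", "the cat sat")
bigrams   <- c("the cat", "on the", "the unk", "unk the", "cat sat", "sat on")
unigrams  <- c("the", "unk", "cat")
vocab     <- c("the", "cat", "sat", "on", "mat", "unk")

predict_next <- function(sentence) {
  # Preprocess a copy of the input; the original sentence is kept for output.
  clean <- gsub("[^a-z' ]", " ", tolower(sentence))
  words <- strsplit(trimws(clean), " +")[[1]]
  words <- words[nzchar(words)]

  # Grab the last 3 words and map out-of-vocabulary words to "unk".
  context <- tail(words, 3)
  context[!(context %in% vocab)] <- "unk"

  # Nested search: highest-order list first, backing off one word at a time.
  tables <- list(quadgrams, trigrams, bigrams)
  next_word <- NA_character_
  while (length(context) > 0 && is.na(next_word)) {
    pattern <- paste0("^", paste(context, collapse = " "), " ")
    hits <- grep(pattern, tables[[4 - length(context)]], value = TRUE)
    hits <- hits[!grepl(" unk$", hits)]        # assumption: never predict "unk"
    if (length(hits) > 0) {
      next_word <- sub("^.* ", "", hits[1])    # last word of the first match
    } else {
      context <- context[-1]                   # back off: drop the oldest word
    }
  }

  # Assumption: fall back to the most frequent real unigram.
  if (is.na(next_word)) next_word <- setdiff(unigrams, "unk")[1]

  paste(trimws(sentence), next_word)
}

predict_next("My dog sat on the")
```

With these toy lists, "My" and "dog" are outside the vocabulary, but only the last three words ("sat on the") are used, so the 4-gram list supplies the match.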

Instructions & Link

Click on the link below to go to the app. Type a phrase or partial sentence into the “Text Input” box and click the “Predict” button. Your phrase, with the predicted next word appended, will appear in the “Prediction” box below. Note that you only need to type one word to get a prediction. link