Shiny App Pitch: Predictive Text

J. Mark Shoun
24 March 2017

Predictive Text

[Application screenshot]

  • My application predicts the next word in a stream of text in real time.
  • Provides a list of the five most likely next words.

Model Design - Data Input

  • Input corpora were transformed as follows:
    • All text converted to lowercase.
    • Text split into tokens on all non-alphabetic characters.
    • Words on Shutterstock's list of offensive words, words not in the standard Linux dictionary, and words appearing fewer than 5 times were replaced with a special [OTHER] token, leaving a vocabulary of 33,075 words (see the preprocessing sketch after this list).
  • Training data constructed as follows:
    • Random 10% of documents in corpus selected.
    • Predictors: 4 most recent words. If fewer than 4 words available, missing words replaced with special [BLANK] token.
    • Response: the next word in the document. All examples with the [OTHER] token as the response were removed (see the example-construction sketch below).
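
To make the preprocessing concrete, the sketch below expresses it in base R. It is illustrative rather than the app's actual code: offensive_words, dictionary_words, and min_count are placeholder names standing in for Shutterstock's list, the standard Linux dictionary, and the frequency cutoff.

    # Illustrative preprocessing sketch in base R (not the app's exact code).
    # `docs` is a character vector of raw documents.
    tokenize <- function(doc) {
      tokens <- unlist(strsplit(tolower(doc), "[^a-z]+"))  # lowercase, split on non-alphabetic runs
      tokens[nzchar(tokens)]                               # drop empty strings left by the split
    }

    build_vocab <- function(docs, offensive_words, dictionary_words, min_count = 5) {
      tokens <- unlist(lapply(docs, tokenize))
      counts <- table(tokens)
      keep <- counts >= min_count &
        names(counts) %in% dictionary_words &
        !(names(counts) %in% offensive_words)
      names(counts)[keep]            # roughly 33,075 words in the real corpus
    }

    # Replace any out-of-vocabulary token with the special [OTHER] token.
    map_to_vocab <- function(tokens, vocab) {
      ifelse(tokens %in% vocab, tokens, "[OTHER]")
    }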
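
Building the training pairs can then be sketched with a simple sliding window; the function name make_examples and the data frame layout are assumptions for illustration only.

    # Illustrative sketch of building (4-word context, next-word) training pairs.
    # `tokens` is one document's tokens after vocabulary mapping (see above).
    make_examples <- function(tokens, context_size = 4) {
      padded <- c(rep("[BLANK]", context_size), tokens)   # pad so early words get full contexts
      rows <- lapply(seq_along(tokens), function(i) {
        context <- padded[i:(i + context_size - 1)]       # the 4 words preceding position i
        c(context, tokens[i])                             # predictors followed by the response
      })
      df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
      names(df) <- c(paste0("w", 1:context_size), "response")
      df[df$response != "[OTHER]", ]                      # drop examples whose response is [OTHER]
    }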

Model Design - Model Fitting

  • Model Structure:
    • Each word in vocabulary represented as a 32-dimensional vector.
    • Model is a multinomial logistic regression with input of:
      • The vector for each of the four preceding words, plus
      • The mean vector of the four preceding words (4 × 32 + 32 = 160 predictors total).
    • Output is a predicted probability for each word in the vocabulary.
  • Model Fitting:
    • Two sets of parameters must be estimated: the vector representation of every word in the vocabulary, and the weights of the multinomial regression.
    • Vector representations are determined by a word2vec model.
    • Multinomial regression fit via stochastic gradient descent using TensorFlow in R (see the sketches after this list).
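
To make the structure concrete, the sketch below shows how a model of this form scores a four-word context in base R. The objects embeddings (a vocabulary-by-32 matrix with words as row names), W, and b stand in for the fitted word2vec vectors and regression weights; they are assumptions for illustration, not the author's actual objects.

    # Illustrative scoring sketch (not the fitted model itself).
    # `embeddings`: vocab_size x 32 matrix of word vectors, rows named by word
    #               (including the [BLANK] and [OTHER] tokens).
    # `W`: 160 x vocab_size weight matrix; `b`: length-vocab_size bias vector.
    score_context <- function(context, embeddings, W, b) {
      vecs <- embeddings[context, , drop = FALSE]   # 4 x 32: one vector per preceding word
      x <- c(t(vecs), colMeans(vecs))               # 4 * 32 + 32 = 160 predictors
      logits <- drop(x %*% W) + b                   # multinomial (softmax) regression
      exp_l <- exp(logits - max(logits))            # numerically stable softmax
      exp_l / sum(exp_l)                            # one probability per word in the vocabulary
    }

    # Top five predictions for a hypothetical context:
    # probs <- score_context(c("i", "want", "to", "go"), embeddings, W, b)
    # head(sort(probs, decreasing = TRUE), 5)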
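
For the fitting step, one way to express a softmax regression of this size in R is through the keras interface to TensorFlow, as sketched below. This is an assumption about tooling rather than the author's exact code, and argument names can vary across keras versions.

    library(keras)

    vocab_size   <- 33075   # vocabulary size from the preprocessing step
    n_predictors <- 160     # 4 word vectors plus their mean, 32 dimensions each

    # A single dense layer with softmax output is a multinomial logistic
    # regression over the vocabulary.
    model <- keras_model_sequential() %>%
      layer_dense(units = vocab_size, activation = "softmax",
                  input_shape = c(n_predictors))

    model %>% compile(
      optimizer = optimizer_sgd(learning_rate = 0.1),   # plain stochastic gradient descent
      loss = "sparse_categorical_crossentropy"          # responses as integer word IDs
    )

    # x_train: numeric matrix with 160 columns; y_train: integer word IDs.
    # model %>% fit(x_train, y_train, epochs = 5, batch_size = 512)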

Application

  • Model scores are updated in real time as text is typed (see the Shiny sketch below).
  • App returns the top five predictions in order, each with a confidence weight.
  • Average latency is < 1 second.
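
A minimal Shiny skeleton along these lines is sketched below. It is illustrative only; predict_next_words is a hypothetical helper standing in for the scoring code above, not a function the app actually exposes.

    library(shiny)

    # Minimal Shiny skeleton for real-time prediction (illustrative only).
    # `predict_next_words()` is a hypothetical helper returning a data frame
    # with columns `word` and `confidence` for the five most likely next words.

    ui <- fluidPage(
      textAreaInput("text", "Type your text:", rows = 3),
      tableOutput("predictions")
    )

    server <- function(input, output) {
      output$predictions <- renderTable({
        req(nzchar(input$text))                   # wait until there is some text
        predict_next_words(input$text, n = 5)     # re-runs whenever the text changes
      })
    }

    shinyApp(ui, server)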