Shiny App Pitch: Predictive Text
J. Mark Shoun
24 March 2017
Predictive Text
- My application predicts the next word in a stream of text in real time.
- Provides a list of the top five most likely next words.
Model Design - Data Input
- Input corpora were transformed as follows (sketched after this list):
- All text converted to lowercase.
- Text split into tokens on all non-alphabetic characters.
- Words on Shutterstock's list of offensive words, not in the standard Linux dictionary, or appearing fewer than 5 times were replaced with a special [OTHER] token, leaving 33,075 words in the vocabulary.
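A minimal R sketch of this preprocessing step, under stated assumptions: the function name and the dictionary, offensive, and frequent_words inputs are illustrative placeholders, not the project's actual objects.

```r
# Illustrative preprocessing sketch; input objects are assumed to exist:
#   dictionary     - character vector of standard Linux dictionary words
#   offensive      - Shutterstock's offensive-word list
#   frequent_words - words appearing at least 5 times in the corpus
preprocess <- function(text, dictionary, offensive, frequent_words) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))  # lowercase, split on non-alphabetic
  tokens <- tokens[nchar(tokens) > 0]                   # drop empty strings left by the split
  keep <- tokens %in% dictionary &
    !(tokens %in% offensive) &
    tokens %in% frequent_words
  tokens[!keep] <- "[OTHER]"                            # everything else becomes [OTHER]
  tokens
}
```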
- Training data constructed as follows (sketched below):
- Random 10% of documents in corpus selected.
- Predictors: 4 most recent words. If fewer than 4 words available, missing words replaced with special [BLANK] token.
- Response: next word in document. All examples with [OTHER] token as response removed.
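A sketch of how the 4-gram training examples could be assembled from a tokenised document; make_examples and the column names w1 to w4 are illustrative, not the project's actual code.

```r
# Illustrative construction of training examples: 4 preceding words + response.
make_examples <- function(tokens, context = 4) {
  padded <- c(rep("[BLANK]", context), tokens)          # pad the start with [BLANK]
  rows <- t(sapply(seq_along(tokens), function(i) {
    c(padded[i:(i + context - 1)], tokens[i])           # 4 preceding words, then response
  }))
  df <- as.data.frame(rows, stringsAsFactors = FALSE)
  names(df) <- c("w1", "w2", "w3", "w4", "response")
  df[df$response != "[OTHER]", ]                        # drop examples with [OTHER] responses
}
```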
Model Design - Model Fitting
- Model Structure:
- Each word in vocabulary represented as a 32-dimensional vector.
- Model is a multinomial logistic regression whose inputs are:
- The vector for each of the four preceding words, plus
- The mean vector of the four preceding words (4 × 32 + 32 = 160 predictors total).
- Output is the predicted probability of each word in the vocabulary (input construction sketched below).
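How the 160 predictors could be assembled for a single example, assuming embeddings is a (vocabulary size) × 32 matrix of word vectors; this is a sketch, not the original implementation.

```r
# Build the 160-dimensional input for one example.
# context_ids: integer indices of the 4 preceding words; embeddings: vocab x 32 matrix.
make_input <- function(context_ids, embeddings) {
  vecs <- embeddings[context_ids, , drop = FALSE]  # 4 x 32 matrix of word vectors
  c(t(vecs), colMeans(vecs))                       # 4 * 32 + 32 = 160 predictors
}
```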
- Model Fitting:
- Fitting estimates two sets of parameters:
- the vector representation of every word in the vocabulary, and
- the weights of the multinomial regression.
- Vector representation determined by a word2vec model.
- Multinomial regression fit via stochastic gradient descent using TensorFlow in R (sketched below).
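A minimal sketch of the multinomial regression fit using the TF1-style API of the tensorflow R package; the variable names, learning rate, and batch handling are assumptions rather than the project's actual code.

```r
library(tensorflow)

n_in    <- 160L     # 4 word vectors plus their mean
n_vocab <- 33075L   # vocabulary size

# Placeholders for a minibatch of predictor vectors and response word indices
x <- tf$placeholder(tf$float32, shape(NULL, n_in))
y <- tf$placeholder(tf$int32,   shape(NULL))

# Multinomial (softmax) regression weights and biases
W <- tf$Variable(tf$zeros(shape(n_in, n_vocab)))
b <- tf$Variable(tf$zeros(shape(n_vocab)))
logits <- tf$matmul(x, W) + b

loss <- tf$reduce_mean(
  tf$nn$sparse_softmax_cross_entropy_with_logits(labels = y, logits = logits))
train_step <- tf$train$GradientDescentOptimizer(0.5)$minimize(loss)

sess <- tf$Session()
sess$run(tf$global_variables_initializer())
# For each minibatch (batch_x, batch_y), one SGD step:
#   sess$run(train_step, feed_dict = dict(x = batch_x, y = batch_y))
```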
Application
- Model scores are updated in real time as text is input.
- App displays the top five predictions in order, each with a confidence weight (reactive wiring sketched below).
- Average latency is < 1 second.
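A rough sketch of the app's reactive wiring in Shiny; predict_next() stands in for the fitted model's scoring function and is a hypothetical name, not the actual code.

```r
library(shiny)

ui <- fluidPage(
  textInput("text", "Type here:"),
  tableOutput("predictions")
)

server <- function(input, output) {
  # Re-scores the model whenever the text input changes
  output$predictions <- renderTable({
    predict_next(input$text, n = 5)   # top five words with confidence weights
  })
}

shinyApp(ui, server)
```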