Shiny App Pitch: Predictive Text
J. Mark Shoun
24 March 2017
Predictive Text
- My application predicts the next word in a stream of text in real time.
- Provides a list of the top five most likely next words.
Model Design - Data Input
- Input corpora were transformed as follows (sketched after this list):
- All text converted to lowercase.
- Text split into tokens on all non-alphabetic characters.
- Words on Shutterstock's list of offensive words, not in the standard Linux dictionary, or appearing fewer than 5 times were replaced with a special [OTHER] token, leaving 33,075 words in the vocabulary.
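A minimal R sketch of this preprocessing step, under stated assumptions: the function name and the dictionary, offensive, and frequent_words inputs are illustrative placeholders, not the project's actual objects.

```r
# Illustrative preprocessing sketch; input objects are assumed to exist:
#   dictionary     - character vector of standard Linux dictionary words
#   offensive      - Shutterstock's offensive-word list
#   frequent_words - words appearing at least 5 times in the corpus
preprocess <- function(text, dictionary, offensive, frequent_words) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))  # lowercase, split on non-alphabetic
  tokens <- tokens[nchar(tokens) > 0]                   # drop empty strings left by the split
  keep <- tokens %in% dictionary &
    !(tokens %in% offensive) &
    tokens %in% frequent_words
  tokens[!keep] <- "[OTHER]"                            # everything else becomes [OTHER]
  tokens
}
```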
- Training data constructed as follows (sketched below):
- Random 10% of documents in corpus selected.
- Predictors: 4 most recent words. If fewer than 4 words available, missing words replaced with special [BLANK] token.
- Response: next word in document. All examples with [OTHER] token as response removed.
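A sketch of how the 4-gram training examples could be assembled from a tokenised document; make_examples and the column names w1 to w4 are illustrative, not the project's actual code.

```r
# Illustrative construction of training examples: 4 preceding words + response.
make_examples <- function(tokens, context = 4) {
  padded <- c(rep("[BLANK]", context), tokens)          # pad the start with [BLANK]
  rows <- t(sapply(seq_along(tokens), function(i) {
    c(padded[i:(i + context - 1)], tokens[i])           # 4 preceding words, then response
  }))
  df <- as.data.frame(rows, stringsAsFactors = FALSE)
  names(df) <- c("w1", "w2", "w3", "w4", "response")
  df[df$response != "[OTHER]", ]                        # drop examples with [OTHER] responses
}
```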
Model Design - Model Fitting
- Model Structure:
- Each word in vocabulary represented as a 32-dimensional vector.
- Model is a multinomial logistic regression whose inputs are:
- The vector for each of the four preceding words, plus
- The mean vector of the four preceding words (4 × 32 + 32 = 160 predictors total).
- Output is the predicted probability of each word in the vocabulary (input construction sketched below).
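How the 160 predictors could be assembled for a single example, assuming embeddings is a (vocabulary size) × 32 matrix of word vectors; this is a sketch, not the original implementation.

```r
# Build the 160-dimensional input for one example.
# context_ids: integer indices of the 4 preceding words; embeddings: vocab x 32 matrix.
make_input <- function(context_ids, embeddings) {
  vecs <- embeddings[context_ids, , drop = FALSE]  # 4 x 32 matrix of word vectors
  c(t(vecs), colMeans(vecs))                       # 4 * 32 + 32 = 160 predictors
}
```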
- Model Fitting:
- Fitting estimates two sets of parameters:
- the vector representation of every word in the vocabulary, and
- the weights of the multinomial regression.
- Vector representation determined by a word2vec model.
- Multinomial regression fit via stochastic gradient descent using TensorFlow in R (sketched below).
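A minimal sketch of the multinomial regression fit using the TF1-style API of the tensorflow R package; the variable names, learning rate, and batch handling are assumptions rather than the project's actual code.

```r
library(tensorflow)

n_in    <- 160L     # 4 word vectors plus their mean
n_vocab <- 33075L   # vocabulary size

# Placeholders for a minibatch of predictor vectors and response word indices
x <- tf$placeholder(tf$float32, shape(NULL, n_in))
y <- tf$placeholder(tf$int32,   shape(NULL))

# Multinomial (softmax) regression weights and biases
W <- tf$Variable(tf$zeros(shape(n_in, n_vocab)))
b <- tf$Variable(tf$zeros(shape(n_vocab)))
logits <- tf$matmul(x, W) + b

loss <- tf$reduce_mean(
  tf$nn$sparse_softmax_cross_entropy_with_logits(labels = y, logits = logits))
train_step <- tf$train$GradientDescentOptimizer(0.5)$minimize(loss)

sess <- tf$Session()
sess$run(tf$global_variables_initializer())
# For each minibatch (batch_x, batch_y), one SGD step:
#   sess$run(train_step, feed_dict = dict(x = batch_x, y = batch_y))
```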
Application
- Model scores are updated in real time as text is input.
- App displays the top five predictions in order, each with a confidence weight (reactive wiring sketched below).
- Average latency is < 1 second.
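A rough sketch of the app's reactive wiring in Shiny; predict_next() stands in for the fitted model's scoring function and is a hypothetical name, not the actual code.

```r
library(shiny)

ui <- fluidPage(
  textInput("text", "Type here:"),
  tableOutput("predictions")
)

server <- function(input, output) {
  # Re-scores the model whenever the text input changes
  output$predictions <- renderTable({
    predict_next(input$text, n = 5)   # top five words with confidence weights
  })
}

shinyApp(ui, server)
```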