Predicting next word:

investment pitch to raise $50k

December 14, 2014




Code:
https://github.com/sbushmanov/R_DataScientist_Capstone
App:
https://sbushmanov.shinyapps.io/R_Shiny

Description of the prediction algorithm

  • The predicted next word is the most probable word given the training set
  • “Most probable” is defined as “most probable word given its history”
  • “Given its history” is greatly simplified by the Markov assumption: only the few preceding words are used instead of the whole history

  • Bottom line: n-grams – fixed sequences of n words appearing one after the other – are used to predict the next word
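
As a minimal sketch of the idea above (in Python, not the project's own R code), the snippet below counts trigrams in a toy corpus and returns the most frequent words that followed the last two words typed. The names `train_trigrams` and `predict_next`, the toy corpus, and the plain frequency ranking are illustrative assumptions; the actual model may rank candidates differently.

```python
from collections import Counter, defaultdict


def train_trigrams(sentences):
    """Count which word follows each pair of preceding words."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            followers[(w1, w2)][w3] += 1
    return followers


def predict_next(followers, text, k=3):
    """Return up to k most frequent continuations of the last two words."""
    history = tuple(text.lower().split()[-2:])
    # An unseen (or too short) history simply yields no candidates.
    candidates = followers.get(history, Counter())
    return [word for word, _ in candidates.most_common(k)]


# Toy corpus, assumed for illustration only.
corpus = [
    "i want to go home",
    "i want to sleep",
    "i want to go out",
    "you want to go home",
]
model = train_trigrams(corpus)
print(predict_next(model, "they want to"))
```

Only the last two words of the input matter here, which is exactly the simplification the Markov assumption buys: the history "they want to" is reduced to "want to".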

Model building

  • Step 1: Text normalization
    • Converting to ASCII
    • Garbage cleaning: dropping smileys, non-Latin letters, odd character sequences, etc.
    • Dropping extra white spaces and lowercasing
  • Step 2: Fixing vocabulary
    • Keeping only words that appear at least twice; singletons are replaced with <UNK>. The resulting vocabulary of circa 30k words covers 98% of unigrams.
  • Step 3: Tokenization
    • Delimiting sentences with <s> and </s> tags and tokenizing text
  • Step 4: N-gram extraction
    • Breaking tokens into uni-, bi- and tri-grams (every sentence separately)
  • Step 5: Summarizing n-gram frequencies
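
The five steps above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the project's R implementation: all function names are hypothetical, the input is assumed to be already split into one sentence per string, and the cleaning rule (keep only ASCII letters and apostrophes) is a guess at what "garbage cleaning" covers.

```python
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Step 1: convert to ASCII, drop non-letter debris, squeeze spaces, lowercase."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    letters_only = re.sub(r"[^A-Za-z' ]+", " ", ascii_text)
    return re.sub(r"\s+", " ", letters_only).strip().lower()


def build_vocabulary(sentences, min_count=2):
    """Step 2: keep words seen at least min_count times."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word for word, c in counts.items() if c >= min_count}


def tokenize(sentence, vocabulary):
    """Step 3: wrap the sentence in <s> ... </s> and map rare words to <UNK>."""
    words = [w if w in vocabulary else "<UNK>" for w in sentence.split()]
    return ["<s>"] + words + ["</s>"]


def ngrams(tokens, n):
    """Step 4: all runs of n consecutive tokens within one sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(raw_sentences, max_n=3):
    """Step 5: frequency tables for uni-, bi- and tri-grams."""
    clean = [normalize(s) for s in raw_sentences]
    vocabulary = build_vocabulary(clean)
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in clean:
        tokens = tokenize(sentence, vocabulary)
        for n in tables:
            tables[n].update(ngrams(tokens, n))
    return tables


# Toy input, assumed for illustration only.
tables = count_ngrams(["The cat sat :-)", "the CAT   ran!", "A dog sat"])
print(tables[2].most_common(3))
```

Because n-grams are built per sentence, no n-gram spans a sentence boundary; the `<s>` and `</s>` tags let the model learn which words tend to start and end sentences.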

Description of the app

[Image: screenshot of the app]

  • The app is hosted at https://sbushmanov.shinyapps.io/R_Shiny/
  • There are two steps in using the app:
    • You: Type into the text box
    • Model: Shows the top 3 continuations once you pause typing
  • The app has two panes:
    • DEMO: the app itself
    • FAQ: a short description of the app and instructions on how to deploy it on your own site

Use of proceeds

The $50k raised will be used to improve the app:

  • accuracy:
    • implementing higher order n-grams
    • using more sophisticated text normalization algorithms
    • implementing more sophisticated smoothing and interpolation when choosing the best prediction candidates (e.g. Kneser-Ney or Good-Turing)
  • performance (speed and size):
    • rewriting the model in C++
    • representing strings as integers (3x size reduction)
    • storing n-grams in hash tables
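
One way the size-related items above might look, sketched in Python rather than the planned C++: each distinct word gets a small integer ID, and n-gram counts are stored in a hash table (Python's built-in `dict`) keyed by tuples of IDs. `WordIndex` and `encode_counts` are hypothetical names, and the sketch does not attempt to reproduce the 3x size reduction quoted above.

```python
class WordIndex:
    """Bidirectional mapping between words and small integer IDs."""

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = []

    def encode(self, word):
        """Return the ID for word, assigning the next free ID on first sight."""
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.id_to_word)
            self.id_to_word.append(word)
        return self.word_to_id[word]

    def decode(self, word_id):
        """Return the word stored under word_id."""
        return self.id_to_word[word_id]


def encode_counts(ngram_counts, index):
    """Re-key a {tuple of words: count} table by tuples of integer IDs."""
    return {
        tuple(index.encode(w) for w in gram): count
        for gram, count in ngram_counts.items()
    }


# Toy counts, assumed for illustration only.
index = WordIndex()
counts = {("want", "to", "go"): 3, ("want", "to", "sleep"): 1}
encoded = encode_counts(counts, index)
print(encoded)
```

Each word string is stored once in the index no matter how many n-grams contain it, which is where the savings would come from as the tables grow.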