Predicting next word:

investment pitch to raise $50k

December 14, 2014

The predicted next word is the most probable word according to the training set
“Most probable” is defined as “most probable word given its history”, i.e. the words that precede it
“Given its history” is greatly simplified by the Markov assumption: only the few preceding words are used instead of the whole history
Bottom line: n-grams (fixed sequences of n words appearing one after the other) are used to predict the next word
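
Written out, the simplification above looks roughly as follows. The first line is the Markov assumption for an n-gram model; the second is the usual count-based estimate for trigrams, where C(...) is the number of times a sequence occurs in the training set. The notation is introduced here for illustration, and the plain count ratio is an assumption: the pitch does not state exactly how the probabilities are estimated.

```latex
\[
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
\]
\[
P(w_i \mid w_{i-2}, w_{i-1}) \approx \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
\]
```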

Step 1: Text normalization
- Converting to ASCII
- Garbage cleaning: dropping emoticons, non-Latin letters, nonsense character sequences, etc.
- Dropping extra white spaces and lowercasing
Step 2: Fixing vocabulary
- Restricting the vocabulary to words that appear at least twice. Singletons are replaced with <UNK> (resulting unigram coverage is 98% with a vocabulary of circa 30k words).
Step 3: Tokenization
- Delimiting sentences with <s> and </s> tags and tokenizing text
Step 4: N-Grammification
- Breaking tokens into uni-, bi- and tri-grams (every sentence separately)
Step 5: Summarizing n-gram frequencies
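
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual code: the function names, the regular expression used for cleaning, and the assumption that the input is already split into sentences are all hypothetical.

```python
import re
import unicodedata
from collections import Counter

def normalize(text):
    """Step 1: convert to ASCII, drop garbage, lowercase, squeeze spaces."""
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Assumption: keep only letters, apostrophes and spaces.
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def build_vocab(sentences, min_count=2):
    """Step 2: keep words that appear at least min_count times."""
    counts = Counter(w for s in sentences for w in s.split())
    return {w for w, c in counts.items() if c >= min_count}

def tokenize(sentence, vocab):
    """Step 3: replace rare words with <UNK> and add sentence tags."""
    words = [w if w in vocab else "<UNK>" for w in sentence.split()]
    return ["<s>"] + words + ["</s>"]

def ngrams(tokens, n):
    """Step 4: all runs of n consecutive tokens within one sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(raw_sentences, max_n=3):
    """Step 5: frequency tables for uni-, bi- and tri-grams."""
    clean = [normalize(s) for s in raw_sentences]
    vocab = build_vocab(clean)
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for s in clean:
        tokens = tokenize(s, vocab)
        for n in tables:
            tables[n].update(ngrams(tokens, n))
    return tables

# Toy corpus (hypothetical), one sentence per string.
tables = count_ngrams(["I love New York :)", "I love new shoes!", "We love York."])
```

Here `tables[3]` holds the trigram counts; words seen only once in the toy corpus ("shoes", "we") are counted as `<UNK>`.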

App-Image

The app is hosted at https://sbushmanov.shinyapps.io/R_Shiny/
There are two steps in using the app:
- You: Type into the text box
- Model: Shows the top 3 continuations when you pause typing for a moment
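
One way the top-3 lookup could work is sketched below. It is an assumption, not the app's actual logic: the counts are toy data, the function name is hypothetical, and falling back from trigrams to bigrams when the longer context is unseen is a simple strategy chosen for illustration.

```python
from collections import Counter

# Toy n-gram counts (hypothetical data, for illustration only).
trigram_counts = Counter({
    ("i", "love", "new"): 2,
    ("i", "love", "you"): 5,
    ("i", "love", "it"): 3,
    ("i", "love", "that"): 1,
})
bigram_counts = Counter({
    ("love", "you"): 7,
    ("love", "it"): 4,
    ("york", "city"): 6,
})

def top_continuations(text, k=3):
    """Return up to k most frequent next words for the typed text."""
    words = text.lower().split()
    # Try the longest context first, then fall back to a shorter one.
    for table, n in ((trigram_counts, 3), (bigram_counts, 2)):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough typed words for this n-gram order
        candidates = Counter(
            {gram[-1]: count for gram, count in table.items() if gram[:-1] == context}
        )
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []

print(top_continuations("I love"))
print(top_continuations("New York"))
```

The first call finds matches at the trigram level; the second has no matching trigram and falls back to the bigram table.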
The app has two panes:
- DEMO: the app itself
- FAQ: a short description of the app and instructions on how to deploy it on your own site

The $50k raised will be used to improve the app:

accuracy:
- implementing higher order n-grams
- using more sophisticated text normalization algorithms
- implementing more sophisticated interpolation and smoothing when choosing the best prediction candidates (e.g. Kneser-Ney or Good-Turing)
performance (speed and size):
- recoding the model in C++
- representing strings as integers (3x size reduction)
- storing n-grams in hash tables
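
The integer and hash table ideas above can be illustrated with a short Python sketch. The names are hypothetical, and Python is used only for illustration (the planned rewrite targets C++): each word gets a small integer id, and n-grams are stored in a dict, which is a hash table, keyed by tuples of ids instead of strings.

```python
def build_index(words):
    """Assign each distinct word an integer id, in first-seen order."""
    word_to_id = {}
    for w in words:
        word_to_id.setdefault(w, len(word_to_id))
    id_to_word = list(word_to_id)  # position i holds the word with id i
    return word_to_id, id_to_word

tokens = ["<s>", "i", "love", "new", "york", "</s>"]
word_to_id, id_to_word = build_index(tokens)

# Count trigrams keyed by integer tuples; the dict gives constant-time
# average lookup for a given context.
encoded = {}
for i in range(len(tokens) - 2):
    key = tuple(word_to_id[w] for w in tokens[i:i + 3])
    encoded[key] = encoded.get(key, 0) + 1

def decode(key):
    """Map a tuple of ids back to the original words."""
    return [id_to_word[i] for i in key]
```

Each word string is stored once in `id_to_word`; every n-gram that contains it then holds only the integer id.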