Gabor Simon
2018-03-25
Typing letters one by ___ takes a lot of time, so let's speed things __ a bit! As you see, quite a lot __ our words can be figured ___ from the context, so an intelligent Input Method could do that for us __ well.
The top 3 categories that are easy to guess:
The first strategy is based on an N-gram distribution: from the last N words we predict which words might follow them, and with what probability.
The secondary guess uses simple co-occurrence statistics: which word pairs occur frequently in the same sentence (regardless of position).
If we still don't know enough to make a guess, we fall back to suggesting the most common words.
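The three-level fallback described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the app: the class name `BackoffPredictor`, the whitespace tokeniser, and the use of raw counts for ranking are all hypothetical simplifications.

```python
from collections import Counter, defaultdict


def tokenize(sentence):
    # Assumption: lowercase + whitespace split stands in for real normalisation.
    return sentence.lower().split()


class BackoffPredictor:
    """Hypothetical sketch: N-gram first, then co-occurrence, then common words."""

    def __init__(self, sentences, n=2):
        self.n = n
        self.ngram_next = defaultdict(Counter)  # context tuple -> next-word counts
        self.cooccur = defaultdict(Counter)     # word -> words sharing a sentence
        self.unigrams = Counter()               # overall word frequencies
        for sentence in sentences:
            words = tokenize(sentence)
            self.unigrams.update(words)
            for i, word in enumerate(words):
                # Record the word after every context of length 1..n.
                for length in range(1, n + 1):
                    if i - length >= 0:
                        self.ngram_next[tuple(words[i - length:i])][word] += 1
            unique = set(words)
            for a in unique:
                for b in unique:
                    if a != b:
                        self.cooccur[a][b] += 1  # position is ignored

    def predict(self, text, k=4):
        words = tokenize(text)
        # 1. N-gram: try the longest known context first.
        for length in range(min(self.n, len(words)), 0, -1):
            context = tuple(words[-length:])
            if context in self.ngram_next:
                return [w for w, _ in self.ngram_next[context].most_common(k)]
        # 2. Co-occurrence: words often seen in a sentence with the typed ones.
        scores = Counter()
        for word in set(words):
            scores.update(self.cooccur.get(word, {}))
        for word in words:
            scores.pop(word, None)  # do not suggest what was already typed
        if scores:
            return [w for w, _ in scores.most_common(k)]
        # 3. Fallback: the most common words overall.
        return [w for w, _ in self.unigrams.most_common(k)]


corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
]
predictor = BackoffPredictor(corpus, n=2)
```

With this toy corpus, `predictor.predict("the cat")` is answered by the N-gram level, `predictor.predict("fish zzz")` falls through to co-occurrence because neither context has a recorded follower, and `predictor.predict("")` ends up at the most-common-words fallback.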
To avoid favouring frequent but general words, our metrics aren't plain probabilities but Bayesian likelihood ratios \( \frac{P(B|A)}{P(B|\bar{A})} \), so we measure how much one word affects the chance of the other, be it positively or negatively.
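To make the ratio concrete, here is a small sketch that estimates \( P(B|A) / P(B|\bar{A}) \) from sentence-level counts. The function name `influence_ratio` and the add-one smoothing (used only to avoid division by zero) are assumptions made for this example, not details taken from the project.

```python
def influence_ratio(sentences, a, b, smoothing=1.0):
    """Estimate P(b | a) / P(b | not a) over sentences (hypothetical helper)."""
    n_a = n_not_a = 0        # sentences with / without word a
    n_ab = n_b_not_a = 0     # ... that also contain word b
    for sentence in sentences:
        words = set(sentence.lower().split())
        if a in words:
            n_a += 1
            n_ab += b in words
        else:
            n_not_a += 1
            n_b_not_a += b in words
    # Assumption: add-one style smoothing keeps both estimates non-zero.
    p_b_given_a = (n_ab + smoothing) / (n_a + 2 * smoothing)
    p_b_given_not_a = (n_b_not_a + smoothing) / (n_not_a + 2 * smoothing)
    return p_b_given_a / p_b_given_not_a


sentences = [
    "new york is big",
    "new york is busy",
    "the city is big",
    "the town is small",
]
ratio_york = influence_ratio(sentences, "new", "york")  # above 1: "new" attracts "york"
ratio_is = influence_ratio(sentences, "new", "is")      # equal to 1: no influence
ratio_the = influence_ratio(sentences, "new", "the")    # below 1: "new" repels "the"
```

In this toy data "is" appears in every sentence, so a plain probability would rank it highly, while its ratio of 1 shows that "new" tells us nothing about it.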
Two main parts: suggested words (4) and the input field.
Two operating modes:
The user may type into the input field, and the predictions are re-calculated after each letter. Choosing a prediction completes the partially entered word, and a new word is started after a space or a punctuation mark.
The input field can be edited or pasted into, but the prediction must be started manually with the Generate button. (Mostly useful for entering test sentences.)
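The first operating mode (re-predicting after each letter and completing the partial word) could look roughly like the sketch below. The helper names `current_prefix`, `complete` and `apply_choice`, the punctuation set, and the trailing space added after a chosen word are assumptions for illustration; the actual app is written in R+Shiny and may behave differently.

```python
import re

# Assumption: these characters (plus whitespace) end a word.
WORD_BREAK = re.compile(r"[\s.,;:!?]+")


def current_prefix(text):
    # The partially typed word is whatever follows the last break;
    # it is empty right after a space or a punctuation mark.
    return WORD_BREAK.split(text)[-1]


def complete(text, ranked_candidates, k=4):
    """Keep the best k candidates that start with the partially typed word."""
    prefix = current_prefix(text).lower()
    return [w for w in ranked_candidates if w.startswith(prefix)][:k]


def apply_choice(text, word):
    """Replace the partial word with the chosen prediction."""
    prefix = current_prefix(text)
    return text[:len(text) - len(prefix)] + word + " "


# Hypothetical ranked predictions, best first.
candidates = ["the", "cheese", "chips", "and", "chocolate", "cherries", "chat"]
after_letters = complete("I like ch", candidates)
after_space = complete("I like ", candidates)
chosen = apply_choice("I like ch", "cheese")
```

Here `after_letters` keeps only the candidates beginning with "ch", `after_space` shows the top four unfiltered because a new word has just started, and `chosen` is the input text with "ch" replaced by the selected word.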
This amount of data required quite a toolkit: the text normalisation was done in Python, the intermediate storage is an SQLite DB, the mass object counting is written in C (for performance), and the demonstration app is written in R+Shiny.