KILOS™ Predictive Text

Tom Lous
2017-01-01

Unlocking the power of predictive text

KILOS™ Predictive Text unlocks the power of predictive text at your fingertips! At least… It unlocks a nice effort and a so-so experience, but it holds a lot of promise, sort of.. a bit.. well at least I tried.

Anyway: KILOS™ consists of 3 parts:

  • Model
  • Predictive Algorithm
  • Shiny Application

Model

We've sampled 15% of supplied twitter, news & blog data, tokenized them all, resulting in >1.4M English tokenized sentences.

These were used as input for the n-gram tokenizer. For accuracy we've generated all 1-grams until 7-grams (as far as possible) of these sentences and generated frequency tables for each unique gram.

We stored these using a lookup part (first n-1 tokens), the so called datum and the suggest part (the n-th gram)

For speed reasons we only kept tokens with frequency > 1 and only the unique lookups, so only 1 suggestion for each lookup combination exists.

The resulting n-grams:

1-gram 2-gram 3-gram 4-gram 5-gram 6-gram 7-gram
# 1 45577 261794 229345 101001 41964 23201

Predictive Algorithm

Queries submitted are preprocessed and tokenized resulting in any number of grams.

Starting from the last 7 (or less) grams of the query the Stupid Backoff algorithm is recursivly applied

\[ S ( w_i|w_{i-1}^{i-k+1} ) = \left\{ \begin{array}{ll} \frac{f(w_{i-k+1}^i )}{f(w_{i-k+1}^{i-1} )} \qquad \, \textrm{if} \, f(w_{i-k+1}^i ) >0\\ \alpha S(w_{i-k+1}^i ) \quad \textrm{otherwise} \end{array} \right. \]

We use \( S \) to denote that these are scores, not probabilities.

\( f(·) \) are the pre-computed and stored relative frequencies and \( \alpha \) is the fixed back-off value, heuristicly initialized at \( 0.4 \)

We aggregate all matching grams and sort them by their scores descending.

Shiny Application

Finally the condensed lookuptables and prediction algorithm are stored and made available in a shiny app.

The app uses twitter's typeahead plugin to really experience the autocomplete & suggestion possibilities of the combined model & algorithm.

While you are typing the algorithm tries to partially lookup the string aleady and makes suggestions as you type. The combination of the reaktive nature and speed of the condensed algorithm & model truely highten the user experience.

Check it out yourself at: KILOS™ Predictive Text