Shiny Word Prediction

Philipp B.
2021-06-05

The Mission

This Shiny application provides a handy, lean method for inline word prediction. Its simple design keeps it easy to use, while the prediction algorithm under the hood offers:

fast word prediction
responsive interaction with the user and inline suggestion of the next word
a short list of alternative predictions, each with a score in arbitrary units

The App

The user interface consists of a simple text input field, from which the algorithm predicts a selection of probable next words. The predicted words are ranked by score and shown in a plot below the input, while the highest-ranked word is suggested inline in the input box.

If the letters the user types next match the inline prediction, they overwrite it. Otherwise the predicted word is shifted to the right and is eventually replaced by a new prediction.
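The inline behaviour can be sketched in a few lines. This is only an illustration in Python, not the app's actual Shiny code. The function name `inline_ghost` and the case-insensitive comparison are assumptions.

```python
def inline_ghost(prediction: str, typed: str) -> tuple[str, bool]:
    """Return the text still shown inline and whether the typing matched.

    prediction: the currently suggested next word
    typed:      the letters typed so far for the word in progress
    (Hypothetical helper; case-insensitive matching is an assumption.)
    """
    if prediction.lower().startswith(typed.lower()):
        # Typed letters overwrite the start of the suggestion,
        # so only the remaining suffix is displayed.
        return prediction[len(typed):], True
    # No match: the full suggestion stays visible, shifted right of the
    # typed letters, until a new prediction replaces it.
    return prediction, False
```

For example, with the suggestion "the", typing "th" leaves only "e" shown inline, while typing "x" keeps the whole word "the" displayed to the right of the typed letter.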

Under the Hood

Development focused on a responsive user interface and a very fast prediction algorithm that still maintains reasonable prediction accuracy.

The algorithm uses a hierarchical decision tree with three main levels. First, a combination of 2-, 3- and 4-grams is used for prediction. If no match is found, the second level uses a combination of skip-grams and n-grams (without stopwords). Finally, the word probability of the top 50 words of the training data is taken into account.
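The three levels could look roughly like the following sketch, written in Python for illustration rather than taken from the app. Several details are assumptions, because the source does not state them: the table layout (context tuple mapped to candidate scores), the weighting of longer contexts, the single merged table for level two, the tiny stopword list, and the use of level three purely as a fallback.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "to", "and", "in"}  # tiny illustrative list

def lookup(table, words, max_context=3):
    """Combine candidate scores over contexts of 3, 2 and 1 preceding words
    (i.e. 4-, 3- and 2-grams). Weighting by context length is an assumption."""
    scores = Counter()
    for n in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-n:])
        for word, score in table.get(context, {}).items():
            scores[word] += n * score
    return scores

def predict(words, ngrams, skipgrams, top_words, k=3):
    # Level 1: plain 2-, 3- and 4-grams.
    scores = lookup(ngrams, words)
    if not scores:
        # Level 2: contexts with stopwords removed (skip-gram-like lookup).
        content = [w for w in words if w not in STOPWORDS]
        scores = lookup(skipgrams, content)
    if not scores:
        # Level 3: fall back to the most frequent words overall.
        scores = Counter(top_words)
    return scores.most_common(k)
```

Because each level is a dictionary lookup, a prediction costs only a handful of hash accesses, which fits the stated goal of a fast, responsive algorithm.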

Limitations

One limitation of this implementation is that it uses an arbitrary score rather than actual probabilities. The score therefore only indicates relative significance within a single prediction result: a score of 0.5 in one result table cannot be compared with a score of 0.5 in another.

Since the focus was on responsiveness and fast loading at startup, the model was reduced in size by removing some information and shrinking the n-gram models. This made the model faster and the file smaller, but it also slightly decreased accuracy.
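The source does not say how the n-gram models were shrunk. One common approach, shown here only as an assumed illustration in Python, is to drop rare continuations and keep just the few highest-scoring next words per context. The function name `prune` and the thresholds `min_count` and `keep` are hypothetical.

```python
def prune(table, min_count=2, keep=3):
    """Return a smaller copy of an n-gram table.

    table maps a context tuple to {next_word: count}. Continuations seen
    fewer than min_count times are dropped, and at most `keep` of the
    remaining ones are retained per context (illustrative thresholds).
    """
    pruned = {}
    for context, followers in table.items():
        kept = sorted(
            ((w, c) for w, c in followers.items() if c >= min_count),
            key=lambda wc: (-wc[1], wc[0]),  # highest count first, ties A-Z
        )[:keep]
        if kept:  # contexts left with no continuation are removed entirely
            pruned[context] = dict(kept)
    return pruned
```

The trade-off matches the one described above: rare contexts and low-ranked alternatives disappear, so the table loads faster but can no longer predict those cases.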