Quick Key Word PredictR Tool

Matt Cliff
10 March 2018

Quick Key Next Word PredictR Tool

The Quick Key Next Word PredictR tool helps users typing sentences or other natural-language text by suggesting the most likely next word, with single-key auto-completion.

Background

Improving a user's ability to type accurately and quickly in natural-language settings such as email, chat, or social media makes those interactions more efficient.

In our application, once the user completes a word by hitting space, the tool presents the most likely next words. The user can then select a button to auto-populate one of the top three words, view the top 10 scores, or view a word cloud.

Milestone Report

Word Prediction (Back-Off) Approach

The tool is trained on text from three sources (news feeds, Twitter, and blogs) provided here. This body of text (the corpus) is first cleaned, and then frequency tables of the n-grams¹ are generated.

This tool uses the Stupid Back-Off² algorithm, which attempts to match the longest n-gram sequence and scores it as the count of that sequence divided by the count of its (n-1)-length prefix. Each time we step back to a shorter sequence, the score is multiplied by a factor \( \alpha \), which reduces the impact of shorter matches.
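As a worked illustration with made-up counts (these numbers are hypothetical, not taken from the corpus), suppose the user has typed "in the", the trigram "in the fridge" was never observed, the bigram "the fridge" occurs 3 times, and "the" occurs 100 times. Assuming \( \alpha = 0.4 \), the value used in the original Stupid Back-Off paper, one step back gives:

\[ S(\text{fridge} \mid \text{in the}) = \alpha \cdot \frac{C(\text{the fridge})}{C(\text{the})} = 0.4 \cdot \frac{3}{100} = 0.012 \]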

¹ An n-gram is a sequence of n tokens or words.
² Back-off approach: see section 4.5.1 here.

Details of the Algorithm

Score each n-gram \( w_1^n := w_1 w_2 \cdots w_n \) as follows

\[ S\left(w_n \middle\vert w_1^{n-1} \right) := S\left(w_n \middle\vert w_1, \cdots, w_{n-1}\right) = \begin{cases} \frac{C(w_1^n)}{C(w_1^{n-1})},& \text{if } C(w_1^n) > 0\\ \alpha\, S\left(w_n \middle\vert w_2^{n-1}\right),& \text{otherwise} \end{cases} \]
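The recursion can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the function name, the `counts` dictionary layout, and the toy counts are all assumptions, and the unigram base case (relative frequency of the word) follows the usual Stupid Back-Off formulation rather than anything stated here.

```python
def stupid_backoff_score(word, prefix, counts, total_tokens, alpha=0.4):
    """Score `word` after `prefix` (a tuple of tokens).

    `counts` maps token tuples of every order to corpus frequencies
    (hypothetical layout). Assumes every stored n-gram also has its
    prefix stored, so the division below never hits a missing key.
    """
    weight = 1.0
    prefix = tuple(prefix)
    while prefix:
        ngram = prefix + (word,)
        if counts.get(ngram, 0) > 0:
            # Longest match found: count of n-gram over count of its prefix.
            return weight * counts[ngram] / counts[prefix]
        prefix = prefix[1:]   # drop the oldest word and back off
        weight *= alpha       # each back-off step costs a factor alpha
    # Assumed base case: unigram relative frequency.
    return weight * counts.get((word,), 0) / total_tokens


# Toy counts, invented for illustration.
counts = {
    ("in",): 50, ("the",): 100, ("fridge",): 5,
    ("in", "the"): 20, ("the", "fridge"): 3,
}
total_tokens = 1000

direct = stupid_backoff_score("the", ("in",), counts, total_tokens)
backed_off = stupid_backoff_score("fridge", ("in", "the"), counts, total_tokens)
```

Here `direct` uses the bigram "in the" as is, while `backed_off` finds no trigram "in the fridge" and falls back to the bigram "the fridge", scaled once by `alpha`.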

The n-grams are generated from the sample corpus by removing non-US-keyboard characters, punctuation, and numbers, then tokenizing, then filtering out profane and non-dictionary words. Only n-grams with a frequency of 2 or more are kept (for \( n>2 \)).

The score \( S(w_n \vert w_1^{n-1}) \) is pre-calculated for every n-gram, and the tables are stored with the prefix \( w_1^{n-1} \) and the tail \( w_n \) in separate columns.

When invoked, the tool cleans the input to generate a prefix \( w_1^{n-1} \), builds a pool of candidate next words \( w_n \) by taking the 100 most likely tails across all matching prefixes, and scores each of them. The results are displayed with the highest score on top.
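The cleaning and counting steps might look roughly like the following Python sketch. It is not the tool's pipeline: the tiny `PROFANE` and `DICTIONARY` sets, the function names, and the lowercasing step are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the real profanity and dictionary word lists.
PROFANE = {"darn"}
DICTIONARY = {"put", "it", "in", "the", "fridge", "box", "darn"}


def clean_tokens(text):
    """Lowercase (an assumption), strip everything but letters, then filter words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [t for t in text.split() if t in DICTIONARY and t not in PROFANE]


def ngram_table(lines, n, min_count=1):
    """Count n-grams over cleaned lines, keeping those seen at least min_count times."""
    counts = Counter()
    for line in lines:
        tokens = clean_tokens(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


lines = ["Put it in the fridge!", "Put it in the box.", "Put 2 eggs in the fridge"]

# Prune only the higher orders (n > 2), as described above.
tables = {n: ngram_table(lines, n, min_count=2 if n > 2 else 1) for n in (1, 2, 3)}
```

Note that dropping a filtered word joins its neighbours into a new n-gram (here "put in the" from the third line); whether the real pipeline does the same is not stated.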
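A minimal Python sketch of this lookup, under stated assumptions: the pre-scored tables are modelled as a dictionary from prefix tuple to a list of `(tail, score)` pairs sorted by descending score, the empty tuple holds unigram scores, and all names and numbers are invented for illustration.

```python
def predict(text, tables, alpha=0.4, pool_size=100, top=3, max_order=3):
    """Return up to `top` tuples of (word, score, n-gram order), best first."""
    tokens = text.lower().split()
    prefix = tuple(tokens[-(max_order - 1):]) if max_order > 1 else ()
    scores = {}
    weight = 1.0
    while True:
        # Pool the best tails for this prefix; a word keeps the score from
        # the longest n-gram that matched it, so setdefault never overwrites.
        for tail, score in tables.get(prefix, [])[:pool_size]:
            scores.setdefault(tail, (weight * score, len(prefix) + 1))
        if not prefix:
            break
        prefix = prefix[1:]   # back off to a shorter prefix
        weight *= alpha
    ranked = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(word, s, order) for word, (s, order) in ranked[:top]]


# Toy pre-scored tables (hypothetical values).
tables = {
    ("in", "the"): [("fridge", 0.5), ("box", 0.25)],
    ("the",): [("fridge", 0.3), ("end", 0.2), ("box", 0.1)],
    (): [("the", 0.05), ("end", 0.01)],
}

suggestions = predict("put it in the ", tables)
```

With these toy tables the top three suggestions are "fridge" and "box" from the trigram table, followed by "end" from the bigram table at a score reduced by one factor of `alpha`.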

How to Get Started

To use the tool, enter text into the Text Area; typing a space triggers the algorithm.

put it in the fridge

The \( \alpha \) slider controls the look-back weight. The table tab shows the 10 highest-scoring words, along with each score and the n-gram table it came from. The Word Cloud feature lets the user see a much larger set of likely next words.

In a production application, we would expect the top three buttons to be incorporated into the typing interface as 3 (or more) dynamic hotkeys.