Matt Cliff
10 March 2018
Criteria
The Quick Key Next Word PredictR tool helps users typing sentences or other natural-language text by offering the most likely next word with single-key auto-completion.
Improving a user's ability to accurately type more words per minute in natural-language interactions such as email, chat, or other social media increases the efficiency of those interactions.
In our application, once the user completes a word by hitting space, the tool presents the most likely next words; the user can then select a button to auto-populate one of the top three words, view the top 10 scores, or view a word cloud.
[Milestone Report](http://rpubs.com/mcliff/milestone1)
The tool is trained on a set of text from three sources: news feeds, Twitter, and blogs. This body of text (the corpus) is first cleaned, and frequency tables of the n-grams are generated.
This tool uses the Stupid Back-Off algorithm, which matches the longest n-gram sequence and calculates a score as the count of that sequence divided by the count of its (n-1)-length prefix. Each time we back off to a shorter length, the score is multiplied by a factor \( \alpha \) (0.4 by default), which reduces the impact of shorter matches.
The corpus first has all non-US-keyboard characters, punctuation, and numbers removed. It is then converted to lower case, matched against a dictionary, and stripped of profane words. Each text entry is turned into n-grams up to length \( 5 \). For the 1-gram table (the list of words) and the 2-gram table we keep all occurrences; for all other lengths we drop unique occurrences to manage memory. The tool was able to process \( 85 \)% of the raw data, resulting in approximately 75 MB of required storage. The tables store the prefix and tail in separate columns, along with the count and the score, which is the count divided by the frequency of the prefix in the (n-1)-gram table.
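A minimal sketch of this build step, assuming the quanteda and data.table packages and a hypothetical `profanity` word list (the released app's exact code may differ); the score column described above would be added afterward by joining each table against the (n-1)-gram table:

```r
library(quanteda)
library(data.table)

## Build the frequency table for one n-gram length (a sketch; 'profanity'
## is a hypothetical character vector of words to remove).
build_ngram_table <- function(texts, n, profanity = character(0)) {
  toks <- tokens(texts, remove_punct = TRUE,
                 remove_numbers = TRUE, remove_symbols = TRUE)
  toks <- tokens_tolower(toks)
  toks <- tokens_remove(toks, pattern = profanity)

  grams <- tokens_ngrams(toks, n = n, concatenator = " ")
  freq  <- colSums(dfm(grams))

  dt <- data.table(ngram = names(freq), count = as.integer(freq))
  if (n == 1) {
    dt[, `:=`(prefix = "", tail = ngram)]
  } else {
    # Split each n-gram into its (n-1)-word prefix and final word
    dt[, `:=`(prefix = sub(" \\S+$", "", ngram),
              tail   = sub("^.* ",   "", ngram))]
    # Drop unique occurrences for lengths above 2 to manage memory
    if (n > 2) dt <- dt[count > 1]
  }
  dt[]
}
```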
\[ S\left(w_i \mid w_{i-k+1}^{i-1}\right) = \begin{cases} \dfrac{C\left(w_{i-k+1}^{i}\right)}{C\left(w_{i-k+1}^{i-1}\right)}, & \text{if } C\left(w_{i-k+1}^{i}\right) > 0 \\ \alpha \, S\left(w_i \mid w_{i-k+2}^{i-1}\right), & \text{otherwise} \end{cases} \]
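A sketch of this back-off lookup, assuming `tables` is a list of the data.tables built above indexed by n-gram length, each with the precomputed prefix, tail, and score columns (the names here are illustrative, not the app's actual identifiers):

```r
library(data.table)

## Stupid Back-Off lookup over precomputed score tables (a sketch)
predict_next <- function(input_words, tables, alpha = 0.4, top = 3) {
  n_max <- min(length(tables), length(input_words) + 1)
  if (n_max < 2) {  # no usable prefix: fall back to the most frequent words
    return(head(tables[[1]][order(-count),
                            .(word = tail, score = count / sum(count))], top))
  }
  results <- data.table(word = character(0), score = numeric(0))
  weight <- 1
  for (n in seq(n_max, 2)) {
    key  <- paste(tail(input_words, n - 1), collapse = " ")
    hits <- tables[[n]][prefix == key, .(word = tail, score = weight * score)]
    results <- rbind(results, hits)
    weight <- weight * alpha   # each back-off step discounts by alpha
  }
  # Keep the best score per candidate word, highest first
  head(results[, .(score = max(score)), by = word][order(-score)], top)
}
```

Because the scores are computed once at build time, the lookup at typing time is just a keyed subset of each table, which keeps the response fast enough for real-time use.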
Cleaning: non-keyboard characters, punctuation, numbers, and profane words are stripped before the n-gram frequency tables (lengths 1 through 5) are built, as described above.
Candidate selection: the most recent words typed are matched against the longest available n-gram prefix, backing off to shorter prefixes when no match is found.
Scoring: each candidate's precomputed score is discounted by \( \alpha \) for every back-off step, and the best score per word is kept.
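Hypothetical usage of the `predict_next()` sketch above (the actual top words depend on the trained tables):

```r
## Score candidates for the text typed so far
words <- strsplit(tolower("thanks for the"), "\\s+")[[1]]
predict_next(words, tables)
## returns (for example) a 3-row table of candidate words and their scores
```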
The Word Cloud feature and the data table in the UI provide additional insight into the scores.
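One possible rendering of that view, assuming the wordcloud package (`candidates` is the table returned by the `predict_next()` sketch):

```r
library(wordcloud)

## Size each candidate word by its back-off score (a sketch)
draw_cloud <- function(candidates) {
  wordcloud(words  = candidates$word,
            freq   = candidates$score,
            min.freq = 0,            # scores are fractions, so keep them all
            scale  = c(4, 0.5),
            colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
```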
In a production application we would expect the top-three word buttons to be incorporated into the typing interface as three (or more) dynamic hotkeys.
The application is released as a Shiny app: once it loads, you can type any text into the text entry box, and the application updates the predicted next word in real time.
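A minimal sketch of such a Shiny interface, assuming `predict_next()` and the n-gram `tables` from the previous slides are loaded in the app session:

```r
library(shiny)

ui <- fluidPage(
  textInput("text", "Type your sentence:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    words <- strsplit(tolower(trimws(input$text)), "\\s+")[[1]]
    words <- words[nzchar(words)]          # drop empty tokens
    if (length(words) == 0) return("")
    top <- predict_next(words, tables)     # 'tables' loaded at app start
    paste("Predicted next words:", paste(top$word, collapse = ", "))
  })
}

shinyApp(ui = ui, server = server)
```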
This prototype application demonstrates the predictive capability; it can be embedded into any application, such as text messaging or email, to assist typing.