Matt Cliff
10 March 2018
Criteria
The Quick Key Next Word PredictR tool helps users typing sentences or other natural-language text by offering the most likely next word with single-key auto-completion.
Improving a user's ability to accurately type more words per minute in natural-language interactions such as email, chat, or other social media increases the efficiency of those interactions.
In our application, once the user completes a word by hitting space, the tool presents the most likely next words; the user can then select a button to auto-populate one of the top three words, view the top 10 scores, or view a word cloud.
[Milestone Report](http://rpubs.com/mcliff/milestone1)
The tool is trained on a set of text from three sources: news feeds, Twitter, and blogs. This body of text (the corpus) is first cleaned, and frequency tables of the n-grams are generated.
This tool uses the Stupid Back-Off algorithm, which matches the longest n-gram sequence and calculates a score as the count of that sequence divided by the count of its (n-1)-length prefix. Each time we back off to a shorter length, the score is multiplied by a factor \( \alpha \) (0.4 by default), which reduces the impact of shorter matches.
The corpus first has all non-US-keyboard characters, punctuation, and numbers removed. It is then converted to lower case, matched against a dictionary, and stripped of profane words. Each text entry is turned into n-grams up to length \( 5 \). For the 1-gram table (the list of words) and the 2-gram table we keep all occurrences; for all other lengths we drop unique occurrences to manage memory. The tool was able to process \( 85 \)% of the raw data, resulting in approximately 75 MB of required storage. The tables store the prefix and tail in separate columns, along with the count and the score, which is the count divided by the frequency of the prefix in the (n-1)-gram table.
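A minimal sketch of this build step, assuming the quanteda and data.table packages and a hypothetical `profanity` word list (the released app's exact code may differ); the score column described above would be added afterward by joining each table against the (n-1)-gram table:

```r
library(quanteda)
library(data.table)

## Build the frequency table for one n-gram length (a sketch; 'profanity'
## is a hypothetical character vector of words to remove).
build_ngram_table <- function(texts, n, profanity = character(0)) {
  toks <- tokens(texts, remove_punct = TRUE,
                 remove_numbers = TRUE, remove_symbols = TRUE)
  toks <- tokens_tolower(toks)
  toks <- tokens_remove(toks, pattern = profanity)

  grams <- tokens_ngrams(toks, n = n, concatenator = " ")
  freq  <- colSums(dfm(grams))

  dt <- data.table(ngram = names(freq), count = as.integer(freq))
  if (n == 1) {
    dt[, `:=`(prefix = "", tail = ngram)]
  } else {
    # Split each n-gram into its (n-1)-word prefix and final word
    dt[, `:=`(prefix = sub(" \\S+$", "", ngram),
              tail   = sub("^.* ",   "", ngram))]
    # Drop unique occurrences for lengths above 2 to manage memory
    if (n > 2) dt <- dt[count > 1]
  }
  dt[]
}
```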
\[ S\left(w_i \mid w_{i-k+1}^{i-1}\right) = \begin{cases} \dfrac{C\left(w_{i-k+1}^{i}\right)}{C\left(w_{i-k+1}^{i-1}\right)}, & \text{if } C\left(w_{i-k+1}^{i}\right) > 0 \\ \alpha \, S\left(w_i \mid w_{i-k+2}^{i-1}\right), & \text{otherwise} \end{cases} \]
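A sketch of this back-off lookup, assuming `tables` is a list of the data.tables built above indexed by n-gram length, each with the precomputed prefix, tail, and score columns (the names here are illustrative, not the app's actual identifiers):

```r
library(data.table)

## Stupid Back-Off lookup over precomputed score tables (a sketch)
predict_next <- function(input_words, tables, alpha = 0.4, top = 3) {
  n_max <- min(length(tables), length(input_words) + 1)
  if (n_max < 2) {  # no usable prefix: fall back to the most frequent words
    return(head(tables[[1]][order(-count),
                            .(word = tail, score = count / sum(count))], top))
  }
  results <- data.table(word = character(0), score = numeric(0))
  weight <- 1
  for (n in seq(n_max, 2)) {
    key  <- paste(tail(input_words, n - 1), collapse = " ")
    hits <- tables[[n]][prefix == key, .(word = tail, score = weight * score)]
    results <- rbind(results, hits)
    weight <- weight * alpha   # each back-off step discounts by alpha
  }
  # Keep the best score per candidate word, highest first
  head(results[, .(score = max(score)), by = word][order(-score)], top)
}
```

Because the scores are computed once at build time, the lookup at typing time is just a keyed subset of each table, which keeps the response fast enough for real-time use.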
Cleaning: non-keyboard characters, punctuation, numbers, and profane words are stripped before the n-gram frequency tables (lengths 1 through 5) are built, as described above.
Candidate selection: the most recent words typed are matched against the longest available n-gram prefix, backing off to shorter prefixes when no match is found.
Scoring: each candidate's precomputed score is discounted by \( \alpha \) for every back-off step, and the best score per word is kept.
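Hypothetical usage of the `predict_next()` sketch above (the actual top words depend on the trained tables):

```r
## Score candidates for the text typed so far
words <- strsplit(tolower("thanks for the"), "\\s+")[[1]]
predict_next(words, tables)
## returns (for example) a 3-row table of candidate words and their scores
```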
The Word Cloud feature and the data table in the UI provide additional insight into the scores.
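One possible rendering of that view, assuming the wordcloud package (`candidates` is the table returned by the `predict_next()` sketch):

```r
library(wordcloud)

## Size each candidate word by its back-off score (a sketch)
draw_cloud <- function(candidates) {
  wordcloud(words  = candidates$word,
            freq   = candidates$score,
            min.freq = 0,            # scores are fractions, so keep them all
            scale  = c(4, 0.5),
            colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
```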
In a production application we would expect the top-three word buttons to be incorporated into the typing interface as three (or more) dynamic hotkeys.
The application is released as a Shiny app: once it loads, you can type any text into the text entry box, and the application updates the predicted next word in real time.
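A minimal sketch of such a Shiny interface, assuming `predict_next()` and the n-gram `tables` from the previous slides are loaded in the app session:

```r
library(shiny)

ui <- fluidPage(
  textInput("text", "Type your sentence:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    words <- strsplit(tolower(trimws(input$text)), "\\s+")[[1]]
    words <- words[nzchar(words)]          # drop empty tokens
    if (length(words) == 0) return("")
    top <- predict_next(words, tables)     # 'tables' loaded at app start
    paste("Predicted next words:", paste(top$word, collapse = ", "))
  })
}

shinyApp(ui = ui, server = server)
```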
This prototype application demonstrates the predictive capability; it can be embedded into any application, such as text messaging or email, to assist typing.