Predictive Text Input

Karoly Kovago
06/08/2017

Capstone Project

Data Science Specialization

Predictive Text Input

This application dramatically shortens the time it takes you to type a text. It works like the predictive keyboard on your smartphone: after each character you enter, it offers the 3 most probable words on buttons to complete your typing.

Performance

The fewer keys you need to press to type a text, the more useful the application is for you. This application is optimized to balance accuracy against complexity, memory use and speed. It:

- saves you 35% to 70% of the keystrokes in typical texts,
- is lightning fast, updating word predictions at each keystroke,
- starts up quickly, typically loading in less than 2 seconds,
- scans more than 1.3 million records of words and phrases to give you the most accurate predictions.

Under The Hood - Cleaning Text Files

All texts are converted to lower case. Non-ASCII characters are removed, and the files are cleaned of profanity.
To avoid separate predictions for expressions like “I'm” and “I am”, unambiguous contractions (e.g. “we'll”, “you've”) are expanded into “we will”, “you have”. In ambiguous cases (e.g. “you'd” can be “you would” or “you had”), the ' character is replaced with an underscore (_), because not all word transformations in R can handle ' (“you'd” becomes “you_d”).
Redundant white space, URLs, Twitter account names and retweet remarks are removed. Infrequent words (with fewer than 10 occurrences across all the text files) are collected and removed. One-character words, except for “a” and “i”, are removed. The cleaned text files are saved under new names.
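The cleaning steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the R code behind the application: the function name clean_line, the regular expressions and the small contraction table are assumptions, and profanity filtering and the removal of infrequent words are left out because they need word lists and corpus-wide counts.

```python
import re

# Hypothetical table of unambiguous contractions; the original list is not given.
UNAMBIGUOUS = {"'ll": " will", "'ve": " have", "'m": " am", "'re": " are"}

def clean_line(line: str) -> str:
    line = line.lower()
    # Drop URLs, Twitter account names and retweet markers.
    line = re.sub(r"https?://\S+|www\.\S+", " ", line)
    line = re.sub(r"@\w+", " ", line)
    line = re.sub(r"\brt\b", " ", line)
    # Remove non-ASCII characters.
    line = line.encode("ascii", "ignore").decode("ascii")
    # Expand unambiguous contractions, e.g. "we'll" -> "we will".
    for suffix, expansion in UNAMBIGUOUS.items():
        line = line.replace(suffix, expansion)
    # Any apostrophe still sitting between two letters is ambiguous:
    # replace it with an underscore, e.g. "you'd" -> "you_d".
    line = re.sub(r"(?<=[a-z])'(?=[a-z])", "_", line)
    # Keep only letters, underscores and spaces.
    line = re.sub(r"[^a-z_ ]", " ", line)
    # Drop one-character words except "a" and "i"; split() also
    # collapses redundant white space.
    words = [w for w in line.split() if len(w) > 1 or w in ("a", "i")]
    return " ".join(words)
```

For example, clean_line("We'll see @bob at http://x.co RT you'd b a") yields "we will see at you_d a".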

Under The Hood - Creating N-gram Tables

After lengthy experiments with different packages, stylo is used to create n-grams. Its efficiency makes it possible to process 90,000 lines from each source file, versus 5,000 with the tm package.
Source text files are processed one by one. N-grams from a specific file are vector-merged into global n-gram tables. To further speed up the application, bi-, tri- and four-grams with 1 occurrence are removed. Bi-grams with two consecutive identical characters are removed. N-gram tables are created with 3 columns:
- the last word in the n-gram (this needs to be searched with grep syntax to match a word you are typing)
- the previous words (e.g. the first 3 words in four-grams) which can be fast-searched
- the frequency of the n-gram
These tables are saved as the data source for the Shiny app.

The Shiny App

https://kkovago.shinyapps.io/predictive_text_input/

Start typing in the text box. Every key press triggers a prediction of the most probable words, displayed on 3 buttons. Click a button to add its predicted word to the text.
While the text field is empty, the 3 most frequent uni-grams are offered. When typing a word, the buttons show the 3 most probable completions. When you press space, the app predicts the 3 next words.
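The three situations above (empty field, word in progress, word just finished with a space) can be told apart from the text alone. The Python sketch below shows one possible way to do this; the helper name split_input is hypothetical and the app's actual Shiny logic is not shown in this document.

```python
def split_input(text):
    """Split the text box content into (context_words, partial_word).

    An empty partial_word means the next whole word should be predicted;
    an empty context with an empty partial_word means only uni-grams apply.
    """
    words = text.lower().split()
    if not text or text[-1].isspace():
        # Empty field, or the last word was finished with a space.
        return words, ""
    # The last token is still being typed: it is the word to complete.
    return words[:-1], words[-1]
```

Here split_input("how ar") returns (["how"], "ar"), a completion request, while split_input("how are ") returns (["how", "are"], ""), a next-word request.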

About the prediction

The core is a fast binary search for the preceding words and a grep-like search for the last word of the text. Weighted matrix calculations did not provide more accuracy.
The search falls back to lower-order n-grams if no results are found. The app always returns 3 word predictions.
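A minimal Python sketch of this search-and-fall-back idea follows. It is an illustration under stated assumptions, not the app's R code: the toy TABLES data is made up, the names lookup and predict are hypothetical, a plain prefix test stands in for the grep-like match, and filling the remaining slots from lower-order tables is only one guess at how 3 predictions could be reached.

```python
from bisect import bisect_left, bisect_right

# Toy tables keyed by n-gram order. Rows are (previous_words, last_word,
# frequency) and are sorted by previous_words. All values are invented.
TABLES = {
    3: [("how are", "things", 2), ("how are", "you", 9)],
    2: [("are", "we", 4), ("are", "you", 7), ("how", "are", 8)],
    1: [("", "and", 30), ("", "the", 50), ("", "to", 40)],
}

def lookup(rows, prefix, partial):
    """Binary-search the rows for prefix, then keep words starting with partial."""
    keys = [r[0] for r in rows]  # rebuilt per call here; precompute in practice
    lo, hi = bisect_left(keys, prefix), bisect_right(keys, prefix)
    hits = [r for r in rows[lo:hi] if r[1].startswith(partial)]
    return [word for _, word, _ in sorted(hits, key=lambda r: -r[2])]

def predict(context, partial="", k=3):
    """Return up to k candidate words, highest-order n-grams first."""
    out = []
    for n in sorted(TABLES, reverse=True):
        if n - 1 > len(context):
            continue  # not enough preceding words for this order
        prefix = " ".join(context[len(context) - (n - 1):])
        for word in lookup(TABLES[n], prefix, partial):
            if word not in out:
                out.append(word)
            if len(out) == k:
                return out
    return out
```

With this toy data, predict(["how", "are"]) takes "you" and "things" from the tri-grams and then fills the third slot with "we" from the bi-grams. Note that the sketch can return fewer than k words when the partial word matches nothing in any table.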