Corpus Based Predictive Keyboard

Umut Kahramankaptan
January 2nd, 2017

Data Science Specialization - Capstone Project

Johns Hopkins University - Coursera

Predictive Model - Statistical Pre-processing

1 - A predictive text model is built via statistical analysis of the corpus.

  • A corpus is a large, structured set of texts used for statistical natural language processing within a specific language domain. [Wikipedia]
  • Provided by JHU, it includes blog entries, tweets, and news articles from the Internet, focusing on English content produced by mobile device users.
  • The corpus is sampled for performance, then cleaned and standardized (a brief sketch follows below). Details can be found in the updated Milestone report.

2 - Uni-grams, bi-grams, tri-grams and quad-grams are extracted for statistical prediction (an extraction sketch also follows below).

  • N-grams: "a contiguous sequence of N items from a given … corpus." [Wikipedia]
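
A minimal sketch of the sampling and cleaning step, in R; the file name and the 5% sampling rate are assumptions for illustration, and the full pipeline is detailed in the Milestone report:

```r
# Illustrative sampling and cleaning (file name and sampling rate are
# assumptions; the full pipeline is described in the Milestone report).
set.seed(1234)
lines   <- readLines("en_US.twitter.txt", skipNul = TRUE)
sampled <- sample(lines, floor(length(lines) * 0.05))  # 5% sample for performance

cleaned <- tolower(sampled)
cleaned <- gsub("[^a-z' ]", " ", cleaned)      # drop punctuation and digits
cleaned <- gsub("\\s+", " ", trimws(cleaned))  # collapse extra whitespace
```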
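
And a base-R sketch of n-gram extraction; `extract_ngrams` is a hypothetical helper for illustration, not the function used in the application:

```r
# Base-R n-gram extraction sketch; extract_ngrams is a hypothetical helper.
extract_ngrams <- function(text, n) {
  words <- unlist(strsplit(text, "\\s+"))
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

extract_ngrams("thanks for the follow", 3)
#> [1] "thanks for the" "for the follow"
```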

Predictive Model - Katz's Back-off Model

Katz back-off uses consecutive levels of n-grams to estimate the conditional probability of a word, conditioning on as many preceding words as possible. More details can be found on Wikipedia. The steps are outlined below, followed by a simplified sketch.

  1. The prediction algorithm starts at the highest-order n-gram, using the last n-1 words of the input as the historical condition and looking up the most frequent n-grams that start with that (n-1)-word segment.
  2. If not enough next-word candidates are found, n is decreased by 1 and step 1 is repeated.
  3. When n = 1, no historical condition is required; the most frequent words from the corpus are selected as candidates.
  4. In our implementation, the highest number of words used as the historical condition is N-1 = 3 (so N = 4), and the number of candidates is C = 3.
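
A simplified sketch of this back-off loop; the `ngram_tables` structure (a list of frequency tables, one per n-gram order) and the frequency-only ranking are assumptions, since the actual model applies Katz discounting to the counts:

```r
# Simplified back-off loop. Assumed data structure: ngram_tables is a
# list indexed by n, each element a data frame with columns prefix,
# word, freq. The real model discounts counts per Katz back-off.
predict_next <- function(input_words, ngram_tables, n_max = 4, n_cand = 3) {
  candidates <- character(0)
  for (n in n_max:2) {
    prefix <- paste(tail(input_words, n - 1), collapse = " ")
    hits <- ngram_tables[[n]]
    hits <- hits[hits$prefix == prefix, ]
    hits <- hits[order(-hits$freq), ]
    candidates <- unique(c(candidates, hits$word))
    if (length(candidates) >= n_cand) return(head(candidates, n_cand))
  }
  # n = 1: no history required; fall back to the most frequent words.
  uni <- ngram_tables[[1]]
  uni <- uni[order(-uni$freq), ]
  head(unique(c(candidates, uni$word)), n_cand)
}
```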

Performance

Prediction Performance: The Katz back-off implementation used in this application requires 85.92 ms (± 1.84 ms) on average to produce 3 proposed next words, which is well under the mean visual stimulus response time (approximately 190 milliseconds) [Wikipedia].

  • While the response time of the user interface on a MacBook Pro (2.4 GHz Intel Core 2 Duo, 4 GB 1067 MHz DDR3) stays within the limits of mental chronometry, once the application is deployed through the ShinyApps.io free plan its responsiveness degrades significantly. At minimum, a mid-range mobile device would be required.
  • Calculations can be found in performance_test.R, provided with the application (a timing sketch follows below).
  • To optimize application performance, the n-grams will be saved in RDS format as part of the application. This approach speeds up application startup by keeping the tables in the host device's memory (see the loading sketch below).
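
A sketch of how such timings can be gathered; the real figures come from performance_test.R, and the sample phrase and the reuse of the `predict_next` sketch above are illustrative:

```r
library(microbenchmark)

# Illustrative timing run; the reported 85.92 ms (± 1.84 ms) figures
# come from performance_test.R shipped with the application.
microbenchmark(
  predict_next(c("thanks", "for", "the"), ngram_tables),
  times = 100
)
```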
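
And a sketch of the RDS caching approach; the file name and the use of global.R are assumptions:

```r
# Pre-compute the n-gram tables once and ship the .rds file with the
# app (file name is illustrative):
saveRDS(ngram_tables, "ngram_tables.rds")

# At application startup (e.g. in global.R), load the tables into memory:
ngram_tables <- readRDS("ngram_tables.rds")
```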

How to Use?

Conclusion and Future Work

The Predictive Keyboard enables mobile users to type faster and without typos, even on small screens. This Shiny application demonstrates its potential value and can be converted into a web service.

  • The implementation, the updated milestone report, and this presentation are available through GitHub.

Future Work:

  1. Exploring opportunities to not only suggest the next word, but also complete the word currently being typed. The n-gram suggestion logic might be altered accordingly.

  2. With more computational power available, I would like to use WordNet from Princeton University to correct non-standard words to their proper English form (e.g., "aaaalright" would become "alright") or remove them.