Jerome Smith
16th July 2016
Data Science Capstone Project
Johns Hopkins University
Coursera
Wouldn't you like to be able to write words on your smartphone with just one touch, instead of all that typing?
This application does precisely that: it predicts the four words you are most likely to type next, based on what you have already typed.
It doesn't get in your way: its suggestions are optional. If you don't want any of the words it suggests, just keep on typing in the usual way.
You can see a working demo here:
The app uses a prediction algorithm based on n-grams. An n-gram is a sequence of n consecutive words; here, it is the last n words of the phrase you have typed. For example, in “Oh, what a beautiful”, the associated 1-gram, 2-gram and 3-gram are “beautiful”, “a beautiful” and “what a beautiful”, respectively.
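As a small illustration, here is a sketch in base R of pulling the last n words out of a phrase. The helper name last_ngram is hypothetical, not part of the demo.

```r
# Extract the n-gram formed by the last n words of a phrase (illustrative helper).
last_ngram <- function(phrase, n) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))  # split on whitespace
  words <- gsub("[^a-z']", "", words)                 # strip punctuation
  words <- words[words != ""]
  if (length(words) < n) return(NA_character_)
  paste(tail(words, n), collapse = " ")
}

sapply(1:3, function(n) last_ngram("Oh, what a beautiful", n))
# "beautiful"  "a beautiful"  "what a beautiful"
```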
The prediction model is trained on large quantities of text pulled from Twitter, blogs and news websites. The algorithm extracts all the n-grams occurring in this text and, for each one, finds the four next words that follow it most frequently. It stores these n-grams, together with their most likely next words, in probability tables, one for each n-gram order.
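As a rough sketch of this training step (not the demo's actual code; the function and column names build_table, prefix, next_word and cnt are hypothetical), the base R below counts the words following each n-gram in a small corpus and keeps the four most frequent:

```r
# Build a frequency table: each n-word prefix -> its four most frequent next words.
# Assumes `corpus` is a character vector of cleaned sentences (illustrative only).
build_table <- function(corpus, n) {
  rows <- lapply(corpus, function(sentence) {
    words <- unlist(strsplit(tolower(sentence), "\\s+"))
    words <- gsub("[^a-z']", "", words)
    words <- words[words != ""]
    if (length(words) <= n) return(NULL)
    idx <- seq_len(length(words) - n)
    data.frame(prefix    = sapply(idx, function(i) paste(words[i:(i + n - 1)], collapse = " ")),
               next_word = words[idx + n],
               cnt       = 1L,
               stringsAsFactors = FALSE)
  })
  counts <- aggregate(cnt ~ prefix + next_word, data = do.call(rbind, rows), FUN = sum)
  counts <- counts[order(counts$prefix, -counts$cnt), ]   # most frequent first
  do.call(rbind, by(counts, counts$prefix, head, 4))      # keep top four per prefix
}

corpus <- c("oh what a beautiful morning", "oh what a beautiful day")
build_table(corpus, n = 3)   # e.g. prefix "what a beautiful" -> "morning", "day"
```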
Every time you type, the app looks up the longest n-gram of your phrase in the corresponding probability table and retrieves the associated next words. If that n-gram is not in the table, it backs off and tries the next shorter n-gram (n-1), ultimately falling back to the 1-gram (a single word). It will almost always find a match: although the number of possible n-grams is effectively unlimited, there are relatively few possible 1-grams (words in the English language).
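A minimal sketch of that back-off lookup, assuming the tables are held in a list indexed by n-gram order with the prefix/next_word layout of the training sketch above (all names are illustrative):

```r
# Look up the longest matching n-gram, backing off to shorter ones if needed.
predict_next <- function(phrase, tables, max_n = 4) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  words <- gsub("[^a-z']", "", words)
  words <- words[words != ""]
  if (length(words) == 0) return(character(0))
  for (n in seq(min(max_n, length(words)), 1)) {       # longest n-gram first
    key  <- paste(tail(words, n), collapse = " ")
    hits <- tables[[n]]$next_word[tables[[n]]$prefix == key]
    if (length(hits) > 0) return(head(hits, 4))        # found a match: stop here
  }
  character(0)                                         # no match at any order
}
```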
The demo application has been developed using the R statistical computing language. Its main components include:
The training algorithm collects all observed n-grams and their associated next words from the data and builds the probability tables from them.
The prediction function takes a phrase as input and uses the probability tables loaded into memory to look up the most probable next words (a brief usage sketch follows this list).
The demo uses up to 4-grams and hence has 4 probability tables.
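Tying the pieces together, an illustrative call using the build_table and predict_next sketches from earlier (the demo's real function and table names may differ):

```r
# Train one table per n-gram order, then predict from a typed phrase.
tables <- lapply(1:4, function(n) build_table(corpus, n))
predict_next("Oh, what a beautiful", tables)
# suggested next words, e.g. "day" "morning"
```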
This application is intended to be used on smartphones and must run quickly to be of practical value, despite their limited memory and processing power. We therefore propose the following extensions to its design, to improve both its performance and its accuracy:
Write the smartphone version of the application in a high-performance, low-level language such as C++.
Have the app continuously and automatically re-train itself, customising its predictions to the user's particular way of writing by adding the user's own n-grams and words to the probability tables. This should improve predictive accuracy significantly (a sketch of this idea follows).
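Purely as an illustration of this idea, not a committed design, the sketch below folds the n-grams of a newly typed phrase back into the frequency tables from the earlier sketches, so the counts gradually adapt to the user's own vocabulary. All names are hypothetical.

```r
# Fold a user's completed phrase back into the frequency tables (illustrative).
update_tables <- function(phrase, tables, max_n = 4) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  words <- gsub("[^a-z']", "", words)
  words <- words[words != ""]
  if (length(words) < 2) return(tables)                # nothing to learn from
  for (n in seq_len(min(max_n, length(words) - 1))) {
    idx <- seq_len(length(words) - n)
    new <- data.frame(prefix    = sapply(idx, function(i) paste(words[i:(i + n - 1)], collapse = " ")),
                      next_word = words[idx + n],
                      cnt       = 1L,
                      stringsAsFactors = FALSE)
    # add the new observations to the existing counts for this n-gram order
    tables[[n]] <- aggregate(cnt ~ prefix + next_word,
                             data = rbind(tables[[n]], new), FUN = sum)
  }
  tables
}

tables <- update_tables("oh what a glorious morning", tables)
```

For simplicity this sketch updates the already-truncated tables; a real implementation would keep the full counts and re-select the top four next words after each update.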