http://predictkey.com
Highly efficient solutions such as SwiftKey already use trailing n-grams to predict the next word, so we should not attempt to replicate them; instead, we should focus on their shortcomings.
A contextual model also opens up opportunities beyond keystroke prediction; for example, such a model could provide interactive writing-assistance tools, similar to a thesaurus.
N-gram models are designed to quickly narrow the possible word choices down to a small set, using only the continuous sequence of trailing words prior to the predicted word. This works well for common phrases, but it does not identify subject matter. Instead of using continuous n-gram sequences, PredictKey first uses the single trailing word together with part-of-speech tags on the preceding words to narrow the candidate predictions to a set of 50-100.
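As a rough sketch of this first narrowing step, candidates can be pulled from a precomputed table keyed on the trailing word (the `bigram_counts` data frame and function names below are illustrative rather than PredictKey's actual code, and the part-of-speech filter is omitted for brevity):

```r
# Hypothetical lookup table: a data frame of (prev_word, next_word, count)
# built offline from a training corpus.
candidates_for <- function(prev_word, bigram_counts, max_candidates = 100) {
  hits <- bigram_counts[bigram_counts$prev_word == prev_word, ]
  hits <- hits[order(-hits$count), ]       # most frequent continuations first
  head(hits$next_word, max_candidates)     # keep a small candidate set
}
```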
After identifying candidate words, a bag-of-words feature set built from the preceding 10-20 words can be used to classify the candidates by subject matter and bring the most relevant words to the top.
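One way to build that feature vector is sketched below, assuming a fixed vocabulary vector `vocab`; the names are illustrative, not taken from PredictKey's code:

```r
library(Matrix)

# Turn the preceding `window` words into a sparse bag-of-words vector
# over a fixed vocabulary.
context_features <- function(words, vocab, window = 20) {
  recent <- tail(tolower(words), window)    # last 10-20 words of context
  idx <- match(recent, vocab)               # map words to vocabulary positions
  counts <- table(idx[!is.na(idx)])         # count occurrences, drop unknowns
  sparseVector(x = as.numeric(counts),
               i = as.integer(names(counts)),
               length = length(vocab))
}
```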
To build a contextual system, we need a model that supports high-dimensional sparse data and allows multiclass classification over a potentially large number of candidate words.
For this, we build a linear SVM model for each word that the system is able to predict. To train each model, we sample both positive and negative examples of contexts in which that word appears as a candidate.
We can also use inverse document frequency weights together with regularization to prevent overfitting on common words.
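A minimal training sketch under these assumptions: `X` is an illustrative (dense, for simplicity) term matrix of sampled context windows, `y_list` holds hypothetical positive/negative labels indicating whether each word actually followed the sampled context, and e1071's linear SVM stands in for whichever solver is used in practice:

```r
library(e1071)

# IDF weighting down-weights ubiquitous context words; `cost` controls the
# strength of regularization (smaller cost = stronger penalty).
idf <- log(nrow(X) / pmax(colSums(X > 0), 1))
Xw  <- sweep(X, 2, idf, `*`)

fit_word_model <- function(word) {
  svm(x = Xw, y = factor(y_list[[word]]),
      kernel = "linear", cost = 1, scale = FALSE)
}
models <- lapply(candidate_words, fit_word_model)

# With a linear kernel, the per-word coefficient vector can be recovered
# from the support vectors; this is what gets stored for prediction.
weights_of <- function(m) drop(t(m$coefs) %*% m$SV)
```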
Building the models can be time-consuming, but this only needs to be done once, at training time. Afterwards, we are left with a set of sparse coefficients for each candidate word.
Using a database such as SQLite or PostgreSQL, we can efficiently store and index these coefficients, which allows retrieval and model evaluation in under 100 ms. SQLite is provided on many mobile devices, allowing easy implementation.
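The storage and scoring step might look like the following sketch, using DBI with RSQLite; the table layout and names are assumptions for illustration, not the actual schema:

```r
library(DBI)
library(RSQLite)

# Persist each candidate word's nonzero coefficients so that scoring at
# prediction time is one indexed lookup plus a dot product.
con <- dbConnect(RSQLite::SQLite(), "predictkey.sqlite")
dbExecute(con, "CREATE TABLE IF NOT EXISTS coef
                (word TEXT, feature TEXT, weight REAL)")
dbExecute(con, "CREATE INDEX IF NOT EXISTS coef_word_idx ON coef (word)")

store_weights <- function(word, w) {        # w: named numeric weight vector
  nz <- which(w != 0)
  dbWriteTable(con, "coef",
               data.frame(word = word, feature = names(w)[nz],
                          weight = unname(w[nz])),
               append = TRUE)
}

# Score a candidate as the dot product of its stored weights with the
# bag-of-words counts of the current context (a named numeric vector).
score_candidate <- function(word, context_counts) {
  w <- dbGetQuery(con, "SELECT feature, weight FROM coef WHERE word = ?",
                  params = list(word))
  sum(w$weight * context_counts[w$feature], na.rm = TRUE)
}
```

Indexing on `word` keeps each lookup to a small slice of the table, which is what makes sub-100 ms evaluation plausible even on a mobile device.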
For demonstration and testing purposes, a Shiny app is provided at http://demo.predictkey.com. The application allows the user to supply a piece of text and estimates the probability of each candidate next word. Two values are shown: a “prior” probability, which is the estimate from a bigram model, and a “score”, which is the estimate after applying the contextual model. The predictions update after a short pause whenever the text is altered, making it easy to see how the model performs.
In the near future, an API will also be provided for integration with third-party applications.