Patrick Simon
August 14th, 2019
A text prediction app
Capstone project for the JHU Data Science Specialization
The goals
1) Develop a text prediction app like the ones you might see on a cellphone.
2) Given an input of one or more words, the app should make a prediction for the next word.
3) The app should respond quickly and keep its memory footprint small.
The method
1) Use Twitter/Blog/News texts (source: HC Corpora) to build tables of the most common word sequences (n-grams); see the sketch after this list.
2) Optimize tables for small storage size and fast access.
3) Use a simple backoff model to determine the most likely prediction.
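As a rough illustration of steps 1 and 2, building and pruning the n-gram tables could look like the following Python sketch. The corpus sample, tokenization rules, and names are illustrative assumptions; only the frequency cutoff of 4 matches the app.

    from collections import Counter
    import re

    def tokenize(text):
        # Lowercase and keep only letters and apostrophes; the app's real cleaning
        # (profanity, numbers, URLs, special characters) is more involved.
        return re.findall(r"[a-z']+", text.lower())

    def count_ngrams(lines, n):
        # Count every n-gram of order n across the corpus lines.
        counts = Counter()
        for line in lines:
            tokens = tokenize(line)
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        return counts

    # Hypothetical two-line corpus; the project used Twitter/blog/news texts from HC Corpora.
    corpus = ["this is just a small example line",
              "another small example line for counting"]

    # Build 1-gram to 5-gram tables, then prune rare entries to keep them small
    # (the app pre-computed scores only for n-grams seen 4 or more times).
    tables = {n: count_ngrams(corpus, n) for n in range(1, 6)}
    MIN_FREQ = 4
    pruned = {n: Counter({k: c for k, c in t.items() if c >= MIN_FREQ})
              for n, t in tables.items()}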
rickPredict [Link / GitHub] lets the user type into a text field and uses up to the last four words to make its top 3 predictions for the next word.
It looks up scores for candidate 2-grams to 5-grams according to the Stupid Backoff model by Brants et al. [Link] with \( \alpha = 0.4 \). All scores for n-grams with a frequency of 4 or higher were pre-computed and stored in tables on the server.
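In the Stupid Backoff scheme, the score of a candidate word \( w_i \) given up to four preceding words is defined recursively: use the relative frequency of the longest observed n-gram, otherwise back off to a shorter context, discounting by \( \alpha \) at each step:

\[
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[1ex]
\alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise,}
\end{cases}
\]

ending at the unigram score \( S(w_i) = f(w_i)/N \), where \( N \) is the corpus size. Below is a minimal Python sketch of this lookup, reusing the tables dictionary of n-gram counts from the sketch above; the names and data structures are illustrative, not the app's actual server-side code.

    ALPHA = 0.4  # back-off factor; the app uses alpha = 0.4

    def stupid_backoff_score(context, word, tables, total_words):
        # Score `word` given the preceding words, starting from the longest
        # matching n-gram (up to a 5-gram) and backing off one order at a time.
        max_k = min(len(context), 4)
        for k in range(max_k, 0, -1):
            history = tuple(context[-k:])
            ngram_count = tables[k + 1].get(history + (word,), 0)
            history_count = tables[k].get(history, 0)
            if ngram_count > 0 and history_count > 0:
                # Each back-off step multiplies the relative frequency by ALPHA.
                return ALPHA ** (max_k - k) * ngram_count / history_count
        # No observed history at all: fall back to the unigram relative frequency.
        return ALPHA ** max_k * tables[1].get((word,), 0) / total_words

    def predict_next(context, tables, total_words, top=3):
        # Rank the whole vocabulary and return the top candidates.
        vocab = (w for (w,) in tables[1])
        ranked = sorted(vocab,
                        key=lambda w: stupid_backoff_score(context, w, tables, total_words),
                        reverse=True)
        return ranked[:top]

Here total_words is the sum of all unigram counts, and a call like predict_next(["thanks", "for", "the"], tables, total_words) would return the three highest-scoring next words.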
The list of possible n-grams was filtered to exclude profanity, numbers, URLs, and various special characters.
To assess the quality of the app, a benchmark program was run [Link]. It uses the algorithm on a set of tweets and blog texts to determine speed, memory usage, and accuracy.
Compared with the prediction algorithms built by fellow students, rickPredict achieves reasonably good accuracy and very good speed.
Current word prediction - shows a prediction for the word you are currently typing, based on the letters entered so far and the preceding words (see the sketch below).
Expert mode - shows additional detail for the top 3 predictions, namely the frequency and order of their respective n-grams, as well as the calculated backoff score. You can also change the value of \( \alpha \).
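One plausible way to implement the current word prediction, reusing the scoring function from the sketch above (the app's actual implementation may differ), is to filter candidate words by the typed prefix before ranking them:

    def predict_current_word(context, prefix, tables, total_words, top=3):
        # Keep only words that start with the letters typed so far,
        # then rank them with the same Stupid Backoff score.
        candidates = (w for (w,) in tables[1] if w.startswith(prefix))
        ranked = sorted(candidates,
                        key=lambda w: stupid_backoff_score(context, w, tables, total_words),
                        reverse=True)
        return ranked[:top]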