Word predictor app

Gil Huesca
10/30/2020

The task

Smart keyboards make it easier for people to type on their mobile devices. One cornerstone of a smart keyboard is its predictive text model.

When a user types a word, the app presents three options for what the next word might be. It is available at https://gilhuesca.shinyapps.io/WordPredictorGilHuesca/. A user guide can be found under the second option in the app menu.

The solution

The application was trained on a sample of 100,000 texts from three English-language datasets containing texts from Twitter, blog posts and news publications. Contractions were expanded to their full forms. The text was then tokenized with the MC_tokenizer R function. Profanity was removed using the list published by Google. Finally, all tokens were converted to lower case.
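The cleaning steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's R code: the `CONTRACTIONS` and `BANNED` tables are tiny hypothetical stand-ins for the real contraction rules and the profanity list, and a simple letters-only regular expression stands in for MC_tokenizer.

```python
import re

# Hypothetical stand-ins: the real app uses a fuller contraction table
# and a published profanity list.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}
BANNED = {"darn"}

def preprocess(text):
    """Expand contractions, tokenize, drop banned words, lower-case."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        pattern = r"\b" + re.escape(short) + r"\b"
        text = re.sub(pattern, full, text, flags=re.IGNORECASE)
    tokens = re.findall(r"[A-Za-z]+", text)  # assumption: letters-only tokens
    tokens = [t for t in tokens if t.lower() not in BANNED]
    return [t.lower() for t in tokens]

print(preprocess("I can't believe it's darn COLD"))
```

Expanding contractions before tokenizing matters in this sketch, because a letters-only tokenizer would otherwise split "can't" into "can" and "t".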

3-gram, 2-gram and 1-gram frequency tables were created. The Kneser–Ney algorithm was then applied to compute prediction probabilities for each element in the 2-gram table and in the 1-gram table. For each element, the three words with the highest probabilities were stored in the table. This precomputation was chosen to improve the application's performance, because computing the probabilities at runtime was time-consuming. As a result, the application only has to look up the n-gram history and display its stored predictions.
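To illustrate the precomputation idea, here is a minimal Python sketch that counts 2-grams and stores the three most probable next words for each one-word history. It assumes the interpolated form of Kneser-Ney at the bigram level with a fixed discount of 0.75; the source does not say which variant or discount the app uses, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

def top3_kneser_ney(sentences, discount=0.75):
    """Map each one-word history to its three most probable next words."""
    bigrams = Counter()
    for tokens in sentences:
        bigrams.update(ngrams(tokens, 2))

    history_count = Counter()     # c(v): how often v occurs as a history
    followers = defaultdict(set)  # distinct words seen after v
    preceders = defaultdict(set)  # distinct words seen before w
    for (v, w), c in bigrams.items():
        history_count[v] += c
        followers[v].add(w)
        preceders[w].add(v)

    # Continuation probability: share of distinct bigram types ending in w.
    n_types = len(bigrams)
    p_cont = {w: len(p) / n_types for w, p in preceders.items()}

    table = {}
    for v, c_v in history_count.items():
        lam = discount * len(followers[v]) / c_v  # mass freed by discounting
        scores = {
            w: max(bigrams[(v, w)] - discount, 0) / c_v + lam * pc
            for w, pc in p_cont.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        table[v] = ranked[:3]  # keep only what the app needs to display
    return table

corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["i", "drink", "tea"]]
print(top3_kneser_ney(corpus)["i"])
```

Because only the top three words per history are kept, the stored table stays small and a prediction at runtime reduces to a single dictionary lookup.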

Performance metrics

File sizes (in bytes)

1-gram file: 5,201,960 bytes (76,372 elements)
2-gram file: 19,414,192 bytes (909,427 elements)
3-gram file: 1,870,754 elements; not stored in the app, because it was used only to compute the predictions with the Kneser-Ney algorithm

Average running times (in seconds)

2-gram found: user=0.005, system=0.000, elapsed=0.005
2-gram not found, 1-gram found: user=0.007, system=0.001, elapsed=0.008
2-gram not found, 1-gram not found: user=0.009, system=0.000, elapsed=0.009
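The three timed cases correspond to a lookup that backs off from the 2-gram table to the 1-gram table. A minimal Python sketch of that lookup is shown below; the table layout and the `fallback` list are assumptions, since the source does not describe what the app shows when neither table contains the history.

```python
def predict(text, bigram_table, unigram_table, fallback):
    """Return stored predictions, backing off from 2-gram to 1-gram history."""
    words = text.lower().split()
    # Case 1: the last two words are a known 2-gram history.
    if len(words) >= 2 and tuple(words[-2:]) in bigram_table:
        return bigram_table[tuple(words[-2:])]
    # Case 2: 2-gram not found, but the last word is a known 1-gram history.
    if words and words[-1] in unigram_table:
        return unigram_table[words[-1]]
    # Case 3: neither found; return a default list (hypothetical behavior).
    return fallback

# Toy tables with made-up entries, for illustration only.
bigram_table = {("i", "like"): ["tea", "coffee", "it"]}
unigram_table = {"like": ["a", "the", "it"]}
fallback = ["the", "to", "and"]

print(predict("I like", bigram_table, unigram_table, fallback))
print(predict("They like", bigram_table, unigram_table, fallback))
print(predict("Hello", bigram_table, unigram_table, fallback))
```

Each case adds at most one extra dictionary lookup, which is consistent with the small differences between the three measured running times.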

References

Nlp - 2.8 - Kneser-Ney Smoothing: https://www.youtube.com/watch?v=ody1ysUTD7o&list=PL3CEFE83D76F8B515&index=19
N-Grams (Continued): http://www.cs.uccs.edu/~jkalita/work/cs589/2010/4Ngrams2.pdf
A simple numerical example for Kneser-Ney Smoothing [NLP]: https://medium.com/@dennyc/a-simple-numerical-example-for-kneser-ney-smoothing-nlp-4600addf38b8
MC: A Toolkit for Creating Vector Models from Text Documents: http://www.cs.utexas.edu/users/dml/software/mc/
N-gram Modeling With Markov Chains: https://sookocheff.com/post/nlp/ngram-modeling-with-markov-chains/
Good–Turing frequency estimation: https://en.wikipedia.org/wiki/Good–Turing_frequency_estimation
Next Word Prediction using Katz Backoff Model: https://rpubs.com/leomak/TextPrediction_KBO_Katz_Good-Turing
N-gram models: http://www.cs.cornell.edu/courses/cs4740/2014sp/lectures/smoothing+backoff.pdf