Gil Huesca
10/30/2020
Smart keyboards make it easier for people to type on their mobile devices. One cornerstone of a smart keyboard is its predictive text model.
The app presents three candidates for the next word each time the user enters a word. It is available at https://gilhuesca.shinyapps.io/WordPredictorGilHuesca/. A user guide can be found under the second option in the app menu.
The application was trained on a sample of 100,000 texts drawn from three English-language datasets: Twitter posts, blog posts and news publications. Contractions were expanded to their full forms. The text was then tokenized with the MC_tokenizer R function. Profanity was removed using the list published by Google. All tokens were converted to lower case.
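The preprocessing steps above can be sketched as follows. This is a minimal illustration in Python, not the app's R code: the function name preprocess, the tiny CONTRACTIONS and PROFANITY tables, and the letters-only regular expression are all assumptions standing in for the real contraction list, the Google profanity list and MC_tokenizer. The sketch also lower-cases first so that the contraction lookup is case-insensitive, which differs from the order listed above.

```python
import re

# Hypothetical, tiny stand-ins for the real resources (assumptions).
CONTRACTIONS = {"can't": "cannot", "won't": "will not",
                "don't": "do not", "i'm": "i am"}
PROFANITY = {"darn"}


def preprocess(text):
    """Lower-case, expand contractions, tokenize, and drop profanity."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Crude tokenizer: runs of letters only (a stand-in, not MC_tokenizer).
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in PROFANITY]


print(preprocess("I can't believe it, DARN!"))
# ['i', 'cannot', 'believe', 'it']
```

Dropping a profane token joins its two neighbours into an n-gram that never occurred in the original text; whether that matters depends on how often it happens in the sample.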
Frequency tables were built for 3-grams, 2-grams and 1-grams. The Kneser-Ney algorithm was then applied to compute the prediction probabilities for each element in the 2-gram table and in the 1-gram table. The three words with the highest probabilities were stored alongside each element in its table. This was done to improve the application's performance, because computing the probabilities at run time was time-consuming. As a result, the application only has to look up the n-gram history and display its stored predictions.
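The idea of precomputing the three best predictions can be sketched as follows. This is a simplified illustration in Python, not the app's implementation: it covers only one-word histories using 2-gram counts (the app also handles two-word histories using 3-gram counts), it uses interpolated Kneser-Ney with an assumed discount of 0.75, and all function names are hypothetical. It scores every vocabulary word for every history, which is fine for a toy corpus but would need pruning at the table sizes reported here.

```python
from collections import Counter, defaultdict


def bigram_counts(sentences):
    """Count adjacent word pairs in a list of token lists."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts


def build_tables(counts, discount=0.75):
    """Return (top-3 words per one-word history, top-3 fallback words)."""
    history_total = Counter()         # c(v): bigram tokens starting with v
    followers = defaultdict(set)      # distinct words seen after v
    predecessors = defaultdict(set)   # distinct words seen before w
    for (v, w), c in counts.items():
        history_total[v] += c
        followers[v].add(w)
        predecessors[w].add(v)

    # Continuation probability: share of bigram types that end in w.
    n_types = len(counts)
    p_cont = {w: len(p) / n_types for w, p in predecessors.items()}

    def rank(scores):
        # Highest score first; ties broken alphabetically.
        return sorted(scores, key=lambda w: (-scores[w], w))[:3]

    table = {}
    for v, total in history_total.items():
        # Probability mass freed by discounting, spread via p_cont.
        lam = discount * len(followers[v]) / total
        scores = {w: max(counts.get((v, w), 0) - discount, 0) / total
                     + lam * p_cont[w]
                  for w in p_cont}
        table[v] = rank(scores)
    return table, rank(p_cont)


def predict(table, fallback, word):
    """Run-time step: a dictionary lookup, no probabilities computed."""
    return table.get(word, fallback)


corpus = [["i", "am", "happy"], ["i", "am", "here"],
          ["i", "like", "tea"], ["you", "are", "here"]]
table, fallback = build_tables(bigram_counts(corpus))
print(predict(table, fallback, "i"))    # ['am', 'like', 'here']
print(predict(table, fallback, "zzz"))  # ['here', 'am', 'are']
```

Because the ranking is done once, offline, the run-time cost of a prediction is a single lookup, which is the trade-off the paragraph above describes.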
File sizes (in bytes)
1-gram file: 5,201,960 bytes (76,372 elements).
2-gram file: 19,414,192 bytes (909,427 elements).
3-gram file: 1,870,754 elements. It was not stored in the app because it was used only to compute the predictions with the Kneser-Ney algorithm.
Average running times (in seconds)