Amyas Walji
November 2019
Recent advances in mobile technology have made it possible for people to stay connected on the go. Typing quickly and accurately on a smartphone keyboard, however, is difficult without some form of language prediction.
Against this backdrop, the purpose of the Data Science Capstone project was to develop a data product that uses natural language processing to predict the next word a user may want to type. A Shiny application serves as the final product for this project.
The application is available at the following link: Shiny App.
The prediction algorithm is based on an N-gram language model, which relies on the Markov assumption that each word depends only on the N - 1 words that precede it in a given string of text. For this project, a series of four N-gram models was constructed: unigrams, bigrams, trigrams, and quadgrams.
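As a rough illustration of this step, the sketch below builds N-gram frequency counts from a tokenized corpus. The actual application is built in R with Shiny; Python is used here purely for readability, and the toy corpus, function name, and data structure are illustrative assumptions rather than the project's own code.

```python
from collections import Counter

def build_ngram_counts(tokens, max_n=4):
    """Count all N-grams for N = 1..max_n from a list of word tokens.

    Hypothetical helper: in the capstone, the corpus of blogs, news,
    and tweets would be cleaned and tokenized before this step.
    """
    counts = {}
    for n in range(1, max_n + 1):
        counts[n] = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts

# Toy corpus, purely for illustration.
tokens = "the cat sat on the mat and the cat slept".split()
counts = build_ngram_counts(tokens)
print(counts[2][("the", "cat")])  # -> 2 (the bigram "the cat" occurs twice)
```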
Given this dictionary of N-grams, the probabilities of possible next words are estimated by maximum-likelihood estimation combined with Good-Turing smoothing. Depending on the length of the text input, the prediction algorithm draws candidates from the highest-order N-grams that match the input and backs off to lower-order N-grams, at a fixed discount, when no match is found. This mechanism is known as the “Stupid Backoff” algorithm, a highly efficient and inexpensive method proposed by Brants et al. (2007).
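Under the same assumptions as the sketch above (counts stored as Counters keyed by word tuples), a minimal version of the scoring and prediction step might look as follows. It uses plain relative frequencies with the fixed backoff factor of 0.4 recommended by Brants et al. (2007); the Good-Turing adjustment of the counts is omitted for brevity, and the function names are hypothetical.

```python
def stupid_backoff_score(counts, context, word, alpha=0.4):
    """Score `word` following `context` with Stupid Backoff.

    Scores are unnormalized relative frequencies, not true probabilities;
    skipping normalization is what makes the method so cheap.
    """
    if not context:
        # Base case: unigram relative frequency.
        total = sum(counts[1].values())
        return counts[1][(word,)] / total if total else 0.0
    n = len(context) + 1
    ngram_count = counts[n][tuple(context) + (word,)]
    context_count = counts[n - 1][tuple(context)]
    if ngram_count > 0 and context_count > 0:
        return ngram_count / context_count
    # Unseen at this order: drop the oldest context word, discount by alpha.
    return alpha * stupid_backoff_score(counts, tuple(context)[1:], word, alpha)

def predict_next(counts, text, k=3):
    """Return the top-k candidate next words for the raw input `text`."""
    # Keep at most the last three words, matching the quadgram model.
    context = tuple(text.lower().split()[-3:])
    vocab = (w for (w,) in counts[1])
    scored = {w: stupid_backoff_score(counts, context, w) for w in vocab}
    return sorted(scored, key=scored.get, reverse=True)[:k]

print(predict_next(counts, "the cat"))  # -> ['sat', 'slept', 'the'] on the toy corpus
```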
[Screenshot: the Shiny app dashboard]
Shiny App: Link.
Shiny App Source Code on GitHub: Link.
Data Science Capstone on Coursera: Link.
References

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean (2007). Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858-867.