Introducing the Next Word Prediction App

Roger Hu
11/3/2019

The main goal of this project is to build a Shiny application that predicts the next word based on the immediately preceding words.

Training data: the original corpus is provided by the company SwiftKey and can be accessed here
The original corpus consists of over 4.2 million lines of English text from three sources: Twitter, blogs, and news articles. For practical reasons, 10% of the lines were randomly sampled and used to build the prediction model for this project.
The Next Word Prediction App is hosted on the Shiny.IO server
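
The 10% line sampling mentioned above can be sketched as follows. This is a minimal Python illustration, not the project's R code; the function name `sample_lines`, the fixed seed, and the toy input are assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.10, seed=42):
    """Return a random subset holding roughly `fraction` of the input lines."""
    rng = random.Random(seed)        # fixed seed (an assumption) keeps the sample reproducible
    k = int(len(lines) * fraction)   # number of lines to keep
    return rng.sample(lines, k)      # sampling without replacement

# Toy stand-in for the corpus: 1,000 numbered lines.
corpus = [f"line {i}" for i in range(1000)]
subset = sample_lines(corpus)
print(len(subset))  # 100
```

Sampling without replacement keeps each original line at most once, so the subset stays a fair miniature of the three text sources.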

After the original corpus is sampled and processed (text cleaning and stemming):

the quanteda package is used to create the N-gram model; a tri-gram model (N = 3) is created for this particular application
'Modified Kneser-Ney Smoothing' is applied to the tri-gram model
the data.table package is used for performing calculations and for retrieving data and making predictions from the smoothed tri-gram model
the model uses the two immediately preceding words as the “base” for predicting the next word
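
The idea of a two-word “base” can be illustrated with a small lookup table. The sketch below is in Python rather than R, ranks candidates by raw tri-gram counts, and deliberately omits the Modified Kneser-Ney smoothing the app applies; the names `build_trigram_table` and `predict_next` and the toy sentence are assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_trigram_table(tokens):
    """Map each two-word base to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        table[(w1, w2)][w3] += 1
    return table

def predict_next(table, w1, w2, k=5):
    """Return up to k most frequent followers of the base (w1, w2)."""
    followers = table.get((w1, w2))
    if not followers:
        return []  # unseen base; a smoothed model would fall back to lower orders here
    return [word for word, _ in followers.most_common(k)]

tokens = "i want to go home i want to eat i want to go out".split()
table = build_trigram_table(tokens)
print(predict_next(table, "want", "to"))  # ['go', 'eat']
```

With raw counts, an unseen base yields no prediction at all, which is one reason a smoothed model is preferable in practice.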

Start typing into the input box located on the left side of the app
Words used for making the prediction and the top 5 predicted following words are displayed on the right side
Please note that for both the user's input and the training text data:
- English stop words are removed
- Words are stemmed
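
The effect of these two steps on the user's input can be sketched as below. This is a hedged Python illustration only: the stop-word list and the suffix-stripping `toy_stem` function are hypothetical stand-ins for the stop-word list and stemmer actually used by the app, and `prediction_base` is a name invented for the example.

```python
# Hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemmer

def toy_stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def prediction_base(text, n=2):
    """Lower-case, keep alphabetic tokens, drop stop words, stem, keep the last n."""
    tokens = [w for w in text.lower().split() if w.isalpha()]
    kept = [toy_stem(w) for w in tokens if w not in STOP_WORDS]
    return kept[-n:]

print(prediction_base("The dogs are walking in the parks"))  # ['walk', 'park']
```

Because stop words are dropped before the base is chosen, the two words shown by the app may not be the last two words the user typed.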

Coursera Data Science Capstone by Johns Hopkins University (Leek, J., Peng, R., & Caffo, B.), https://www.coursera.org/learn/data-science-project/home/welcome
N-Grams and Language Modeling: Jurafsky, D. & Manning, C. “Natural Language Processing - Lecture Slides from Stanford Coursera Course”, https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
Modified Kneser-Ney smoothing: Chen, S. & Goodman, J. (1999) “An Empirical Study of Smoothing Techniques for Language Modeling”, Computer Speech and Language 13, 359-394, http://www.idealibrary.com