Roger Hu
11/3/2019
The main goal of this project is to build a Shiny application that predicts the next word from the immediately preceding words.
After the original corpora are sampled and processed (text cleaning and stemming), the quanteda package is used to create the N-gram model. A trigram (N = 3) model is created for this particular application.
Coursera Data Science Capstone by Johns Hopkins University (Leek, J., Peng, R., & Caffo, B.), https://www.coursera.org/learn/data-science-project/home/welcome
N-Grams and Language Modeling: Jurafsky, D. & Manning, C. “Natural Language Processing - Lecture Slides from Stanford Coursera Course”, https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
Modified Kneser-Ney smoothing: Chen, S. & Goodman, J. (1999) “An Empirical Study of Smoothing Techniques for Language Modeling”, Computer Speech and Language 13, 359-394, http://www.idealibrary.com
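To illustrate the trigram idea described above, here is a minimal sketch of next-word prediction from raw trigram counts. It is written in Python purely for illustration and is not the application's quanteda code. The function names (tokenize, build_model, predict_next), the toy corpus, and the fallback to bigram counts when a two-word context is unseen are all assumptions made for this example. It also uses plain counts rather than modified Kneser-Ney smoothing, and it skips the cleaning and stemming steps.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace. The project's real text cleaning
    # and stemming are not reproduced here (assumption for illustration).
    return text.lower().split()


def build_model(sentences):
    # trigrams: two-word context -> Counter of the words that follow it.
    trigrams = defaultdict(Counter)
    # bigrams: one-word context -> Counter of following words. Used only as
    # a fallback, which is an assumption and not taken from the source.
    bigrams = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - 1):
            bigrams[tokens[i]][tokens[i + 1]] += 1
        for i in range(len(tokens) - 2):
            trigrams[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return trigrams, bigrams


def predict_next(model, text):
    # Return the most frequent continuation of the last two words, falling
    # back to the last word alone, or None if neither context was seen.
    trigrams, bigrams = model
    tokens = tokenize(text)
    if len(tokens) >= 2:
        context = (tokens[-2], tokens[-1])
        if context in trigrams:
            return trigrams[context].most_common(1)[0][0]
    if tokens and tokens[-1] in bigrams:
        return bigrams[tokens[-1]].most_common(1)[0][0]
    return None


# Hypothetical toy corpus, standing in for the sampled and processed text.
toy_corpus = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow",
    "looking forward to the weekend",
]
model = build_model(toy_corpus)
print(predict_next(model, "Thanks for the"))
```

In this toy corpus the context "for the" is followed by "follow" twice and "help" once, so the sketch suggests "follow". A smoothed model would instead combine evidence from several N-gram orders rather than picking the raw maximum.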