Word Prediction: Capstone project

GB
6/5/2016

The Shiny app Word Prediction aims to predict the next word from the words already typed.
The app saves time while typing: suggested words appear after the first several words have been entered.
The app is only useful if the suggested words are predicted correctly and reasonably quickly.
This work is a proof of concept, and further funding is needed to increase the model accuracy.

Support Material

Natural Language Processing is a new topic for me, and the following material was used to develop the model:

“Text Mining Infrastructure in R” by Ingo Feinerer, Kurt Hornik and David Meyer, published in Journal of Statistical Software, March 2008, Volume 25, Issue 5.

“Implementation of Modified Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction” by Martin Christian Korner, published by the Institute for Web Science and Technologies, September 2013.
“An Empirical Study of Smoothing Techniques for Language Modeling” by Stanley F. Chen and Joshua Goodman, published by the Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts.
“Natural Language Processing” course by Dan Jurafsky and Christopher Manning (Stanford), offered on the Coursera platform (https://class.coursera.org/nlp/lecture)

Data processing for this app is described here: https://rpubs.com/GintasBu/ExplorText.

Model Milestones

The available text was used to build an n-gram model. The final model uses bi-grams and tri-grams. 4-grams were also constructed, but they were not used because of the response time and the limited computational resources available to run the app.
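As an illustration of the n-gram construction step, here is a minimal base R sketch. The helper make_ngrams and the two toy sentences are hypothetical and are not taken from the app, which may well have used a text mining package instead.

```r
# Hypothetical helper: all n-grams of order n from a vector of tokens.
make_ngrams <- function(tokens, n) {
  len <- length(tokens)
  if (len < n) return(character(0))
  starts <- seq_len(len - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy corpus (made up for illustration).
sentences <- c("the cat sat on the mat", "the cat ate the fish")
tokens <- strsplit(tolower(sentences), "\\s+")

# Build n-grams per sentence so that they never span sentence boundaries.
bigrams  <- unlist(lapply(tokens, make_ngrams, n = 2))
trigrams <- unlist(lapply(tokens, make_ngrams, n = 3))

# Frequency tables, most frequent first.
bigram_counts  <- sort(table(bigrams), decreasing = TRUE)
trigram_counts <- sort(table(trigrams), decreasing = TRUE)
```

The same helper with n = 4 would produce the 4-grams that were constructed but left out of the final model.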

The number of bi-grams and tri-grams was reduced by requiring each n-gram to contain at least one word from the list of the 1000 most popular words. This reduced the bi-grams to about 81% of their initial number and the tri-grams to 90%.
Kneser-Ney smoothing was chosen as one of the better-performing smoothing techniques reported in the literature listed above.
To reduce the response time (the time the app takes to predict the next word), Kneser-Ney smoothing was applied only to bi-grams.
To further reduce the response time, index tables were built for the bi-grams.
The input text is stemmed, and punctuation, numbers and stop-words are removed. The predicted words are stemmed and exclude stop-words, because predicting words that are not stop-words is more useful.
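The bi-gram smoothing step can be sketched as follows. This is an interpolated Kneser-Ney estimate with a single fixed discount d = 0.75. Both the interpolated variant and the discount value are assumptions, since the exact variant used in the app is not stated here. The function kn_bigram and the toy counts are hypothetical.

```r
# Toy bi-gram counts (made-up numbers); each (w1, w2) pair appears once.
bigrams <- data.frame(
  w1 = c("new", "new", "new", "san", "york", "york"),
  w2 = c("york", "car", "idea", "francisco", "city", "time"),
  n  = c(8, 1, 1, 5, 3, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: smoothed probabilities of the next word given `context`.
kn_bigram <- function(bigrams, context, d = 0.75) {
  # Continuation probability: share of distinct bi-gram types ending in each word.
  tab <- table(bigrams$w2)
  p_cont <- setNames(as.numeric(tab) / nrow(bigrams), names(tab))

  seen <- bigrams[bigrams$w1 == context, ]
  if (nrow(seen) == 0) {
    # Unseen context: fall back to the continuation probabilities alone.
    return(sort(p_cont, decreasing = TRUE))
  }

  c_ctx  <- sum(seen$n)               # total count of the context word
  lambda <- d * nrow(seen) / c_ctx    # probability mass freed by discounting
  p <- lambda * p_cont                # interpolation term for every candidate
  p[seen$w2] <- p[seen$w2] + pmax(seen$n - d, 0) / c_ctx  # discounted counts
  sort(p, decreasing = TRUE)
}

p <- kn_bigram(bigrams, "new")
names(p)[1]   # most probable next word after "new"
```

The probabilities returned for a seen context sum to one, so the discounted mass is fully redistributed through the continuation term.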

Model Performance

To test the model performance, 1000 sentences were chosen from the provided blogs text file. These were the first 1000 sentences that were not used in constructing the n-grams. This was ensured by using the same set.seed value as when building the n-grams and inverting the selection for the test, so that only sentences left out of the n-gram sample were picked. The test selection was:

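set.seed(123456)                     # same seed as when building the n-grams
i <- rbinom(length(text), 1, 0.1)    # 0/1 draw per sentence, about 10% ones
text3 <- text[which(i == 0)]         # inverted selection: sentences not used for the n-grams
text4 <- text3[1:1000]               # first 1000 held-out sentences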

The selected sentences were pre-processed in the same way as for the model, including removal of bad words and stop-words, and stemming. Sentences with fewer than 3 words left after pre-processing were removed, which left 928 sentences to test. In each test sentence the last word was removed, and the remaining part of the sentence was used to predict it. The predicted word was then compared to the removed word. The prediction was correct in 113 out of 928 sentences, which in percent is:

[1] 12.2
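The test procedure above can be sketched in a few lines of R. The predictor predict_next and its toy lookup are hypothetical stand-ins for the app's model, and the test sentences are made up. Only the hold-out logic mirrors the description.

```r
# Hypothetical stand-in for the app's predictor: looks up the last word
# in a tiny table and returns NA when the word is unknown.
toy_lookup <- c(happy = "birthday", good = "morning", thank = "you")
predict_next <- function(words) {
  unname(toy_lookup[words[length(words)]])
}

# Made-up, already pre-processed test sentences.
test_sentences <- c("wish happy birthday", "say good night",
                    "want thank you", "ok")
tokens <- strsplit(test_sentences, "\\s+")
tokens <- tokens[lengths(tokens) >= 3]   # drop sentences with fewer than 3 words

# Remove the last word, predict it from the rest, and compare.
hits <- vapply(tokens, function(w) {
  n <- length(w)
  guess <- predict_next(w[-n])
  !is.na(guess) && guess == w[n]
}, logical(1))

toy_accuracy <- round(100 * mean(hits), 1)

# The reported figure follows from the counts in the text.
reported <- round(100 * 113 / 928, 1)
```

Here reported reproduces the 12.2 shown above, while toy_accuracy only describes the made-up sentences.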

Future work

The model can be improved in the following steps:

Increase the amount of text used to build the n-grams. Currently only 10% of the blogs, news and twitter text files is used. The reduced size was chosen because of limited computational resources and the time allotted.
Build an index table for tri-grams. Currently, applying Kneser-Ney smoothing to tri-grams for word prediction is close to impossible because of the computation time. In the word prediction problem the smoothing is applied in real time, so index tables are needed to reduce the computation time.
Include 4-grams with smoothing, and build index tables for them.
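One possible shape for a tri-gram index table is sketched below. This is only an assumption about how such a table could be organised, not the structure used for the bi-grams in the app. Rows are grouped by their two-word prefix and stored in a hashed environment, so a lookup touches only the candidates for that prefix. The names index and lookup and the toy counts are hypothetical, and no smoothing is applied here.

```r
# Toy tri-gram counts (made-up numbers).
trigrams <- data.frame(
  w1 = c("i", "i", "thank", "thank"),
  w2 = c("love", "love", "you", "you"),
  w3 = c("you", "it", "much", "all"),
  n  = c(5, 3, 7, 2),
  stringsAsFactors = FALSE
)

# Group candidate next words by their two-word prefix.
key <- paste(trigrams$w1, trigrams$w2)
by_prefix <- split(trigrams[, c("w3", "n")], key)

# A hashed environment keeps lookup cost independent of the table size.
index <- list2env(by_prefix, envir = new.env(hash = TRUE))

# Hypothetical lookup: most frequent third word for a prefix, NA if unseen.
lookup <- function(w1, w2) {
  hit <- index[[paste(w1, w2)]]
  if (is.null(hit)) return(NA_character_)
  hit$w3[which.max(hit$n)]
}

lookup("thank", "you")
lookup("no", "match")
```

A smoothed score could replace the raw count n in each stored table, which would move most of the computation out of the real-time path.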