Predicting the Next Word App

April 11, 2017

Predicting the next word –1

The app predicts the next word of a partial sentence from the words that precede it. It uses n-grams and the stupid backoff algorithm to make the prediction. The n-grams were generated with the following steps:

Datasets from three sources (Twitter, blogs, and news) were available from SwiftKey.
A 10% sample was drawn from each of the blog, news, and Twitter datasets.
The sampled datasets were converted into a corpus and cleaned.
The sentences were tokenized into unigrams, bigrams, trigrams, quadragrams, and pentagrams, and the frequency of each n-gram was counted.
Very low frequency n-grams (frequency < 2) were removed, and the remaining n-grams were written to datasets.
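
The tokenizing, counting, and pruning steps above can be sketched as follows. This is a minimal illustration in Python, not the app's R code: the function names (tokenize, count_ngrams, prune), the regular expression used for cleaning, and the three sample sentences are all assumptions made for the example.

```python
from collections import Counter
import re

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens.
    A simple stand-in for the corpus cleaning step (assumption)."""
    return re.findall(r"[a-z']+", sentence.lower())

def count_ngrams(sentences, max_n=5):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(1, max_n + 1):
            # Slide a window of width n across the tokens of this sentence.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_freq=2):
    """Drop n-grams seen fewer than min_freq times (frequency < 2 by default)."""
    return {n: Counter({g: c for g, c in table.items() if c >= min_freq})
            for n, table in counts.items()}

# Hypothetical toy corpus standing in for the sampled datasets.
sample = ["I love New York", "I love New Jersey", "We love New York"]
ngrams = prune(count_ngrams(sample))
```

With this toy corpus, the bigram ("new", "jersey") occurs only once and is removed by the pruning step, while ("love", "new") is kept.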

Predicting the next word –2

The app uses the unigram, bigram, trigram, quadragram, and pentagram datasets to make its predictions.
The datasets are loaded when the app starts.
The user enters a partial sentence (at least two words are requested).
The next words are predicted from the sentence with the stupid backoff algorithm, which calculates scores from the n-gram frequencies and applies a penalty each time it backs off to a shorter n-gram.

The Stupid Backoff Algorithm

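The details of the scoring for the algorithm can be found at http://www.aclweb.org/anthology/D07-1090.pdf (Brants et al., 2007). In brief, when an n-gram has been observed, the score of its last word is the n-gram's frequency divided by the frequency of its preceding context. Otherwise the algorithm backs off to a shorter context and multiplies the score by a fixed factor (0.4 in the paper).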
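
A minimal sketch of the scoring, written in Python rather than the app's R code, is shown below. The backoff factor of 0.4 is the value suggested in the paper; the toy counts, the function names (score, predict), and the choice to rank every unigram in the vocabulary are assumptions made for illustration and may differ from how the app looks up candidates.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor suggested in the paper

# Hypothetical toy n-gram counts, keyed by tuples of words.
counts = Counter({
    ("i",): 2, ("love",): 3, ("new",): 3, ("york",): 2,
    ("jersey",): 1, ("we",): 1,
    ("i", "love"): 2, ("love", "new"): 3,
    ("new", "york"): 2, ("new", "jersey"): 1,
    ("love", "new", "york"): 2,
})
# Total number of word tokens, used for the unigram score.
TOTAL_WORDS = sum(c for g, c in counts.items() if len(g) == 1)

def score(word, context):
    """Stupid backoff score of `word` given a tuple of preceding words."""
    penalty = 1.0
    while context:
        full = context + (word,)
        if counts[full] > 0 and counts[context] > 0:
            # n-gram observed: relative frequency, discounted by any backoffs.
            return penalty * counts[full] / counts[context]
        context = context[1:]  # drop the earliest word and back off
        penalty *= ALPHA
    # No context left: fall back to the unigram relative frequency.
    return penalty * counts[(word,)] / TOTAL_WORDS

def predict(sentence, top_k=5, max_context=4):
    """Rank every known word as a candidate next word for `sentence`."""
    context = tuple(sentence.lower().split())[-max_context:]
    vocab = [g[0] for g in counts if len(g) == 1]
    scored = {w: score(w, context) for w in vocab}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

top = predict("I love new")
```

Limiting the context to the last four words mirrors the use of n-grams up to pentagrams. The scores are not probabilities, since they are not normalized to sum to one.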

App Layout

The user inputs a partial sentence and then hits the submit button. The predicted words and their backoff scores are presented in a table, and the top 5 words are plotted in the main panel along with a wordcloud of up to the top 100 predictions.

Further Steps and Resources

The app predicts the next words, but its processing time could be improved.
Another enhancement would be a word completion component that suggests completions while the user is still typing a word.

References

Gendron G. (2015). MODSIM World (13): 1:10
Jurafsky D and Martin J. (2014). Speech and Language Processing
Chen S and Goodman J. (1998). An empirical study of smoothing techniques for language modeling

The app is hosted at PredictNextWord
The GitHub repository with the R code is located at Github