8/27/2020

What this app is about

  • My Shiny app is available at: https://zhaozheng.shinyapps.io/PredictNextWord/

  • You can enter any number of words, and my app will predict the next word you want to say.

  • The outcome is:

    1. the best prediction from my model, i.e. the word with the highest probability of appearing after the input text.
    2. two alternative predictions, the words with the 2nd and 3rd highest probabilities.

Algorithm

  • I use the simplest statistical language model, the n-gram model, to estimate the probability of the next word given the n-1 words before it.

  • An n-gram matrix is built from a training corpus. It contains every n-word phrase (n-gram) that appears in the corpus, together with its frequency, which is converted into a probability.

  • For any new text, the last (n-1) words are matched against the first (n-1) words of each n-gram in the matrix. Among the matching n-grams, the last word of the one with the highest probability is the best prediction.
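
The lookup described above can be sketched in a few lines. This is a minimal illustration in Python, not the app's own code: the names build_ngram_table and predict, the toy corpus, and the whitespace tokenization are all assumptions made for the example, and the "matrix" is stored as a dictionary from each (n-1)-word prefix to the counts of the words that follow it.

```python
from collections import Counter, defaultdict

def build_ngram_table(tokens, n):
    """Map each (n-1)-word prefix to a Counter of the words that follow it.

    Assumes n >= 2; names and data layout are hypothetical.
    """
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        table[prefix][tokens[i + n - 1]] += 1
    return table

def predict(table, text, n, k=3):
    """Return up to k (word, probability) pairs for the last n-1 words of text."""
    prefix = tuple(text.lower().split()[-(n - 1):])
    followers = table.get(prefix)
    if not followers:
        return []  # prefix never seen in the training corpus
    total = sum(followers.values())
    # Relative frequency turns the raw counts into probabilities.
    return [(word, count / total) for word, count in followers.most_common(k)]

# Toy corpus, assumed for illustration only.
corpus = "i want to go home . i want to go out . i want to sleep".split()
trigrams = build_ngram_table(corpus, 3)
print(predict(trigrams, "They want to", 3))
```

With k=3 the function returns the best prediction plus up to two alternatives, and an empty list when the prefix was never seen.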

Back-off model

  • I used 80% of the corpus as training data and 20% as test data, and evaluated my models by perplexity (the lower, the better).

  • Perplexity decreases as n grows, but models with n > 4 are computationally expensive and improve perplexity only slightly. I therefore set n = 4 and built a back-off model.

  • My model uses the 4-gram matrix if the last three words of your input can be matched; otherwise it “backs off” to the next lower-order n-gram matrix, continuing all the way down to unigrams if necessary.

  • The back-off model can therefore provide a prediction even if the phrase you entered never appeared in the training corpus.
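
The back-off procedure can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the app's implementation: build_tables and backoff_predict are hypothetical names, the corpus is a toy example, and the fallback simply takes the highest-order match without any discounting, since the slides do not say whether discounting is applied.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=4):
    """tables[n] maps an (n-1)-word prefix to a Counter of following words.

    For n = 1 the only prefix is the empty tuple, i.e. plain unigram counts.
    """
    tables = {}
    for n in range(1, max_n + 1):
        table = defaultdict(Counter)
        for i in range(len(tokens) - n + 1):
            table[tuple(tokens[i:i + n - 1])][tokens[i + n - 1]] += 1
        tables[n] = table
    return tables

def backoff_predict(tables, text, max_n=4, k=3):
    """Try the highest order first; on a miss, drop the oldest context word."""
    words = text.lower().split()
    for n in range(max_n, 0, -1):
        if len(words) < n - 1:
            continue  # not enough input words for this order
        prefix = tuple(words[len(words) - (n - 1):])
        followers = tables[n].get(prefix)
        if followers:
            return n, [word for word, _ in followers.most_common(k)]
    return 0, []  # only reachable with an empty training corpus

# Toy corpus, assumed for illustration only.
corpus = "i want to go home . i want to go out . i want to sleep".split()
tables = build_tables(corpus)
print(backoff_predict(tables, "we really want to"))  # falls back to the trigram table
print(backoff_predict(tables, "xyz"))                # falls back to unigrams
```

The returned order n shows which table produced the prediction, which is handy for checking how often the model has to back off.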

Model Performance and Follow-ups

  • I tested my model using the 4-word phrases (4-grams) generated from the testing data, and the performance of the model is very good on those common phrases.

  • Among the 1,000 most frequent 4-grams, the average accuracy is 96%; among the 10,000 most frequent, it is 85%; and across all 4-grams that appear more than 3 times (over 100,000 4-grams), it is still above 65%.

  • However, for rare 4-grams (frequency = 1), the accuracy drops to about 25%.

  • Follow-ups: The model can be trained on a more comprehensive corpus, which should further improve its performance.
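
An accuracy check of the kind reported above might look like the sketch below. It is written in Python under several assumptions that the slides do not confirm: each distinct 4-gram counts once, only the top prediction is scored, and accuracy_by_rank, predict_fn, and the toy data are hypothetical names and values introduced for the example.

```python
from collections import Counter

def accuracy_by_rank(test_tokens, predict_fn, top):
    """Top-1 accuracy over the `top` most frequent 4-grams in the test tokens.

    predict_fn takes a 3-word tuple and returns the predicted next word.
    """
    fourgrams = Counter(
        tuple(test_tokens[i:i + 4]) for i in range(len(test_tokens) - 3)
    )
    selected = fourgrams.most_common(top)
    if not selected:
        return 0.0
    # A hit means the model's best guess equals the true fourth word.
    hits = sum(1 for gram, _ in selected if predict_fn(gram[:3]) == gram[3])
    return hits / len(selected)

# Toy test data and a stand-in predictor, assumed for illustration only.
test_tokens = "a b c d a b c d a b c e".split()
lookup = {("a", "b", "c"): "d"}
score = accuracy_by_rank(test_tokens, lookup.get, top=5)
print(score)
```

Calling it with increasing values of top (for example 1,000, then 10,000) would give per-tier figures comparable in spirit to those quoted above.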