2023-05-17

What is the next word?

When we type text, our devices often suggest the next word based on what we have typed so far. There are many ways to achieve this; this project uses an N-gram model to build an application that can run in an environment with limited resources.

Shiny App URL: https://belanello.shinyapps.io/NextWordPrediction/

Algorithm

The App's underlying data is a table of conditional probabilities of words given their histories. The Interpolation model scores each word by a weighted sum of its probabilities from the unigram up to the 4-gram tables. The Stupid back-off model simply searches for the history in the 4-gram table first; if there is no match, it searches the 3-gram table, and continues down to lower orders until it finds a match. The App shows users the three most likely words. The graphs below show the probability distribution of the words following the phrase ‘What’s your favorite …’.

(Note that the outputs omit UNK (unknown words) and EOS (end of sentence) as predicted words.)
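
The two lookup strategies described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the App's actual code: the toy probability tables, the interpolation weights and the function names are all hypothetical, and the back-off function follows the plain "first matching table wins" description given here, without any discount factor.

```python
from collections import defaultdict

# Hypothetical toy tables: TABLES[n] maps a history of n-1 words to
# {next_word: conditional probability}. All values are made up.
TABLES = {
    1: {(): {"the": 0.4, "UNK": 0.3, "color": 0.2, "food": 0.1}},
    2: {("favorite",): {"color": 0.6, "food": 0.4}},
    3: {("your", "favorite"): {"food": 0.7, "color": 0.3}},
    4: {},  # no 4-gram history is known in this toy example
}

# Assumed interpolation weights for unigram up to 4-gram (they sum to 1).
WEIGHTS = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

# Tokens that are never shown to the user as predictions.
SKIP = {"UNK", "EOS"}


def history(words, n):
    """Return the last n-1 words as a tuple, or None if the input is too short."""
    k = n - 1
    if k == 0:
        return ()
    if len(words) < k:
        return None
    return tuple(words[-k:])


def interpolation_scores(words):
    """Weighted sum of the probabilities found in every N-gram table."""
    scores = defaultdict(float)
    for n, weight in WEIGHTS.items():
        candidates = TABLES[n].get(history(words, n), {})
        for word, prob in candidates.items():
            scores[word] += weight * prob
    return dict(scores)


def stupid_backoff_scores(words):
    """Use the highest-order table that contains the history; ignore the rest."""
    for n in (4, 3, 2, 1):
        candidates = TABLES[n].get(history(words, n))
        if candidates:
            return dict(candidates)
    return {}


def top_k(scores, k=3):
    """Rank words by score, drop UNK/EOS, and keep the k best."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked if word not in SKIP][:k]


typed = ["what's", "your", "favorite"]
print(top_k(interpolation_scores(typed)))   # ['food', 'color', 'the']
print(top_k(stupid_backoff_scores(typed)))  # ['food', 'color']
```

In this toy run the back-off model only ever sees the two words stored in the matching 3-gram row, while interpolation can also surface a word that appears only in the unigram table.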

Accuracy

Accuracy was tested on 1,000 unseen sentences, checking for each predicted word whether:

  • the most likely word predicted by each model was equal to the actual word (within_1);
  • the actual word was among the top 3 words predicted by each model (within_3);
  • the actual word was among the top 5 words predicted by each model (within_5).
[Accuracy on test set]
Model           N.words  within_1  within_3  within_5
Interpolation   14271    18.57 %   31.48 %   37.3 %
Stupid_backoff  14271    4.63 %    32.42 %   38.6 %

The results show that the Stupid back-off model does slightly better on within_3 and within_5, but it tends to predict the exact word that appeared in the higher-order N-grams of the training data. That is why its most likely word (within_1) matches the actual word in unseen data far less often than that of the Interpolation model, which predicts more general words because it mixes all the N-gram orders.
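
The within_1, within_3 and within_5 criteria could be computed along the following lines. This is a minimal sketch that assumes each percentage is the number of hits divided by the number of predicted words (N.words), times 100; the function name and the toy data are hypothetical and are not taken from the test set.

```python
def within_k_accuracy(predictions, actual_words, k):
    """Percentage of positions whose actual word is among the top-k predictions.

    predictions: one ranked list of candidate words per position, best first.
    actual_words: the word that really followed at each position.
    """
    if len(predictions) != len(actual_words):
        raise ValueError("predictions and actual_words must have the same length")
    if not actual_words:
        return 0.0
    hits = sum(
        actual in ranked[:k]
        for ranked, actual in zip(predictions, actual_words)
    )
    return 100.0 * hits / len(actual_words)


# Made-up example: four positions, three ranked predictions each.
predictions = [
    ["the", "a", "my"],
    ["color", "food", "movie"],
    ["is", "was", "of"],
    ["blue", "red", "green"],
]
actual_words = ["the", "movie", "are", "red"]

print(within_k_accuracy(predictions, actual_words, 1))  # 25.0
print(within_k_accuracy(predictions, actual_words, 3))  # 75.0
```

Running the same function with k = 1, 3 and 5 over each model's ranked predictions would give one row of the table above.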

Usage

  • Simply type your text in the input box; predicted words appear below as you type. When you click one of the suggested words, it is added to your text.
  • To clear the input box, click the ‘CLEAR’ button.
  • To compare the two models, switch tabs.

Data