2023-11-27

Predicting the next word - Capstone

The goal of this challenging project is to develop a Shiny app that predicts the next word based on the previous three, two, or one words, similar to the SwiftKey word predictor.

The unique feature of my app is that it is bilingual. This demonstrates that the prediction model can be reused for different languages, depending on the training data set.

Here is the result of my journey.

The Predictive Model

  • The NLP model uses dictionaries of 4-grams, 3-grams, bigrams, and unigrams (bag of words) with their frequencies to predict the next word. These n-grams were built with Quanteda after tokenizing the text into words.

  • I experimented with the trade-off between performance and accuracy of the model by varying the sample size, stemming words, and removing punctuation and stopwords.

  • The predicted probability is the Maximum Likelihood Estimate of the n-gram in the training set, i.e. the count of the n-gram divided by the count of its first n-1 words.

  • If a sequence of words is not found in the higher-order n-gram model, the input is shortened to its last n-1 words and looked up in the next lower-order model (back-off strategy). No discounting method is implemented (backweight = 1).
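The lookup described in the bullets above can be sketched in a few lines. This is a minimal illustration in Python, not the R/Quanteda code behind the app: the function names build_ngram_tables and predict_next and the toy corpus are hypothetical, and the MLE is computed as the count of a follower divided by the total count of all followers of the same context.

```python
from collections import Counter, defaultdict


def build_ngram_tables(tokens, max_n=4):
    """Count n-grams of order 1..max_n, keyed by context (the first n-1 words)."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])  # empty tuple for unigrams
            word = tokens[i + n - 1]
            tables[n][context][word] += 1
    return tables


def predict_next(tables, text, max_n=4):
    """Return (word, MLE probability, n-gram order used), or None if nothing is known.

    Starts with the highest order and backs off to shorter contexts.
    No discounting is applied (back-off weight of 1).
    """
    words = text.lower().split()
    for n in range(max_n, 0, -1):
        context = tuple(words[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # input is too short for this order
        followers = tables[n].get(context)
        if followers:
            word, count = followers.most_common(1)[0]
            return word, count / sum(followers.values()), n
    return None


# Toy corpus, assumed for illustration only.
tokens = "the cat sat on the mat the cat sat on the sofa the cat ate".split()
tables = build_ngram_tables(tokens)

print(predict_next(tables, "the cat"))    # answered by the 3-gram table
print(predict_next(tables, "a dog sat"))  # backs off to the bigram table
print(predict_next(tables, "zebra"))      # backs off to the most frequent unigram
```

Because the back-off weight is 1, a probability taken from a lower-order table is not comparable to one from a higher-order table; the sketch therefore also returns the order that produced the answer.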

Performance Summary

The accuracy of the model could be improved by adding a smoothing technique such as Good-Turing or Kneser-Ney. Increasing the size of the training set should also improve the accuracy.

Final words

The app does not need any further explanation. It is easy to use.

Only words with a frequency greater than 1 were used, to reduce the size of the dictionaries and improve the user experience. Data tables were used for quick retrieval of data. Still, switching dictionaries between English and German takes some time when using the app.

Try it for yourself at
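The effect of dropping entries that occur only once can be sketched as follows. This is a Python illustration under assumptions, not the app's R code: the helper names count_bigrams, prune_singletons and to_lookup and the toy corpus are hypothetical, and a plain dict keyed by the context word stands in for the keyed data tables.

```python
from collections import Counter


def count_bigrams(tokens):
    """Count adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))


def prune_singletons(counts, min_count=2):
    """Keep only entries seen at least min_count times (frequency > 1 by default)."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


def to_lookup(counts):
    """Index bigrams by their first word, followers sorted by descending count."""
    lookup = {}
    for (context, word), c in counts.items():
        lookup.setdefault(context, []).append((word, c))
    for followers in lookup.values():
        followers.sort(key=lambda pair: -pair[1])
    return lookup


# Toy corpus, assumed for illustration only.
tokens = "the cat sat on the mat the cat sat on the sofa the cat ate".split()
counts = count_bigrams(tokens)
pruned = prune_singletons(counts)
lookup = to_lookup(pruned)

print(len(counts), "bigrams before pruning,", len(pruned), "after")
print(lookup.get("the"))
```

The trade-off is that any context seen only once disappears from the table, so the predictor has to back off to a shorter context for it.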

https://tmq-project.shinyapps.io/WordPredictor/

Thanks and have a good day!