NeXtWoRd

Christos Tsolkas
November 15, 2020

NeXtWoRd - English word prediction app

NeXtWoRd is an application that tries to predict the next word of an English phrase.

The application has a simple and intuitive user interface.

The application consists of three parts:

  1. A language model (this part includes a heavy data pre-processing step)
  2. A prediction algorithm (a deterministic way of predicting the next word)
  3. A Shiny application (the user interface)

Language model

  • We were supplied with the Capstone Dataset, which contains Twitter, news and blog data for four locales: en_US, de_DE, ru_RU and fi_FI. Our model uses only the English (en_US) corpus.

  • We sampled 20% of the data and applied profanity filtering and text normalization to the sample.

  • We then built an n-gram language model, which generalizes the bigram model (which looks one word into the past) to look n−1 words into the past. For better accuracy we generated n-grams of every order from 1-grams (unigrams) up to 5-grams.
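
As a rough illustration of the normalization and n-gram counting steps above, here is a minimal sketch in Python. It is not the code behind the app: the function names `normalize` and `count_ngrams`, the regular expression, and the choice to drop banned words one by one (rather than, say, whole lines) are all assumptions made for this example.

```python
import re
from collections import Counter

def normalize(text, banned=frozenset()):
    """Lowercase, keep runs of letters/apostrophes, drop banned words.

    The `banned` set stands in for a profanity list (hypothetical here).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in banned]

def count_ngrams(sentences, max_n=5, banned=frozenset()):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = normalize(sentence, banned)
        for n in range(1, max_n + 1):
            # Slide a window of width n over the token list.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus, invented for illustration only.
corpus = ["I love New York", "I love pizza", "I love New ideas!"]
counts = count_ngrams(corpus, max_n=3)
print(counts[2][("i", "love")])         # 3
print(counts[3][("i", "love", "new")])  # 2
```

Sorting each counter by frequency (for example with `counts[n].most_common()`) gives the kind of sorted n-gram table that a prediction step can look up.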

Prediction Algorithm

  • We implemented the Stupid Backoff algorithm. According to this algorithm, if a higher-order n-gram has a zero count, we simply back off to a lower-order n-gram, weighted by a fixed (context-independent) weight (the creators of the algorithm found that a value of 0.4 works well in practice). The backoff terminates in the unigram frequency counts, so it always gives a prediction (as a last resort, it returns the most frequent unigrams).

  • The accuracy of the prediction algorithm is low when we predict only one word (around 15%-20%), so we allow several predicted words to raise the chance that the correct word is among them (35% with five predictions). To measure accuracy we used a test set randomly sampled from our data (covering all source types: Twitter, news and blogs).
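
The backoff scoring and the "is the true word among the top k predictions" measurement can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the app's implementation: the names `build_counts`, `predict` and `top_k_accuracy` are hypothetical, the toy corpus is invented, ties are broken alphabetically, and the linear scan over all n-grams is chosen for brevity rather than speed.

```python
from collections import Counter

ALPHA = 0.4  # fixed, context-independent backoff weight

def build_counts(token_lists, max_n=3):
    """Count all n-grams of order 1..max_n in one Counter keyed by tuple."""
    counts = Counter()
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(counts, context, k=5, max_n=3):
    """Return up to k (word, score) pairs ranked by Stupid Backoff score."""
    total = sum(c for g, c in counts.items() if len(g) == 1)
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    scores, weight = {}, 1.0
    while True:
        denom = counts[context] if context else total
        if denom > 0:
            for gram, c in counts.items():
                if len(gram) == len(context) + 1 and gram[:-1] == context:
                    # Keep the score from the highest order that saw the word.
                    scores.setdefault(gram[-1], weight * c / denom)
        if not context:
            break  # unigram level reached: the backoff always terminates here
        context = context[1:]  # drop the oldest word
        weight *= ALPHA        # penalize each backoff step
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]

def top_k_accuracy(counts, test_pairs, k=5, max_n=3):
    """Fraction of (context, target) pairs whose target is in the top k."""
    hits = sum(
        target in [w for w, _ in predict(counts, ctx, k, max_n)]
        for ctx, target in test_pairs
    )
    return hits / len(test_pairs)

# Toy data, invented for illustration only.
train = [["i", "love", "new", "york"], ["i", "love", "pizza"],
         ["i", "love", "new", "ideas"]]
counts = build_counts(train, max_n=3)
print([w for w, _ in predict(counts, ["i", "love"], k=3)])
```

On this toy corpus, "new" and "pizza" are scored from trigram counts, while the third suggestion comes from the unigram level with weight 0.4 × 0.4. Calling `top_k_accuracy` with a larger `k` can only keep or raise the hit rate, which is the effect the accuracy figures above describe.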

Shiny application

  • The sorted tables of each n-gram model and the prediction algorithm are stored and made available in a Shiny application.

  • The application uses a simple yet intuitive interface with a text area for entering an English phrase and a table showing the next-word predictions. The user can choose how many predictions are shown and click a table row to auto-fill the selected prediction.

  • Check it out yourself at: NeXtWord App and remember to have fun!

    • Please wait until the data are fully loaded (loading takes approximately 3 to 5 seconds; watch the progress bar in the bottom-right corner of your browser window).