Christos Tsolkas
November 15, 2020
NeXtWoRd is an application that tries to predict the next word of an English phrase.
The application has a simple and intuitive user interface.
The application consists of three parts:
We were supplied with the Capstone Dataset, which contains Twitter, news and blog data for four locales: en_US, de_DE, ru_RU and fi_FI. Our model focuses on the English (en_US) corpus.
We sampled 20% of the data and applied profanity filtering and text normalization to the sample.
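The sampling and cleaning step can be sketched as follows. This is a minimal illustration in Python, not the application's actual code: the function names, the random seed, the tiny placeholder profanity list and the exact normalization rules (lowercasing, keeping only letters and apostrophes) are all assumptions.

```python
import random
import re

# Hypothetical placeholder list; a real filter would load a full profanity lexicon.
PROFANITY = {"darn", "heck"}

def sample_lines(lines, fraction=0.2, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def normalize(line):
    """Lowercase, replace anything but letters/apostrophes with spaces, drop profane tokens."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)
    return [token for token in line.split() if token not in PROFANITY]

if __name__ == "__main__":
    corpus = ["Well, DARN it! 2 cats", "Have a nice day", "See you soon"]
    for raw in sample_lines(corpus, fraction=1.0):
        print(normalize(raw))
```

Sampling per line with a fixed seed keeps the sample reproducible; the fraction of 0.2 mirrors the 20% mentioned above.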
We then employed an n-gram language model. It generalizes the bigram model (which looks one word into the past) to the n-gram model (which looks n−1 words into the past). For better accuracy, we generated all n-grams from 1-grams (unigrams) up to 5-grams.
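To make the n-gram tables concrete, here is a small sketch of how counts for orders 1 through 5 might be collected from tokenized sentences. The function name `count_ngrams` and the data layout (one counter per order, keyed by word tuples) are illustrative assumptions, not the application's actual storage format.

```python
from collections import Counter

def count_ngrams(sentences, max_n=5):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n over the sentence.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

if __name__ == "__main__":
    tables = count_ngrams([["i", "love", "you"], ["i", "love", "it"]])
    # Most frequent bigrams first, as in a sorted n-gram table.
    print(tables[2].most_common())
```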
We implemented the Stupid Backoff algorithm. According to this algorithm, if a higher-order n-gram has a zero count, we simply back off to a lower-order n-gram, weighted by a fixed (context-independent) factor; the creators of the algorithm found that a value of 0.4 works well in practice. The backoff terminates at the unigram frequency counts, so it always gives a prediction (as a last resort, the most frequent unigrams).
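The backoff rule described above can be sketched like this. It is a simplified illustration under stated assumptions: the names `build_counts` and `predict` are hypothetical, all n-gram counts live in one in-memory counter, and candidates are found by a linear scan, which would be too slow for a real corpus. A word keeps the score from the highest order at which it was seen; each step down to a shorter context multiplies the weight by 0.4.

```python
from collections import Counter

ALPHA = 0.4  # fixed backoff weight mentioned in the text

def build_counts(sentences, max_n=5):
    """Count every n-gram (as a tuple) of order 1..max_n in one Counter."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(counts, context, k=5, max_n=5):
    """Return up to k (word, score) pairs, best first, using backoff scoring."""
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    total_unigrams = sum(c for gram, c in counts.items() if len(gram) == 1)
    scores = {}
    weight = 1.0
    while True:
        order = len(context) + 1
        denom = counts[context] if context else total_unigrams
        if denom > 0:
            for gram, c in counts.items():
                if len(gram) == order and gram[:-1] == context:
                    # setdefault keeps the score from the highest order seen.
                    scores.setdefault(gram[-1], weight * c / denom)
        if not context:
            break  # unigram level reached: backoff terminates here
        context = context[1:]  # drop the oldest word
        weight *= ALPHA
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

if __name__ == "__main__":
    data = [["i", "love", "you"], ["i", "love", "it"], ["i", "love", "you"]]
    print(predict(build_counts(data), ["i", "love"], k=3))
```

Because the loop always reaches the empty context, an unseen phrase still returns the most frequent unigrams, scaled by the accumulated backoff weight.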
The accuracy of the prediction algorithm is low when we predict only one word (around 15%-20%), so we allow more predicted words to raise it (35% with five predictions). To measure accuracy, we used a test set randomly sampled from our data (from all source types).
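One way such a top-k accuracy could be computed is sketched below. The exact evaluation protocol is not given here, so this is an assumption: every word after the first in each test sentence is treated as a target, and a prediction counts as a hit when the target appears among the k suggestions. The `predict_fn` argument and the toy predictor are hypothetical stand-ins for the real model.

```python
def top_k_accuracy(predict_fn, test_sentences, k=5):
    """Fraction of target words found among the top-k predictions."""
    hits = 0
    total = 0
    for tokens in test_sentences:
        for i in range(1, len(tokens)):
            guesses = predict_fn(tokens[:i], k)  # predict from the preceding words
            hits += tokens[i] in guesses
            total += 1
    return hits / total if total else 0.0

def toy_predictor(context, k):
    """Hypothetical stand-in model: always suggests the same ranked words."""
    return ["the", "a", "you"][:k]

if __name__ == "__main__":
    test_set = [["i", "love", "you"], ["see", "the", "cat"]]
    for k in (1, 3):
        print(k, top_k_accuracy(toy_predictor, test_set, k))
```

Even with this toy predictor, accuracy rises as k grows, which is the effect behind offering several predictions instead of one.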
The sorted tables of each n-gram model and the prediction algorithm are stored and made available in a Shiny application.
The application uses a simple yet intuitive interface with a text area for entering an English phrase and a table showing the next-word predictions. The user can choose how many predictions are shown and click a row in the table to auto-fill the text area with that row's prediction.
Check it out yourself at the NeXtWord App, and remember to have fun!