Shiny App for Next Word Prediction
https://elenena810.shinyapps.io/word_predictor/
Coursera's Data Science Specialisation CAPSTONE Project

author: Elena Civati
2021/07/05

  • 790,000 English words recognized, in 50 million combinations
  • 24,000 German words and 47,000 Finnish words recognized
  • Mean accuracy: 46.52%
  • Mean response time: 0.10 seconds -> real-time prediction
  • Punctuation handled and automatic capitalisation applied
  • Works on desktop and mobile devices
  • Profanity filter included

Features

  • My app is very simple to use: it consists of a web page with an input box where the user types, pastes or modifies an English sentence, punctuation included, and immediately gets a prediction of the next word.
  • German and Finnish words can also be included, but prediction will be less accurate (useful for foreign words inside English sentences, not for sentences written entirely in a foreign language).
  • Prediction is based on a dataframe search of the previous 4 words. If that is not possible (because fewer words have been typed, or because the app doesn't recognize that combination of words), the previous 3, then 2, then 1 words are used instead; for foreign languages, prediction uses only the previous word by default. See the sketch after this list.
  • When prediction fails (e.g. a non-existent word is typed), the word "the" is predicted, as it is the most common word in the corpora used to create the algorithm.
  • Predicted words have correct capitalisation (based on grammar and punctuation rules), and when a dirty word is predicted it is automatically replaced with the tag <badword>.
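
As a rough illustration of the backoff logic described above, here is a minimal sketch in R; the toy tables, their contents and the function name predict_next are hypothetical placeholders, not the app's actual objects (the real app works on integer-encoded words, as described under Technical details):

```r
library(data.table)

# Toy lookup tables (hypothetical, not the app's real objects): each maps a
# context of k preceding words to the most frequent next word seen after it.
ngram2 <- data.table(context = c("of the", "in the"),
                     prediction = c("same", "world"), key = "context")
ngram1 <- data.table(context = c("the", "a"),
                     prediction = c("first", "lot"), key = "context")

# Back off from the longest available context to shorter ones; if every
# lookup fails, return "the", the most common word in the corpora.
predict_next <- function(words, tables) {
  for (k in rev(seq_along(tables))) {
    if (length(words) >= k) {
      ctx <- paste(tail(words, k), collapse = " ")
      hit <- tables[[k]][.(ctx), prediction, nomatch = NULL]
      if (length(hit) > 0) return(hit[1])
    }
  }
  "the"
}

predict_next(c("a", "walk", "in", "the"), list(ngram1, ngram2))  # "world"
```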

Examples

(Screenshots of four example predictions)

The total size of the files required to run the app is 308 MB, so it's easy to share with other RStudio users.
When running, it requires 800 MB of RAM, as reported by the shinyapps.io metrics.

To test the app, I used 853,900 HC Corpora sentences that were not used to train the model, and for each one I predicted the last word.
Mean prediction time in this test (run in a non-interactive R session) was 0.0982 seconds.
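
A minimal sketch of this kind of evaluation loop, assuming a prediction function like predict_next above (this is not the app's actual test code):

```r
# Hypothetical evaluation loop: for each held-out sentence, predict its last
# word from the preceding ones, recording correctness and prediction time.
evaluate <- function(test_sentences, predict_fun) {
  rows <- lapply(test_sentences, function(s) {
    words <- strsplit(tolower(s), "\\s+")[[1]]
    if (length(words) < 2) return(NULL)
    target  <- words[length(words)]
    t0      <- Sys.time()
    pred    <- predict_fun(head(words, -1))
    elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
    data.frame(correct = identical(tolower(pred), target), seconds = elapsed)
  })
  res <- do.call(rbind, rows)
  c(accuracy = mean(res$correct), mean_seconds = mean(res$seconds))
}

# e.g. evaluate(c("a walk in the park"),
#               function(w) predict_next(w, list(ngram1, ngram2)))
```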

Accuracy increases with the number of typed words. In detail:

  • with 1 word: 10.48%
  • with 2 words: 22.66%
  • with 3 words: 35.87%
  • with 4 words: 49.20%

Technical details

To build the app, the starting point was processing a training set containing 70% of the HC Corpora and obtaining, for each (n-1)-gram, the most frequent n-gram (with n from 2 to 5).
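
A minimal sketch of that preprocessing step, shown here for n = 2 only and with a toy in-memory corpus standing in for the real training set, could look like this:

```r
library(data.table)

# Toy corpus standing in for the 70% HC Corpora training set.
corpus <- c("the cat sat on the mat", "the cat ate the fish")

# Build all 2-grams, then keep the most frequent next word for each 1-word context.
tokens  <- strsplit(tolower(corpus), "\\s+")
bigrams <- rbindlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.table(context = head(w, -1), nextword = tail(w, -1))
}))
freq <- bigrams[, .N, by = .(context, nextword)]
best <- freq[order(-N), .SD[1], by = context]  # most frequent n-gram per (n-1)-gram
setkey(best, context)
best
```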

The key elements of my final algorithm are:

  • Conversion of words into integers
  • Clever merging of the lookup dataframes for 4-, 3-, 2- and 1-word contexts into a single table
  • This table was split into 225 smaller tables to speed up the query
  • Some of those tables are always loaded into RAM, others are loaded only when needed
  • Objects are saved in RDS file format
  • Use of the data.table package for searching (see the sketch after this list)
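
To make those points concrete, here is a minimal sketch of how integer encoding, the split into RDS files and on-demand loading could fit together; the vocabulary, file names and table layout below are illustrative, not the app's actual structure:

```r
library(data.table)

# 1. Integer encoding: each vocabulary word gets an integer ID, so contexts
#    are stored and compared as integers rather than strings.
vocab   <- c("the", "cat", "sat", "on", "mat")       # illustrative vocabulary
word_id <- setNames(seq_along(vocab), vocab)

# 2. One of the smaller lookup tables, keyed on the encoded context word
#    (the real app splits its single combined table into 225 such pieces,
#    each saved as a separate RDS file).
piece <- data.table(last_word  = as.integer(word_id[c("the", "on")]),
                    prediction = c("cat", "the"),
                    key = "last_word")
saveRDS(piece, "ngrams_001.rds")

# 3. Pieces are loaded on demand and cached, so only the ones actually used
#    occupy RAM; data.table's keyed binary search keeps the query fast.
cache <- new.env()
get_piece <- function(idx) {
  nm <- sprintf("%03d", idx)
  if (!exists(nm, envir = cache, inherits = FALSE))
    assign(nm, readRDS(sprintf("ngrams_%s.rds", nm)), envir = cache)
  get(nm, envir = cache, inherits = FALSE)
}

get_piece(1)[.(word_id[["on"]]), prediction, nomatch = NULL]   # "the"
```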

I encourage you to visit https://github.com/Elenena/NLP_Capstone, where you can find more detailed explanations in the README file, download the test results and look at my entire R code.