August 26, 2018

Next-Word Prediction Model

  • Data sets - SwiftKey provided the data sets, drawn from blogs, news, and Twitter sources;

  • Pre-processing - the text was pre-processed with the usual steps of a text-analysis pipeline: conversion to lower case, removal of punctuation, and removal of stop words and profanity;

  • Tokenization, DFMs, N-grams - split the text into tokens, one per word (tokenization); built a document-feature matrix (DFM) holding the frequency of each token; and created sequences of two, three, and four consecutive tokens (N-grams) to improve the prediction model;

  • The model - a sequence of R functions that take the input (a word or phrase), pre-process it, run it through the different N-grams, and return the word that most frequently follows the input.
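
The pre-processing and N-gram counting steps above can be sketched as follows. This is a minimal illustration in Python, not the R functions used in the project. The word lists `STOP_WORDS` and `PROFANITY` and the helper names `preprocess` and `ngram_counts` are hypothetical, chosen only for this example.

```python
import re
from collections import Counter

# Hypothetical, tiny word lists; the lists actually used are not given in the slides.
STOP_WORDS = {"the", "a", "an", "of", "to"}
PROFANITY = {"darn"}


def preprocess(text):
    """Lower-case the text, strip punctuation, and drop stop and profanity words."""
    text = text.lower()
    # Replace everything except letters and whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [t for t in text.split() if t not in STOP_WORDS and t not in PROFANITY]


def ngram_counts(docs, n):
    """Count every run of n consecutive tokens across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = preprocess(doc)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    docs = ["The cat sat.", "A cat sat down!"]
    print(ngram_counts(docs, 2).most_common(3))
```

Calling `ngram_counts` with `n` set to 2, 3, and 4 would give frequency tables comparable to the two-, three-, and four-token N-grams described above.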

Next-Word Prediction App

  • How to use - the user enters a word or phrase in the corresponding field;

  • The algorithm - the input is pre-processed and matched against the available N-gram data sets. The app returns the word that most frequently follows the input;

  • Link to the app on shinyapps.io:

https://kaloatanasov.shinyapps.io/NextWordPredictApp_v2/

Note: A snapshot of the app, with detailed instructions, is available on the next slide.
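
One way to picture the matching step is shown below, again in Python, not the app's R code. The slides do not say in which order the N-grams are consulted, so trying the longest N-gram first and falling back to shorter ones is an assumption. The table `NGRAMS`, its counts, and the function `predict_next` are hypothetical.

```python
from collections import Counter

# Hypothetical N-gram counts keyed by token tuples (made-up numbers).
NGRAMS = {
    2: Counter({("thank", "you"): 5, ("thank", "goodness"): 1}),
    3: Counter({("really", "love", "pizza"): 4, ("really", "love", "you"): 3}),
}


def predict_next(tokens, ngrams=NGRAMS):
    """Return the most frequent word following the input, or None if nothing matches.

    Assumption: the longest N-gram is tried first, then shorter ones.
    """
    for n in sorted(ngrams, reverse=True):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # input is too short for this N-gram size
        # Collect the final word of every N-gram whose prefix equals the context.
        candidates = {g[-1]: c for g, c in ngrams[n].items() if g[:-1] == context}
        if candidates:
            return max(candidates, key=candidates.get)
    return None


if __name__ == "__main__":
    print(predict_next(["really", "love"]))
    print(predict_next(["so", "thank"]))
```

In this sketch the input is assumed to be already pre-processed into a list of tokens, so the same cleaning applied to the training text would need to run on the user's input first.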

Next-Word Prediction App Instructions

JHU and SwiftKey credits

  • Johns Hopkins University - Professors Roger D. Peng, PhD, Brian Caffo, PhD, and Jeff Leek, PhD, and their team were very helpful in providing checkpoints and milestones that helped us (the students) keep track of our Capstone Project progress;

  • SwiftKey - the company was a great help in providing one of the most important ingredients of Data Science - the data sets.

Personal note: The willingness and dedication of everyone mentioned above to help develop new, knowledgeable Data Scientists is highly appreciated!