PredictNextWord ShinyApp Presentation

KKher
9/19/2020

Project Overview

The goal is to build a product that predicts the next word from a user's input and to provide an interface that others can use.

The data used in this application comes from HC Corpora and can be downloaded directly from here.

HC Corpora is a collection of corpora for various languages (mainly English, German, Finnish, and Russian) that is freely available to download. The corpora were collected from numerous different webpages.

Preprocessing & Modelling

  • Several R packages were used (mainly tm, quanteda, igraph, and ggraph)
  • The data was sampled: only 10% of it was used to build the model
  • Kneser–Ney smoothing is used to calculate the probability distribution of n-grams in a document based on their histories. The discount value used in this project is 0.75. More about this technique can be found here
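
The smoothing step above can be illustrated with a minimal sketch. This is not the project's code (the project was built in R). It is a small Python illustration of interpolated Kneser–Ney for bigrams only, using the 0.75 discount stated above. The class name `KneserNeyBigram`, its methods, whitespace tokenization, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict

DISCOUNT = 0.75  # discount value stated in the slides


class KneserNeyBigram:
    """Illustrative interpolated Kneser-Ney bigram model (hypothetical helper)."""

    def __init__(self, sentences, discount=DISCOUNT):
        self.discount = discount
        self.bigram_counts = Counter()
        for sentence in sentences:
            # Assumption: lowercase + whitespace split stands in for real tokenization.
            tokens = sentence.lower().split()
            self.bigram_counts.update(zip(tokens, tokens[1:]))

        self.context_totals = Counter()      # c(prev): bigram tokens starting with prev
        self.followers = defaultdict(set)    # distinct words seen after prev
        self.preceders = defaultdict(set)    # distinct words seen before word
        for (prev, word), count in self.bigram_counts.items():
            self.context_totals[prev] += count
            self.followers[prev].add(word)
            self.preceders[word].add(prev)
        self.bigram_types = len(self.bigram_counts)

    def continuation_prob(self, word):
        """Share of distinct bigram types that end in `word`."""
        if self.bigram_types == 0:
            return 0.0
        return len(self.preceders.get(word, ())) / self.bigram_types

    def prob(self, prev, word):
        """P_KN(word | prev) = discounted bigram estimate + weighted continuation."""
        total = self.context_totals.get(prev, 0)
        if total == 0:
            # Unseen context: fall back to the continuation distribution alone.
            return self.continuation_prob(word)
        count = self.bigram_counts.get((prev, word), 0)
        discounted = max(count - self.discount, 0.0) / total
        # Probability mass removed by discounting, redistributed via continuation.
        backoff_weight = self.discount * len(self.followers[prev]) / total
        return discounted + backoff_weight * self.continuation_prob(word)

    def predict(self, prev, k=10):
        """Return up to k candidate next words, most probable first."""
        candidates = list(self.preceders.keys())
        # Ties are broken alphabetically so the ranking is deterministic.
        candidates.sort(key=lambda w: (-self.prob(prev, w), w))
        return candidates[:k]


# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
]
model = KneserNeyBigram(corpus)
print(model.predict("the", k=10))
```

The continuation term ranks a word by how many distinct contexts it follows rather than by raw frequency, which is what distinguishes Kneser–Ney from simple absolute discounting. A full model would extend the same recursion to higher-order n-grams.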

Application

The PredictNextWord application provides the user with:

  • The top 10 word predictions after the user submits input
  • A graph of words related to the user's input

The PredictNextWord application can be reached through this link.

Notes

  • More training data should improve the predictions; only 10% of the corpus was used here
  • The prediction algorithm can be enhanced; newer and more efficient algorithms should be followed closely
  • Additional information is available in this GitHub repository