PredictNextWord ShinyApp Presentation

KKher
9/19/2020

Project Overview

The goal is to build a product that predicts the next word and to provide an interface that others can use.

The data used in this application comes from HC Corpora and can be downloaded directly from here.

HC Corpora is a collection of corpora for various languages (mainly English, German, Finnish and Russian) that is freely available to download. The corpora were collected from numerous different webpages.

Preprocessing & Modelling

Several R packages were used, mainly tm, quanteda, igraph, and ggraph.
The data was sampled: only 10% of it was used to build the model.
Kneser–Ney smoothing is used to calculate the probability distribution of n-grams in a document based on their histories. The discount value used in this project is 0.75. More about this technique can be found here.
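
To make the smoothing step concrete, here is a minimal sketch of interpolated Kneser–Ney for bigrams with the 0.75 discount mentioned above. The project itself is built in R; Python is used here purely for illustration, and the function names (train_bigram_kn, kn_prob) and the toy sentence are assumptions, not the application's actual code.

```python
from collections import Counter, defaultdict

DISCOUNT = 0.75  # discount value stated in the presentation


def train_bigram_kn(tokens):
    """Collect the counts needed for interpolated Kneser-Ney over bigrams."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history_count = Counter()     # c(v): how often v occurs as a history
    followers = defaultdict(set)  # distinct words seen after v
    preceders = defaultdict(set)  # distinct words seen before w
    for (v, w), n in bigrams.items():
        history_count[v] += n
        followers[v].add(w)
        preceders[w].add(v)
    return bigrams, history_count, followers, preceders


def kn_prob(w, v, model, d=DISCOUNT):
    """P_KN(w | v) = max(c(v,w) - d, 0) / c(v) + lambda(v) * P_cont(w)."""
    bigrams, history_count, followers, preceders = model
    # Continuation probability: share of distinct bigram types that end in w.
    p_cont = len(preceders.get(w, ())) / len(bigrams)
    c_v = history_count.get(v, 0)
    if c_v == 0:
        # Unseen history: fall back to the continuation probability alone.
        return p_cont
    discounted = max(bigrams.get((v, w), 0) - d, 0) / c_v
    # lambda(v) redistributes the mass removed by discounting.
    lam = d * len(followers[v]) / c_v
    return discounted + lam * p_cont


# Toy corpus (hypothetical, for illustration only).
tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram_kn(tokens)
print(round(kn_prob("cat", "the", model), 4))
```

The discounted term removes a fixed 0.75 from every observed bigram count, and the freed probability mass is shared out according to how many different words each candidate has been seen to follow. The same idea extends recursively to trigrams and longer n-grams.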

Application

The PredictNextWord application gives the user:

The top 10 word predictions after submitting their input
A graph of words related to their input
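
The top-10 lookup can be sketched roughly as follows. This is a simplified illustration in Python, not the application's R code: it ranks candidates by raw bigram counts rather than smoothed probabilities, and the names build_followers and top_predictions are hypothetical.

```python
from collections import Counter, defaultdict


def build_followers(tokens):
    """Map each word to a Counter of the words observed right after it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table


def top_predictions(text, table, k=10):
    """Return up to k candidate next words for the last word of the input."""
    words = text.lower().split()
    if not words:
        return []
    # .get avoids inserting empty entries for unseen words.
    candidates = table.get(words[-1], Counter())
    return [word for word, _ in candidates.most_common(k)]


# Toy corpus (hypothetical, for illustration only).
tokens = "the cat sat on the mat the cat ran".split()
table = build_followers(tokens)
print(top_predictions("I saw the", table))
```

The (input word, predicted word, count) triples produced along the way could also serve as weighted edges for a word graph like the one the application displays.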

The PredictNextWord application can be reached through this link.

Notes

More training data should improve predictions; the current model uses only a 10% sample.
The prediction algorithm can be improved, and newer, more efficient algorithms should be followed closely.
Additional information is available in this GitHub repository.