Word Prediction

Gregor M.
2018-06-07

Overview

The main goal of this presentation is to explain a simple next-word prediction solution deployed on shinyapps.io.

The presentation covers:

An overview of the algorithm theory
Insights into the base data set
Usage of the Shiny app
Outlook

Katz's back-off model

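The idea behind Katz's back-off:

The probability of a word is estimated from counts in a base data set
The more often a word is followed by another word, the more likely that word will be predicted
This logic gets applied to sequences of words (n-grams); if a sequence was never seen in the base data, the model backs off to a shorter one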
Formula:

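$$
P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) =
\begin{cases}
d_{w_{i-n+1} \cdots w_i} \dfrac{C(w_{i-n+1} \cdots w_{i-1} w_i)}{C(w_{i-n+1} \cdots w_{i-1})} & \text{if } C(w_{i-n+1} \cdots w_i) > k \\
\alpha_{w_{i-n+1} \cdots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1}) & \text{otherwise}
\end{cases}
$$

where:

C(x) = number of times x appears in training
$w_i$ = ith word in the given context
n = order of the n-gram model
k = count threshold above which an n-gram is trusted (often 0)
d = discount factor (typically from Good-Turing estimation)
$\alpha$ = back-off weight that redistributes the discounted probability mass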
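The back-off rule can be illustrated for the simplest case, bigrams backing off to single words. This is only a minimal sketch under stated assumptions, not the code behind the app: it uses a fixed absolute discount of 0.5 instead of Good-Turing discounting, a threshold of k = 0, and the function names (`train_bigram_counts`, `katz_bigram_prob`, `predict_next`) are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count single words and, for every word, the words that follow it."""
    unigrams = Counter(tokens)
    followers = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        followers[prev][word] += 1
    return unigrams, followers


def katz_bigram_prob(word, prev, unigrams, followers, discount=0.5):
    """Back-off probability of `word` given the previous word `prev`.

    Assumption: a constant absolute discount replaces Good-Turing discounting.
    """
    total = sum(unigrams.values())
    seen = followers.get(prev)
    if not seen:
        # Context never observed: use the plain unigram estimate.
        return unigrams[word] / total

    context_count = sum(seen.values())
    if word in seen:
        # Seen bigram: discounted relative frequency.
        return (seen[word] - discount) / context_count

    # Unseen bigram: share the discounted mass (beta) among unseen words,
    # in proportion to their unigram probability.
    beta = discount * len(seen) / context_count
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen) / total
    if unseen_mass == 0:
        return 0.0
    alpha = beta / unseen_mass
    return alpha * unigrams[word] / total


def predict_next(prev, unigrams, followers, top_n=3):
    """Rank every known word by its back-off probability after `prev`."""
    scores = {w: katz_bigram_prob(w, prev, unigrams, followers) for w in unigrams}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy corpus, made up for illustration only.
tokens = "the cat sat on the mat the cat ran".split()
unigrams, followers = train_bigram_counts(tokens)
print(katz_bigram_prob("cat", "the", unigrams, followers))  # seen bigram
print(katz_bigram_prob("sat", "the", unigrams, followers))  # backed-off estimate
print(predict_next("the", unigrams, followers))
```

Because the mass removed from seen bigrams is handed to the unseen words, the probabilities over the whole vocabulary still sum to 1 for any observed context.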

Base data set

Data from Twitter, news feeds and blogs (Download)
Size overview:

Source	Documents	Words	File size
Blogs	899288	37546246	255.4 MB
News	77259	2674536	19.8 MB
Twitter	2360148	30093410	319 MB

To keep training efficient, the model is built from a random 1% sample of this data.
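Drawing such a sample could look like the sketch below. It is an illustration only: the helper name `sample_lines`, the fixed seed and the stand-in documents are assumptions, and the original sampling code is not shown in this presentation.

```python
import random


def sample_lines(lines, fraction=0.01, seed=42):
    """Return a reproducible random sample containing `fraction` of the lines."""
    rng = random.Random(seed)  # fixed seed so the sample can be regenerated
    sample_size = round(len(lines) * fraction)
    return rng.sample(lines, sample_size)


# Stand-in for the lines of one source file (made up for illustration).
documents = [f"document {i}" for i in range(10_000)]
subset = sample_lines(documents)
print(len(subset))
```

Sampling without replacement keeps every document at most once, and the fixed seed makes the training subset repeatable between runs.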

Usage of the Shiny app

App location: https://gregormatheis.shinyapps.io/text_prediction/
Enter your text in the input field on the left side
The app returns:
- The most likely next word
- The 3 most likely next words
- A word cloud plot with at most 10 words
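The three outputs can all be derived from one ranked list of candidate words. The sketch below shows one way to do that; the function name `summarize_predictions` and the example scores are hypothetical and do not come from the app itself.

```python
def summarize_predictions(scores, cloud_size=10):
    """Turn word -> score pairs into the three outputs listed above."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    words = [word for word, _ in ranked]
    return {
        "most_likely": words[0] if words else None,
        "top_three": words[:3],
        # Word cloud input: at most `cloud_size` words with their weights.
        "word_cloud": dict(ranked[:cloud_size]),
    }


# Made-up scores for illustration only.
example_scores = {"you": 0.40, "the": 0.25, "it": 0.20, "me": 0.15}
print(summarize_predictions(example_scores))
```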

Outlook

The algorithm was used to successfully answer the Coursera questions
Depending on the purpose, the model could be rebuilt without stop words
The algorithm works properly but has potential for improvement
Changing to a deep learning model. Next steps:
- Using a Long Short-Term Memory (LSTM) model (details)
- Moving the platform away from shinyapps.io due to limited resources