Word Prediction

Gregor M.
2018-06-07

Overview

This presentation explains a simple next-word prediction solution deployed on Shiny.io.

It covers:

  • An overview of the algorithm theory
  • Insights into the base data set
  • Usage of the Shiny app
  • Outlook

Katz's back-off model

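The idea behind Katz's back-off:

  • The probability of a word is estimated from counts in a base data set
  • The more often a word is followed by another word, the more likely that word is to be predicted
  • The same logic is applied to sequences of words (n-grams); if a longer sequence was not seen in training, the model backs off to a shorter one
  • Formula: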

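$$
P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) =
\begin{cases}
d_{w_{i-n+1} \cdots w_i} \, \dfrac{C(w_{i-n+1} \cdots w_{i-1} w_i)}{C(w_{i-n+1} \cdots w_{i-1})} & \text{if } C(w_{i-n+1} \cdots w_i) > k \\[1ex]
\alpha_{w_{i-n+1} \cdots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1}) & \text{otherwise}
\end{cases}
$$

where:

  • C(x) = number of times x appears in training
  • w_i = ith word in the given context
  • d = discount factor applied to observed n-gram counts
  • α = back-off weight that redistributes the discounted probability mass to the shorter context
  • k = count threshold above which the n-gram estimate is used directly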
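The back-off idea can be illustrated with a minimal Python sketch restricted to bigrams and unigrams. It is not the code behind the app: the function names, the toy sentence, and the fixed discount of 0.5 are assumptions made for illustration, and the fixed discount stands in for the Good-Turing discount used in Katz's original formulation.

```python
from collections import Counter

DISCOUNT = 0.5  # assumed absolute discount; Katz's original derives d from Good-Turing


def train(tokens):
    """Count unigrams and bigrams in a list of tokens."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def backoff_prob(word, prev, unigrams, bigrams, k=0):
    """P(word | prev): discounted bigram estimate, else weighted unigram estimate."""
    total = sum(unigrams.values())
    # Bigrams starting with `prev` whose count exceeds the threshold k.
    seen = {w: c for (p, w), c in bigrams.items() if p == prev and c > k}
    # Number of times `prev` occurs as a context (followed by any word).
    c_prev = sum(c for (p, _), c in bigrams.items() if p == prev)
    if c_prev == 0:
        return unigrams[word] / total  # unknown context: plain unigram estimate
    if word in seen:
        return (seen[word] - DISCOUNT) / c_prev
    # Probability mass freed by discounting, shared among the unseen words.
    left_over = 1.0 - sum((c - DISCOUNT) / c_prev for c in seen.values())
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen) / total
    if unseen_mass == 0:
        return 0.0
    alpha = left_over / unseen_mass  # back-off weight for this context
    return alpha * unigrams[word] / total


def predict(prev, unigrams, bigrams, n=3):
    """Return the n most probable next words after `prev`."""
    scores = {w: backoff_prob(w, prev, unigrams, bigrams) for w in unigrams}
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


tokens = "the cat sat on the mat the cat ran".split()
unigrams, bigrams = train(tokens)
print(predict("the", unigrams, bigrams, n=1))  # ['cat']
```

Because the mass removed from seen bigrams is handed to the unseen words through the back-off weight, the probabilities for a given context still sum to one.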

Base data set

  • Data from Twitter, news feeds, and blogs (Download)
  • Size overview:
    Source     Documents        Words   File size
    Blogs        899,288   37,546,246    255.4 MB
    News          77,259    2,674,536     19.8 MB
    Twitter    2,360,148   30,093,410    319 MB
  • To keep training fast and the model small, only a random 1% sample of the data is used
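A random 1% sample could be drawn along the following lines. This is a hedged sketch, not the author's preprocessing: the function name `sample_lines`, the fixed seed, and the assumption that the documents are already loaded as a list of lines are all illustrative.

```python
import random


def sample_lines(lines, fraction=0.01, seed=42):
    """Return a reproducible random subset containing `fraction` of the lines."""
    rng = random.Random(seed)          # fixed seed (assumed) so the sample is repeatable
    k = round(len(lines) * fraction)   # number of lines to keep
    return rng.sample(lines, k)


documents = [f"document {i}" for i in range(1000)]  # stand-in for the real corpus
subset = sample_lines(documents)
print(len(subset))  # 10
```

Fixing the seed keeps the sample stable between runs, which makes model comparisons easier to reproduce.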

Usage of the Shiny app

Outlook

  • The algorithm was used to successfully answer the Coursera questions
  • Depending on the purpose, the model could be rebuilt without stop words
  • The algorithm works properly but has potential for improvement
  • Next steps toward a deep learning model:
    • Using a Long Short-Term Memory (LSTM) model (details)
    • Moving away from shiny.io because of its limited resources
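The stop-word variant mentioned in the outlook would amount to filtering the tokens before the n-gram counts are built. A minimal sketch follows; the stop-word set and the function name are illustrative assumptions, and a real rebuild would use a complete stop-word list.

```python
# Illustrative subset only; not a complete stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words (case-insensitive) before counting n-grams."""
    return [t for t in tokens if t.lower() not in stop_words]


print(remove_stop_words("The cat sat on the mat".split()))  # ['cat', 'sat', 'on', 'mat']
```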