Predicting the next word

Jose A. Ruiperez Valiente
2016-04-17

Overview

What is it?

  • We use a Twitter dataset for training the algorithm
  • We clean the dataset using Corpus objects and calculate the most frequent N-gram combinations
  • We use those N-gram combinations to guess the most probable next word
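One common way to turn N-gram counts into a guess is a plain maximum-likelihood estimate. The slides do not state the exact formula used, so this is an assumption; here C(·) denotes how often a word sequence appears in the training data:

```latex
\hat{P}(w \mid w_1 \dots w_{n-1}) = \frac{C(w_1 \dots w_{n-1}\, w)}{C(w_1 \dots w_{n-1})},
\qquad
\hat{w} = \arg\max_{w} \hat{P}(w \mid w_1 \dots w_{n-1})
```

As a made-up illustration: if "thanks for the" appears 3 times in the data and is followed by "follow" in 2 of them, the estimate for "follow" is 2/3, and "follow" is the guess unless another word scores higher.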

User Interface

We kept the interface as simple as possible: a text field for entering text and a selection box for changing the N-gram complexity. We also added Instructions and About tabs for further information.


Algorithm

  1. Read Twitter data
  2. Clean the data using Corpus (remove punctuation, numbers, etc.)
  3. Create a TermDocumentMatrix using NGramTokenizer for 2-, 3-, and 4-grams
  4. Store the N-gram counts and load them later to calculate the probability of each candidate next word given the preceding words
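The steps above can be sketched in a few lines. This is not the Corpus / TermDocumentMatrix / NGramTokenizer pipeline named in the list. It is a plain Python illustration of the same idea, and every name in it (`clean`, `count_ngrams`, `predict`), the sample tweets, the JSON storage, and the fall-back to shorter contexts are assumptions made for the example.

```python
import json
import re
from collections import Counter, defaultdict


def clean(text):
    """Lowercase the text and drop punctuation and digits (step 2)."""
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    return text.split()


def count_ngrams(lines, sizes=(2, 3, 4)):
    """Count n-grams as: preceding words -> Counter of next words (step 3)."""
    table = defaultdict(Counter)
    for line in lines:
        tokens = clean(line)
        for n in sizes:
            for i in range(len(tokens) - n + 1):
                context = " ".join(tokens[i:i + n - 1])
                table[context][tokens[i + n - 1]] += 1
    return table


def predict(table, text, max_n=4):
    """Try the longest context first, then fall back to shorter ones."""
    tokens = clean(text)
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue
        context = " ".join(tokens[-(n - 1):])
        if context in table:
            word, count = table[context].most_common(1)[0]
            return word, count / sum(table[context].values())
    return None, 0.0


# Hypothetical sample data standing in for the Twitter dataset (step 1).
tweets = [
    "Thanks for the follow!",
    "Thanks for the RT :)",
    "thanks for the follow, have a great day 2day",
]
table = count_ngrams(tweets)

# Step 4: store the counts, then load them back before predicting.
stored = json.dumps(table)
loaded = {ctx: Counter(nxt) for ctx, nxt in json.loads(stored).items()}

print(predict(loaded, "Thanks for the"))
```

With these three sample tweets, "thanks for the" is followed by "follow" twice and "rt" once, so the call returns "follow" with an estimated probability of 2/3. The `max_n` argument plays the role of the N-gram complexity selector in the interface.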

References

Some references we consulted: