Predicting the next word

Jose A. Ruiperez Valiente
2016-04-17

Overview

What is it?

We use a Twitter dataset for training the algorithm
We clean the dataset using Corpus objects and calculate the most frequent N-gram combinations
We use those N-gram combinations to guess the most probable next word
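As a rough illustration of the idea above, the sketch below counts trigrams in a tiny made-up corpus and picks the word that most often follows a given two-word context. It is written in plain Python rather than with the tm and RWeka packages used in the project, and the sample tweets and the names `ngrams` and `next_word` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy stand-in for the cleaned Twitter text (made-up data).
tweets = [
    "thanks for the follow",
    "thanks for the retweet",
    "thanks for the follow back",
]

# Count every trigram across the toy corpus.
counts = Counter()
for tweet in tweets:
    counts.update(ngrams(tweet.split(), 3))

def next_word(context, counts):
    """Return the word seen most often after `context`, or None if unseen.

    `context` is a tuple holding the first n-1 words of an n-gram.
    """
    candidates = {gram[-1]: c for gram, c in counts.items() if gram[:-1] == context}
    return max(candidates, key=candidates.get) if candidates else None

print(next_word(("for", "the"), counts))
```

Here "follow" appears twice after "for the" and "retweet" once, so "follow" is the guess.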

User Interface

We kept the interface as simple as possible: just a text field to enter the text and a selection box to change the N-gram complexity. We added Instructions and About tabs for further information.


Algorithm

Read Twitter data
Clean data using Corpus (remove punctuation, numbers, etc.)
Create a TermDocumentMatrix using NGramTokenizer for 2-, 3-, and 4-grams
Store the resulting N-gram counts and load them later to calculate the probability of each candidate next word given the words typed so far
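The steps above can be sketched end to end as follows. This is a minimal standard-library Python analogue, not the project's R code: `clean` stands in for the Corpus cleaning step, `count_ngrams` for the TermDocumentMatrix built with NGramTokenizer, and JSON is an assumed storage format. All function names and the cleaning rules are assumptions made for illustration.

```python
import json
import re
from collections import Counter

def clean(text):
    """Lowercase the text and drop digits and punctuation (apostrophes kept)."""
    return re.sub(r"[^a-z' ]+", " ", text.lower()).split()

def count_ngrams(docs, sizes=(2, 3, 4)):
    """Map each n in `sizes` to a Counter of space-joined n-grams."""
    tables = {n: Counter() for n in sizes}
    for doc in docs:
        tokens = clean(doc)
        for n in sizes:
            for i in range(len(tokens) - n + 1):
                tables[n][" ".join(tokens[i:i + n])] += 1
    return tables

def save(tables, path):
    """Store the counts as JSON (keys become strings)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({str(n): dict(c) for n, c in tables.items()}, fh)

def load(path):
    """Load counts written by `save`, restoring integer keys."""
    with open(path, encoding="utf-8") as fh:
        return {int(n): Counter(c) for n, c in json.load(fh).items()}

def next_word_probs(tables, text, n):
    """Estimate P(word | last n-1 words) as a relative frequency.

    Returns an empty dict when the text is shorter than n-1 words
    or the context was never seen.
    """
    context = clean(text)[-(n - 1):]
    if len(context) < n - 1:
        return {}
    prefix = " ".join(context) + " "
    matches = {g[len(prefix):]: c for g, c in tables[n].items() if g.startswith(prefix)}
    total = sum(matches.values())
    return {word: c / total for word, c in matches.items()}
```

Calling `next_word_probs(tables, "Thanks for the", 3)` looks up trigrams that begin with "for the" and returns each continuation with its share of the matches; the `n` argument plays the role of the selection box in the interface.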

References

Some references we consulted:

tm package
N-gram wiki
Stack Overflow discussion about N-grams
RWeka