Word Prediction - Shiny app

Yann Claudel
22/03/2017

Give me the next word

Synopsis

The purpose is to implement an algorithm that predicts an upcoming word given the first words of a sentence.

The algorithm will predict a list of potential words based on a corpus.
This corpus is built from three input files.
The content comes from Twitter, blogs, and news.

The algorithm uses n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
The algorithm is based on the Markov assumption:

The probability of a word depends only on the k previous words.
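
As an illustration, the sketch below extracts n-grams from a token list and estimates the probability of a word given its k previous words from raw counts. It is a minimal Python example, not the code of the Shiny app: the function names `ngrams` and `next_word_probability`, the toy sentence, and the absence of any smoothing are all assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def next_word_probability(tokens, context, word):
    """Estimate P(word | context) from raw counts, without smoothing.

    Under the Markov assumption only the k = len(context) previous
    words matter, so the estimate is
    count(context followed by word) / count(context followed by anything).
    """
    context = tuple(context)
    k = len(context)
    # Contexts are counted on tokens[:-1] so that each counted context
    # is actually followed by a word.
    context_counts = Counter(ngrams(tokens[:-1], k))
    full_counts = Counter(ngrams(tokens, k + 1))
    seen = context_counts[context]
    if seen == 0:
        return 0.0  # context never observed in this toy corpus
    return full_counts[context + (word,)] / seen

# Toy corpus (hypothetical, for illustration only).
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngrams(tokens, 2)
p_cat_given_the = next_word_probability(tokens, ["the"], "cat")
```

Here "the" is followed by a word three times and by "cat" twice, so the estimate for "cat" after "the" is 2/3.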

The algorithm

The algorithm searches the 5-grams, 4-grams, 3-grams and 2-grams.

Raw frequencies are always highest in the 2-grams, then in the 3-grams, and so on, so the n-gram frequencies are weighted as follows:
The weight of a match in the n-grams equals the combined weight of all the matches in the (n-1)-grams.
So if a match is found in the 5-grams, its weighted frequency is higher than all the others and it is placed at the head of the list.
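
One possible reading of this weighting can be sketched in Python as follows. This is an assumption-laden illustration, not the app's actual implementation: `build_tables`, `predict`, the toy corpus, and the exact scoring rule (raw count plus the combined score of all lower-order matches) are hypothetical choices that merely guarantee that a match in a longer n-gram outranks every match in a shorter one.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=5):
    """Map each n in 2..max_n to {context tuple: Counter of next words}."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[n][tuple(context)][word] += 1
    return tables

def predict(tables, words, top=3):
    """Rank candidate next words, favouring matches in longer n-grams."""
    scores = {}
    base = 0  # combined score of every match found at lower orders
    for n in sorted(tables):
        if len(words) < n - 1:
            break  # not enough typed words to form this context
        context = tuple(words[-(n - 1):])
        level_total = 0
        for word, count in tables[n].get(context, {}).items():
            # Assumed rule: raw count plus the weight of all lower-order
            # matches, so this score beats any lower-order score.
            scores[word] = base + count
            level_total += base + count
        base += level_total
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]

# Toy corpus (hypothetical, for illustration only).
tokens = ("we park in the garage . we park in the garage . "
          "i like swimming in the sea").split()
tables = build_tables(tokens)
suggestions = predict(tables, "i like swimming in the".split())
```

With this toy corpus, "garage" is the most frequent continuation of "in the", but the 4-gram and 5-gram matches for "sea" carry more weight, so `suggestions` lists "sea" first. With only "in the" as input, "garage" comes first.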

The application

See the application
https://yclaudel.shinyapps.io/appNextWord

The next step

What about words that do not appear in the corpus?
How can the algorithm take the context, the meaning of the sentence, into account? If the sentence is “I like swimming in the …”, proposing “garage” because “in the garage” is the most frequent 3-gram is not very helpful.
What is the effect on the accuracy of the model if stopwords are removed, or if ends of sentences are taken into account?