Word Prediction - Shiny app

Yann Claudel
22/03/2017

Give me the next word

Synopsis

The purpose is to implement an algorithm that predicts an upcoming word given the first words of a sentence.

The algorithm predicts a list of candidate words based on a corpus.
The corpus is built from 3 input files.
Their content comes from Twitter, blogs, and news.

  • The algorithm uses n-grams: an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
  • The algorithm is based on the Markov assumption:

The probability of a word depends only on the k previous words.
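
As an illustration of the two points above, here is a minimal Python sketch of n-gram extraction and counting. It is not the app's code: the function names `ngrams` and `count_ngrams`, the whitespace tokenization, and the three-sentence toy corpus are all assumptions made for the example. Under the Markov assumption with k = 2, only the 3-gram counts are needed to score the word that follows a given pair of words.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count how often each n-gram occurs across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Assumption: lowercase + whitespace split stands in for real cleaning.
        tokens = sentence.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


# Toy corpus invented for this example (not the Twitter/blogs/news data).
toy_corpus = [
    "I like swimming in the sea",
    "the car is in the garage",
    "I left it in the garage",
]

trigram_counts = count_ngrams(toy_corpus, 3)
print(trigram_counts[("in", "the", "garage")])  # 2
print(trigram_counts[("in", "the", "sea")])     # 1
```

With k = 2, the candidates after "in the" are the last words of the 3-grams that start with ("in", "the"), ranked by their counts.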

The algorithm

The algorithm searches in the 5-grams, 4-grams, 3-grams, and 2-grams:

  • Raw frequencies are always highest in the 2-grams, then the 3-grams, and so on, so they cannot be compared directly across orders.
  • The frequencies of the n-grams are therefore weighted as follows:
  • The weight of a match in the n-grams is equal to the weight of all the matches in the (n-1)-grams.
  • So if a match is found in the 5-grams, its weighted frequency is higher than the others and it is placed at the head of the list.
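
The search and weighting described above can be sketched as follows. This is one possible reading of the weighting rule, not the app's actual implementation: each match found at order n gets its raw count plus the combined score of every match found at lower orders, which guarantees that a match in a longer n-gram outranks all matches in shorter ones. The names `build_tables` and `predict`, the whitespace tokenization, and the toy corpus are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_tables(sentences, max_n=5):
    """Map each order n to {context tuple of n-1 words: Counter of next words}."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables


def predict(tables, text, top=3):
    """Rank candidate next words, favoring matches in longer n-grams."""
    tokens = text.lower().split()
    scores = {}
    offset = 0  # combined score of every match found at lower orders
    for n in sorted(tables):  # 2-grams first, then 3-grams, and so on
        if len(tokens) < n - 1:
            break  # not enough typed words to form this context
        context = tuple(tokens[len(tokens) - (n - 1):])
        for word, count in tables[n].get(context, {}).items():
            # Assumed weighting: raw count lifted above all lower-order scores.
            scores[word] = offset + count
        offset = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top]]


# Toy corpus invented for this example.
toy_corpus = [
    "I like swimming in the sea",
    "the car is in the garage",
    "I left it in the garage",
]
tables = build_tables(toy_corpus)

print(predict(tables, "I like swimming in the"))  # ['sea', 'garage', 'car']
print(predict(tables, "it is in the"))            # ['garage', 'sea', 'car']
```

In the first call, "garage" is the most frequent continuation in the 2-grams and 3-grams, but "sea" is matched in the 4-grams and 5-grams and therefore moves to the head of the list.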

The application

The next steps

  • What about words that are not in the corpus?
  • How can the algorithm take into account the context, the meaning of the phrase? If the sentence is “I like swimming in the …”, proposing “garage” because “in the garage” is the most frequent 3-gram is not very helpful.
  • What is the effect on the accuracy of the model if the stopwords are removed, or if the ends of sentences are taken into account?