July 2, 2019

Application Description

The application is very simple.

  • Enter a sentence or a sequence of words in the white textbox
  • Press the "Predict" button
  • A prediction for the word that best continues the input appears after a few seconds on the right-hand side of the window.

Preprocessing

Text data from blogs, tweets, and news in English are used for this project. These three data sources are sampled and preprocessed as follows:

  • URLs are removed
  • Apostrophes are removed
  • Extra whitespace is stripped
  • Everything is converted to lowercase and to UTF-8 encoding
  • Numbers are removed
  • Punctuation is removed
  • Profanities are filtered out
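
The cleaning steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `preprocess`, the regular expressions, the order of the steps, and the placeholder `PROFANITIES` set are all hypothetical, and the sketch drops profane words one by one, which may differ from how the application filters them. Python strings are already Unicode, so the UTF-8 conversion is not shown.

```python
import re

# Hypothetical placeholder list; the list used by the application is not given.
PROFANITIES = {"darn", "heck"}

def preprocess(text, profanities=PROFANITIES):
    """Return the list of cleaned tokens for one piece of text."""
    text = text.lower()                                # convert to lowercase
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # remove URLs
    text = re.sub(r"['’]", "", text)                   # remove apostrophes
    text = re.sub(r"\d+", " ", text)                   # remove numbers
    text = re.sub(r"[^\w\s]|_", " ", text)             # remove punctuation
    tokens = text.split()                              # split() also drops extra whitespace
    return [t for t in tokens if t not in profanities]

print(preprocess("Don't visit http://example.com NOW, it's 42 darn times!"))
```

URLs are removed before punctuation so that the pieces of an address do not survive as separate words.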

Then, N-grams of length 1 to 5 are computed and saved in a .Rda file.
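
As a rough illustration of the counting step, the Python sketch below tallies every N-gram of length 1 to 5 in a list of tokens. The function name `count_ngrams` and the dictionary-of-counters layout are assumptions made for this example, and saving the result to disk is not shown.

```python
from collections import Counter

def count_ngrams(tokens, max_n=5):
    """Map each order n (1..max_n) to a Counter of n-gram tuples."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of n tokens over the text.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

counts = count_ngrams("the cat sat on the mat".split())
print(counts[1][("the",)])        # "the" occurs twice
print(counts[2][("the", "cat")])  # the bigram "the cat" occurs once
```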

Prediction

To find the word that best fits the input text, the application proceeds as follows:

  • Process the input text the same way the text data was preprocessed
  • Find the longest N-grams that match the last words of the input text
  • Extract the next word of each of these N-grams
  • Return the most frequent next word

The application starts by looking at 5-grams (if the input text is long enough); if none match the input, it looks at 4-grams, and so on down to unigrams. If no N-gram matches the input text, the application assumes the end of the sentence has been reached and returns ".".
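
The fallback from longer to shorter N-grams can be sketched in Python as follows. This is a minimal sketch, not the application's code: the function name `predict_next` and the layout of `counts` (one counter of N-gram tuples per order) are assumptions, and the toy counts are made up for the example. At the unigram level the sketch assumes an empty context, so it returns the most frequent single word; the description above does not spell out that case.

```python
from collections import Counter

def predict_next(tokens, counts, max_n=5):
    """Return the most frequent next word, backing off to shorter N-grams."""
    # Start from the highest order that the input length allows.
    for n in range(min(max_n, len(tokens) + 1), 0, -1):
        # The context is the last n-1 input words (empty for unigrams).
        context = tuple(tokens[len(tokens) - (n - 1):])
        followers = Counter()
        for gram, freq in counts.get(n, {}).items():
            if gram[:-1] == context:
                followers[gram[-1]] += freq
        if followers:
            return followers.most_common(1)[0][0]
    # Nothing matched at any order: assume the sentence has ended.
    return "."

# Toy counts, invented for illustration only.
counts = {
    1: Counter({("the",): 3, ("cat",): 2, ("sat",): 1}),
    2: Counter({("the", "cat"): 2, ("the", "dog"): 1, ("cat", "sat"): 1}),
    3: Counter({("the", "cat", "sat"): 1}),
}
print(predict_next(["i", "saw", "the"], counts))  # falls back to bigrams: cat
print(predict_next(["the", "cat"], counts))       # matches a trigram: sat
```

The linear scan over each table keeps the sketch short; a lookup keyed by the context words would avoid scanning every N-gram on each prediction.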

Application link

End

Thanks for your attention!