Word Prediction Assignment

Remco Bekker
April 19th 2019

The assignment

  • Create a word suggestion algorithm that suggests the 3 words a user is most likely to type next, based on what they have already typed
  • Make sure it can run in a WhatsApp-like context, for instance on a mobile phone
  • Train the algorithm based on a provided corpus of data (and additional resources if convenient)
  • Create a Shiny app to demonstrate your algorithm and show its effectiveness
  • Create a presentation to pitch the Shiny app (which is this presentation!)

The start

Three text files were provided for training the algorithm:

  • en_US.blogs.txt (approximately 900k lines and 3.7M words)
  • en_US.news.txt (approximately 1M lines and 3.4M words)
  • en_US.twitter.txt (approximately 2.3M lines and 3M words)

The first lines of the “en_US.blogs.txt” file look like…

[1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
[2] "We love you Mr. Brown."                                                                      

The process

  1. I started to explore the data and read up on natural language processing to learn how to approach the problem. The result of this exploration can be found here.
  2. It turned out to be useful to split the data into n-grams (bigrams, trigrams and quadgrams) and, relying on the Markov assumption, to rank the candidate next words by conditional frequency, with a back-off model that falls back to shorter n-grams when a longer context has not been seen.
  3. Two quizzes required processing the complete text files, which was challenging because of the amount of data.
  4. Next up was creating this presentation and the Shiny app to demonstrate the algorithm.
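
To make step 2 more concrete, here is a minimal sketch of n-gram counting with a simple back-off lookup. It is not the code behind the app: it is written in Python purely for illustration, and the names tokenize, build_model and suggest, the whitespace tokenizer and the toy corpus are all assumptions. Within one context, ranking next words by raw count gives the same order as ranking by conditional frequency, because the denominator (the count of the context) is the same for every candidate.

```python
from collections import Counter, defaultdict


def tokenize(line):
    # Assumption: a deliberately naive tokenizer (lowercase, split on whitespace).
    return line.lower().split()


def build_model(lines, max_n=4):
    # Map each context (a tuple of 1 to max_n - 1 words) to a Counter of the
    # words that followed it. This covers bigrams, trigrams and quadgrams.
    model = defaultdict(Counter)
    for line in lines:
        words = tokenize(line)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                next_word = words[i + n - 1]
                model[context][next_word] += 1
    return model


def suggest(model, text, k=3, max_n=4):
    # Markov assumption: only the last max_n - 1 words matter.
    # Back-off: try the longest context first, then shorter ones,
    # until k distinct suggestions have been collected.
    words = tokenize(text)
    suggestions = []
    for size in range(min(max_n - 1, len(words)), 0, -1):
        counts = model.get(tuple(words[-size:]))
        if counts is None:
            continue
        for word, _ in counts.most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions


# Hypothetical toy corpus, standing in for the provided text files.
corpus = [
    "i love you",
    "i love you too",
    "i love pizza",
    "we love you",
    "we love cake",
]
model = build_model(corpus)
print(suggest(model, "i love"))  # ['you', 'pizza', 'cake']
```

In this toy example the context "i love" yields only two candidates ("you" and "pizza"), so the lookup backs off to the shorter context "love" to find the third suggestion ("cake").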

The result

Word suggestion app in action!

Try it yourself!

Now try the app for yourself by clicking here!