Word Prediction Assignment

Remco Bekker
April 19th 2019

The assignment

  • Create a word suggestion algorithm that suggests the 3 words a user is most likely to type next, based on what they have already typed
  • Make sure it could run in a WhatsApp-like context, for instance on a mobile phone
  • Train the algorithm based on a provided corpus of data (and additional resources if convenient)
  • Create a Shiny app to demonstrate your algorithm and show its effectiveness
  • Create a presentation to pitch the Shiny app (which is this presentation!)

The start

Three text files were provided for training the algorithm:

  • en_US.blogs.txt (approximately 900k lines and 3.7M words)
  • en_US.news.txt (approximately 1M lines and 3.4M words)
  • en_US.twitter.txt (approximately 2.3M lines and 3M words)

The first lines of the “en_US.blogs.txt” file look like…

[1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
[2] "We love you Mr. Brown."                                                                      

How the app works

  1. I started to explore the data (see exploratory analysis)
  2. Bigrams, trigrams and quadgrams were determined
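  3. N-grams with a frequency of 1 were discarded, and each remaining n-gram was split into a context (all words but the last) and a next word (the last word)
  4. For each context, the conditional frequency of every next word was determined, and only the top 8 n-grams per context were kept
  5. Prediction relies on the Markov assumption (only the last few typed words are used as context) combined with a backoff model (when the longest context yields no or too few suggestions, a shorter context is tried)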
  6. Accuracy of the suggestion model was tested by holding out a sample of each of the 3 provided files
  7. For each word in the test samples, a prediction was counted as a success only if the actual word was among the top 3 suggestions
  8. A success rate of 21% to 25% was achieved
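
The steps above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the code behind the app; the function names (`build_model`, `suggest`, `top3_accuracy`), the whitespace tokenizer, the toy corpus and the simple backoff order (longest context first, then shorter ones) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def tokenize(line):
    # Assumption: lowercase + whitespace split; the real app may clean text differently.
    return line.lower().split()


def build_model(lines, max_n=4, min_count=2, top_k=8):
    """Count bigrams up to quadgrams, prune rare ones, keep the top_k next words per context."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1

    # Discard n-grams seen only once, then split each n-gram into (context, next word).
    by_context = defaultdict(list)
    for ngram, count in counts.items():
        if count >= min_count:
            by_context[ngram[:-1]].append((ngram[-1], count))

    # Conditional frequency of each next word within its context (over the kept n-grams).
    model = {}
    for context, followers in by_context.items():
        total = sum(count for _, count in followers)
        followers.sort(key=lambda wc: (-wc[1], wc[0]))  # most frequent first, ties by word
        model[context] = [(word, count / total) for word, count in followers[:top_k]]
    return model


def suggest(model, text, k=3, max_n=4):
    """Return up to k next-word suggestions, backing off from long to short contexts."""
    tokens = tokenize(text)
    suggestions = []
    for size in range(max_n - 1, 0, -1):  # Markov assumption: at most max_n - 1 words of context
        if len(tokens) < size:
            continue
        for word, _ in model.get(tuple(tokens[-size:]), []):
            if word not in suggestions:
                suggestions.append(word)
                if len(suggestions) == k:
                    return suggestions
    return suggestions


def top3_accuracy(model, lines):
    """Share of words in held-out lines that appear among the top 3 suggestions."""
    hits = total = 0
    for line in lines:
        tokens = tokenize(line)
        for i in range(1, len(tokens)):
            total += 1
            if tokens[i] in suggest(model, " ".join(tokens[:i])):
                hits += 1
    return hits / total if total else 0.0


# Toy corpus (hypothetical, only to make the sketch runnable).
corpus = [
    "i love you so much",
    "i love you so much",
    "i love you too",
    "i love you too",
    "we love you",
]
model = build_model(corpus)
print(suggest(model, "I love you"))  # ['so', 'too']
print(suggest(model, "we love"))     # ['you'] (backs off to the context "love")
```

Keeping only a handful of next words per context keeps the lookup table small, which is one plausible way to meet the mobile-phone constraint from the assignment.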

How to use the app

Word suggestion app in action!

Try it yourself!

Now try the app for yourself by clicking here!