Introduction

  • This is a presentation on an app which predicts the next word from a given phrase.

  • When a user types a phrase in the app's input box, the application returns the three most frequent words that might follow the input phrase.

  • The application also assigns each prediction a weighting, calculated with the stupid backoff algorithm, to suggest the likelihood that the prediction is correct.

Cleaning and building N-Grams

  • I used a sample of Twitter, news and blogs data.
  • I then cleaned this data, which included removing profanities, numbers and symbols.
  • This data was then used to create four data sets of n-grams: tetra-grams (four-word phrases), tri-grams (three-word phrases), bi-grams (two-word phrases) and uni-grams (single words). Each data set records every n-gram and its frequency.
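
As a rough illustration of the cleaning and counting steps above, here is a minimal Python sketch. It is not the app's actual code: the function names, the one-word profanity list and the two sample sentences are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real profanity list.
PROFANITIES = {"darn"}

def clean(text):
    """Lowercase, drop numbers and symbols, and remove listed profanities."""
    text = text.lower()
    # Keep only letters, apostrophes and spaces; everything else becomes a space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    return [w for w in text.split() if w not in PROFANITIES]

def ngram_counts(lines, n):
    """Count every run of n consecutive words, one line at a time."""
    counts = Counter()
    for line in lines:
        words = clean(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Made-up sample text, standing in for the Twitter, news and blogs sample.
sample = ["How are you doing today?", "How are you 2 doing, darn it!"]

# One frequency table per n-gram size: uni-, bi-, tri- and tetra-grams.
tables = {n: ngram_counts(sample, n) for n in (1, 2, 3, 4)}
```

Here `tables[4]` maps each four-word phrase to how often it appears, which is the kind of n-gram and frequency data the prediction step needs.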

Prediction model

I have used the stupid backoff algorithm with a lambda value of 0.4.

I will explain the algorithm via an example. Suppose we have the phrase “how are you”. The algorithm will first look at the 4-grams to see if any start with “how are you”. Let's suppose there are 4-grams that start with “how are you”, and in particular that one of them, “how are you doing”, appears 5 times. You would then look at how many times the 3-gram “how are you” appears; suppose it appears 10 times. The prediction of the word “doing” would then get a score of 5/10 = 0.5.

However, if there were no 4-grams that start with “how are you”, you would then look at the 3-grams that start with “are you” and repeat the process above, dividing each 3-gram's count by the count of the 2-gram “are you”. This is not likely to be nearly as good an estimate for the word following “how are you” as it would be if matching 4-grams existed, so you would multiply the scores by 0.4.

This would then continue with the 2-grams if necessary, multiplying the scores by 0.4 again.
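
The back-off procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the app's actual code: the `predict` function and the toy counts are made up, with the counts chosen to match the worked example.

```python
from collections import Counter

LAMBDA = 0.4  # penalty applied once for each back-off step

# Toy n-gram counts (illustrative only, not real data).
counts = Counter({
    ("how", "are", "you", "doing"): 5,
    ("how", "are", "you", "today"): 2,
    ("how", "are", "you"): 10,
    ("are", "you", "ok"): 3,
    ("are", "you"): 12,
})

def predict(phrase, counts, top=3, max_order=4):
    """Return up to `top` (word, score) pairs, best first."""
    words = phrase.lower().split()
    penalty = 1.0
    # Try the longest context first (3 words for 4-grams), then shorter ones.
    for k in range(min(max_order - 1, len(words)), 0, -1):
        context = tuple(words[-k:])
        # All (k+1)-grams that start with the context, keyed by their last word.
        matches = {g[-1]: c for g, c in counts.items()
                   if len(g) == k + 1 and g[:k] == context}
        total = counts[context]
        if matches and total:
            scored = {w: penalty * c / total for w, c in matches.items()}
            return sorted(scored.items(), key=lambda kv: -kv[1])[:top]
        penalty *= LAMBDA  # nothing found: back off and discount
    return []

print(predict("how are you", counts))    # scored directly from 4-grams
print(predict("where are you", counts))  # backs off to 3-grams, scores x 0.4
```

With these toy counts, “how are you” gives “doing” a score of 5/10 = 0.5, while “where are you” finds no matching 4-gram, so “ok” is scored from the 3-grams as 0.4 × 3/12.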

How the app works

  • You can find the app here: https://emilyastone1.shinyapps.io/wordpredictor/

  • You need to enter the phrase in the box, and then press the “Click for Predictions” button.

  • The application will output the top three predictions based on the sample data.