9/17/2021

The App

This project uses the Stupid Backoff algorithm to predict the next word of a phrase. To use it, simply enter your phrase in the text box and click Submit!

This is what it looks like in action!

The Algorithm

  • The algorithm is rooted in conditional probability, in the spirit of Bayes' theorem. For a given sequence of words, it scores each candidate next word by dividing the number of times the sequence was observed followed by that word by the number of times the sequence itself was observed.
  • So that matches on longer sequences win out, scores that come from shorter sequences are discounted once for each word that was dropped to shorten them.
  • Finally, if all else fails, we pick at random from the most commonly observed words.
  • You can read more about this and more sophisticated algorithms in this article.
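
The steps above can be sketched in a few lines. This is a minimal illustration in Python, not the R code behind the app: the function names `build_counts` and `predict_next`, the toy corpus, the trigram limit, and the `top_k` cutoff are all made up for the example, and the discount of 0.4 per dropped word is the value commonly quoted for Stupid Backoff rather than one stated in this presentation.

```python
import random
from collections import Counter

ALPHA = 0.4  # assumed discount per dropped word; not a value taken from this project


def build_counts(tokens, max_n=3):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict_next(phrase, counts, max_n=3, top_k=3, rng=random):
    """Predict the next word, backing off to shorter contexts when needed."""
    words = phrase.lower().split()
    # Use at most the last (max_n - 1) words as the starting context.
    context = tuple(words[-(max_n - 1):]) if max_n > 1 else ()
    scores = {}
    discount = 1.0
    while context:
        context_count = counts.get(context, 0)
        if context_count:
            for ngram, c in counts.items():
                if len(ngram) == len(context) + 1 and ngram[:-1] == context:
                    # Score = count(context + word) / count(context), discounted.
                    # setdefault keeps the score from the longest matching context.
                    scores.setdefault(ngram[-1], discount * c / context_count)
        context = context[1:]   # drop the oldest word
        discount *= ALPHA       # and discount the shorter context
    if scores:
        return max(scores, key=scores.get)
    # Last resort: random pick among the most frequent single words.
    unigrams = [(g[0], c) for g, c in counts.items() if len(g) == 1]
    unigrams.sort(key=lambda pair: -pair[1])
    return rng.choice([w for w, _ in unigrams[:top_k]])


tokens = "the cat sat on the mat the cat sat on the rug the cat ran".split()
counts = build_counts(tokens)
print(predict_next("the cat", counts))    # sat
print(predict_next("a dog sat", counts))  # on (found only after backing off to "sat")
```

Scanning every n-gram on each lookup keeps the sketch short; a real app would index the counts by context so that lookups stay fast on a large corpus.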

Limitations

  • Despite extensive use of the quanteda package in R and related packages, which greatly sped up processing of the data, hardware limitations meant I could only use 15% of the data we were provided. This algorithm, like pretty much any other, would predict better if it had access to more data.
  • The Stupid Backoff algorithm is remarkable mostly because of its firm foundation in math and its strong intuition. Its simplicity made it possible for me and many others to program it from scratch. However, it also has limited predictive power, especially on out-of-sample data.

Conclusion

  • Thank you for sticking through this course with me and for reading this presentation!
  • Full credit to the team behind the Data Science Specialization on Coursera by JHU. I learned many of the skills and tools I used from them.
  • Thank you also to everyone involved in the quanteda and data.table packages in R. Without them this project would have been extremely difficult!