This project lies within the field of Natural Language Processing (NLP), a subfield of artificial intelligence concerned with the understanding of language by machines.
The goal of this project is to develop an NLP tool that predicts the next word of a phrase given some context.
The training data provided consists of a compilation of text from three sources: Twitter, blogs, and news. The main characteristics of the data are the following:
| Source  | Word count | Line count | File size (MB) |
|---------|-----------:|-----------:|---------------:|
| Twitter | 30,373,543 | 2,360,148  | 334.48         |
| Blog    | 37,334,131 | 899,288    | 267.76         |
| News    | 2,643,969  | 77,259     | 20.73          |
N-gram language models and Stupid Backoff
Models that assign probabilities to sequences of n words (n-grams) are called n-gram models.
The goal is to compute the probability of a new word given some history. E.g.: \( \small P(word|history) = P(exam|he~studied~and~passed~the) \)
To simplify, we use the n-gram assumption, by which the probability of the next word depends only on the preceding n−1 words. E.g.:
\( \small n = 2 \rightarrow P(exam|he~studied~and~passed~the) \approx P(exam|the) \)
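In practice, these conditional probabilities are estimated from relative frequencies in the training corpus; for the bigram example above:

\( \small P(exam|the) \approx \frac{count(the~exam)}{count(the)} \)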
We used Stupid Backoff to score candidate words, using up to 3-grams when available (for more information, see Brants et al., 2007):
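\( \small S(w_i \mid w_{i-2}~w_{i-1}) = \begin{cases} \frac{count(w_{i-2}~w_{i-1}~w_i)}{count(w_{i-2}~w_{i-1})} & \text{if } count(w_{i-2}~w_{i-1}~w_i) > 0 \\ \alpha \cdot S(w_i \mid w_{i-1}) & \text{otherwise} \end{cases} \)

The recursion ends at the unigram relative frequency \( \small S(w_i) = \frac{count(w_i)}{N} \), with \( \small \alpha = 0.4 \) as recommended by Brants et al. The sketch below illustrates this scoring scheme in R, assuming the n-gram counts are kept in named numeric vectors keyed by space-joined n-grams; the table names, helper function, and toy counts are illustrative, not the application's actual code.

```r
# Illustrative sketch of Stupid Backoff scoring (Brants et al., 2007).
# Assumes counts stored as named numeric vectors keyed by space-joined n-grams.
count_of <- function(tbl, key) if (key %in% names(tbl)) tbl[[key]] else 0

sb_score <- function(word, context, n1, n2, n3, alpha = 0.4) {
  context <- tail(context, 2)                    # a 3-gram model sees the last 2 words
  if (length(context) == 2) {
    tri  <- paste(c(context, word), collapse = " ")
    hist <- paste(context, collapse = " ")
    if (count_of(n3, tri) > 0) return(count_of(n3, tri) / count_of(n2, hist))
    return(alpha * sb_score(word, context[2], n1, n2, n3, alpha))   # back off to 2-gram
  }
  if (length(context) == 1) {
    bi <- paste(c(context, word), collapse = " ")
    if (count_of(n2, bi) > 0) return(count_of(n2, bi) / count_of(n1, context))
    return(alpha * sb_score(word, character(0), n1, n2, n3, alpha)) # back off to 1-gram
  }
  count_of(n1, word) / sum(n1)                   # unigram relative frequency
}

# Toy counts built from the sentence "he studied and passed the exam"
n1 <- c(he = 1, studied = 1, and = 1, passed = 1, the = 1, exam = 1)
n2 <- c("passed the" = 1, "the exam" = 1)
n3 <- c("passed the exam" = 1)
sb_score("exam", c("passed", "the"), n1, n2, n3)  # 1.0 (observed 3-gram)
sb_score("exam", c("failed", "the"), n1, n2, n3)  # 0.4 (backs off to "the exam")
```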
The Shiny application
We developed the following Shiny application based on this algorithm:
Then, just type some text in the input bar and click the Predict button. The three most likely words, together with the single most probable one, are displayed below.
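As a rough illustration of this interface, a minimal Shiny skeleton could look like the sketch below; the `predict_next_word()` stub stands in for the Stupid Backoff predictor and is not the application's actual code.

```r
library(shiny)

# Placeholder standing in for the Stupid Backoff predictor described above
predict_next_word <- function(text) c("exam", "test", "course")

ui <- fluidPage(
  textInput("phrase", "Type some text:"),
  actionButton("go", "Predict"),
  h4("Top 3 candidates"), textOutput("top3"),
  h4("Most likely word"), textOutput("best")
)

server <- function(input, output) {
  candidates <- eventReactive(input$go, predict_next_word(input$phrase))
  output$top3 <- renderText(paste(candidates(), collapse = ", "))
  output$best <- renderText(candidates()[1])
}

shinyApp(ui, server)
```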
Thorsten Brants et al. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858–867.
Accessed from: https://www.aclweb.org/anthology/D07-1090.pdf