Roberto Preste
2018-12-30
The aim of the Data Science Capstone project was to create a simple (but functional) Shiny App that would predict the next word from a given phrase.
The task was definitely interesting and stimulating, since it involved delving into a new subject, Natural Language Processing (NLP), understanding its peculiarities and building an efficient modeling application. I also had to sample the training data carefully, so that the application would run smoothly and without noticeable delay while still producing coherent results.
The app I created is available on shinyapps.io.
Documentation can be found on GitHub.
The pitch presentation for the app is on rpubs.com.
The data used in this project come from a corpus of sentences in four different languages, gathered from blog posts, news articles and tweets, and are available from this link. I chose to use the English subset; some examples of these sentences are:
"How are you? Btw thanks for the RT. You gonna be in DC anytime soon?"
"If you have an alternative argument, let's hear it! :)"
"He wasn't home alone, apparently."
Before applying any modeling approach, these data had to be cleaned and transformed into a suitable form. This meant removing punctuation, symbols and stop words, as well as dropping profanities, based on a list available in this GitHub repository.
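As a rough illustration, the cleaning pipeline can be sketched with the `tm` package; the file names (`en_US.blogs.txt`, `profanity_list.txt`) and the sample size below are placeholders for this example, not the exact values used in the app:

```r
library(tm)

# Read a subset of the English corpus (file name is a placeholder)
raw_text <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
set.seed(1234)
raw_text <- sample(raw_text, size = 10000)   # sample to keep the app responsive

# Profanity list downloaded from the GitHub repository mentioned above
profanities <- readLines("profanity_list.txt")

corpus <- VCorpus(VectorSource(raw_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, profanities)
corpus <- tm_map(corpus, stripWhitespace)
```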
The clean corpus was used to train a model able to predict the next word in a given sentence. I used a Katz back-off model, which estimates the conditional probability of a word given its history, i.e. the preceding n-gram (set of words). This means that the last word of a sentence is predicted based on the previous 1, 2 or more words.
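To give an idea of how the n-gram tables behind such a model can be built, here is a minimal sketch using `tidytext` and `dplyr`; `clean_text` stands for the cleaned sentences from the previous step and is an assumed variable for this example, not code taken from the app:

```r
library(dplyr)
library(tidytext)

# 'clean_text' is assumed to be a character vector of cleaned sentences
text_df <- tibble(line = seq_along(clean_text), text = clean_text)

# Frequency tables for unigrams, bigrams and trigrams
unigrams <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

trigrams <- text_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)
```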
To build this model, we need to calculate the probabilities of encountering a specific word or combination of words in the available training corpus. In such a model, however, one also has to account for unseen n-grams, i.e. sets of words that are not observed in the training data.
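Before any adjustment, the simplest estimate is the maximum-likelihood one, which uses only observed counts (shown here for a trigram, as a general textbook formulation rather than code from the app):

$$
P_{ML}(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
$$

This estimate assigns zero probability to any n-gram never seen in training, which is exactly the problem that needs to be addressed.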
This is handled with discounting, which, simply stated, means retaining some of the probability mass from the observed n-grams and redistributing it to unobserved n-grams.
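For a bigram model, the Katz back-off probability can be written as follows (a standard textbook formulation; the exact discount used in the app may differ):

$$
P_{katz}(w_i \mid w_{i-1}) =
\begin{cases}
d \cdot \dfrac{C(w_{i-1}\, w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1}\, w_i) > 0 \\
\alpha(w_{i-1}) \cdot P(w_i) & \text{otherwise}
\end{cases}
$$

Here $d$ is the discount applied to observed bigram counts, and $\alpha(w_{i-1})$ is the back-off weight that redistributes the probability mass freed up by discounting onto words never seen after $w_{i-1}$, for which the model falls back on the unigram probability $P(w_i)$.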