Data Science Capstone - Slide Deck

Juan Carlos Carmona Calvo
May 28th, 2019

NextWord is an app that predicts the next word from the text entered so far.
The prediction uses a simple back-off algorithm.
The data provided consists of 3 files: news, blogs and Twitter. Because they are very large, we work with a random sample of 5%.
We thoroughly clean and normalize the data sample.
We build the N-gram database that our prediction algorithm uses.
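
The 5% random sample mentioned above could be drawn along these lines. This is a minimal sketch in Python, not the app's actual code: the function name `sample_lines`, the fixed seed and the toy input are all assumptions made for illustration.

```python
import random

def sample_lines(lines, fraction=0.05, seed=2019):
    """Return a random subset holding roughly `fraction` of the lines."""
    rng = random.Random(seed)          # fixed seed (assumed) so the sample is reproducible
    k = round(len(lines) * fraction)   # number of lines to keep
    return rng.sample(lines, k)        # sampling without replacement

# Toy stand-in for one of the three source files (news, blogs, Twitter).
lines = [f"line {i}" for i in range(1000)]
sample = sample_lines(lines)
print(len(sample))  # 50
```

Sampling whole lines without replacement keeps each sentence intact, which matters later when N-grams are counted inside a line.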

If the text provided is empty, we use the 1-gram table to propose the 3 most frequent words.
With the last 3 words of the text provided, we search the 4-gram table and propose the 3 most frequent next words.
With the last 2 words of the text provided, we search the 3-gram table and propose the 3 most frequent next words.
With the last word of the text provided, we search the 2-gram table and propose the 3 most frequent next words.
If none of the previous searches returns a result, we fall back to the 1-gram table and propose the 3 most frequent words.
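
The steps above can be sketched as follows. This is an illustrative Python sketch, not the app's implementation: the function `predict`, the layout of `tables` (a dict from N to a mapping of word tuples to counts) and the toy counts are assumptions. It also assumes that the search stops at the first N-gram order that returns any candidate, which the slide does not state explicitly.

```python
from collections import Counter

def predict(words, tables, top=3):
    """Back off from the 4-gram table down to the 1-gram table.

    `tables[n]` maps an n-gram (a tuple of n words) to its frequency.
    """
    for n in (4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough words typed for this order
        # Count every word that follows the context in the n-gram table.
        candidates = Counter({
            gram[-1]: count
            for gram, count in tables[n].items()
            if gram[:-1] == context
        })
        if candidates:  # assumption: stop at the first order with a match
            return [w for w, _ in candidates.most_common(top)]
    # Empty input, or no match at any order: most frequent single words.
    return [g[0] for g, _ in Counter(tables[1]).most_common(top)]

# Toy tables with made-up counts.
tables = {
    1: {("the",): 10, ("of",): 7, ("and",): 5, ("cat",): 2},
    2: {("of", "the"): 4, ("of", "a"): 2, ("thank", "you"): 3},
    3: {("one", "of", "the"): 3, ("one", "of", "my"): 1},
    4: {("is", "one", "of", "the"): 2},
}

print(predict(["this", "is", "one", "of"], tables))  # ['the']
print(predict(["one", "of"], tables))                # ['the', 'my']
print(predict(["zzz"], tables))                      # ['the', 'of', 'and']
print(predict([], tables))                           # ['the', 'of', 'and']
```

With this stopping rule a higher-order match can return fewer than 3 words; filling the remaining slots from lower orders would be an alternative design.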

Initially we created several random samples of the data (5%, 10%, 20%, 30%, 40% and 50%) and built the 4 N-gram tables for each sample. The response times and memory limits of shinyapps.io force us to use only one: the 5% sample.
We create an initial corpus for the sample.
We clean the corpus by removing punctuation marks, symbols, separators, Twitter-specific markup, hyphens, etc. We also apply a profanity filter.
We tokenize the corpus, creating 4 N-gram tables (1-grams to 4-grams).
We apply these same transformations to the text provided every time, before running the prediction algorithm.
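
The cleaning and tokenizing steps could look roughly like this. It is a sketch under stated assumptions, not the app's pipeline: the names `clean` and `ngram_table`, the lowercasing, the exact regular expressions and the one-word placeholder `PROFANITY` list are all hypothetical choices for illustration.

```python
import re
from collections import Counter

PROFANITY = {"darn"}  # placeholder; the real filter's word list is not given

def clean(text):
    """Normalize a text and return its list of word tokens."""
    text = text.lower()                      # assumed normalization step
    text = re.sub(r"[@#]\w+", " ", text)     # Twitter mentions and hashtags
    text = re.sub(r"[^a-z'\s]", " ", text)   # punctuation, symbols, digits, hyphens
    words = (t.strip("'") for t in text.split())
    return [w for w in words if w and w not in PROFANITY]

def ngram_table(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = clean("Hello, @bob! It's a darn #nice day-trip.")
print(tokens)  # ['hello', "it's", 'a', 'day', 'trip']

# One table per order, 1-grams to 4-grams.
tables = {n: ngram_table(tokens, n) for n in (1, 2, 3, 4)}
print(tables[2][("day", "trip")])  # 1
```

Running the same `clean` function on the user's input before each lookup keeps the typed words in the same form as the keys stored in the tables.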

We load the 4 N-gram tables for the 5% sample, the only one we can use on shinyapps.io.
You can paste a text or type it directly into the text field. The prediction runs only when the text ends with a blank space, which tells us that the last word has been typed completely.
The 3 most frequent words are offered through 3 dynamically built buttons. When you press any of them, the app appends the word to the text automatically and runs the next prediction.
On the right, we show information about the whole process: the sample, the prediction, the N-gram used, etc. At the end appear the transformed text that we use in the algorithm and the full text with the best prediction.
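
The trailing-space rule and the button behavior can be illustrated with a small Python sketch. The helper names `ready_to_predict` and `append_word` are hypothetical, and two details are assumptions: that an empty text field also triggers a prediction (from the most frequent single words), and that pressing a button appends the word followed by a space so the next prediction can run.

```python
def ready_to_predict(text):
    """True when the field is empty or the last word has been completed."""
    return text == "" or text[-1].isspace()

def append_word(text, word):
    """Simulate pressing a suggestion button: add the word and a space."""
    return text + word + " "

print(ready_to_predict("thank you "))  # True
print(ready_to_predict("thank yo"))    # False

text = append_word("thank you ", "very")
print(repr(text))                      # 'thank you very '
print(ready_to_predict(text))          # True
```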