Data Science Capstone - Slide Deck
Juan Carlos Carmona Calvo
May 28th, 2019
2.- Introduction - App NextWord
- NextWord is an app that predicts the next word for a text the user has already entered.
- This prediction is done with a very simple Back-off algorithm.
- The data provided consists of 3 files: news, blogs and Twitter. They are very large, so we work with a random 5% sample.
- We perform a very thorough cleaning and normalization of data samples.
- We then build the N-gram database used by our prediction algorithm.
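The 5% random sample can be illustrated with a minimal Python sketch. It assumes the sampling is done line by line with a fixed seed; the function name `sample_lines`, the seed value and the toy data are hypothetical and are not taken from the app.

```python
import random

def sample_lines(lines, fraction=0.05, seed=1234):
    """Return a random subset holding about `fraction` of the lines."""
    rng = random.Random(seed)            # fixed seed: the sample is reproducible
    k = round(len(lines) * fraction)     # number of lines to keep
    return rng.sample(lines, k)          # sampling without replacement

# Toy stand-in for one of the three source files (hypothetical data).
blog_lines = [f"blog line {i}" for i in range(1000)]
sample = sample_lines(blog_lines)
print(len(sample))  # 50
```

Sampling without replacement keeps each source line at most once, so the sample stays a true subset of the original file.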
3.- Simple Back-off Algorithm
- If the text provided is empty, we use the 1-gram table to propose the 3 most frequent words.
- With the last 3 words of the text provided, we use the 4-gram table to propose the 3 most frequent next words.
- With the last 2 words of the text provided, we use the 3-gram table to propose the 3 most frequent next words.
- With the last word of the text provided, we use the 2-gram table to propose the 3 most frequent next words.
- If none of the previous searches returns a result, we use the 1-gram table to propose the 3 most frequent words.
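The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the app's actual code: each table is assumed to map a tuple of context words to a counter of next words, the search is assumed to stop at the first level that returns any result, and the names `top_words`, `predict_next` and the toy tables are hypothetical.

```python
from collections import Counter

def top_words(table, context, k=3):
    """Return up to k most frequent next words for `context` in one table."""
    counts = table.get(context, Counter())
    return [word for word, _ in counts.most_common(k)]

def predict_next(words, tables, k=3):
    """Try the 4-, 3- and 2-gram tables in turn, then fall back to unigrams.

    tables[n] maps a tuple of n-1 context words to a Counter of next words;
    tables[1] uses the empty tuple as its only context.
    Returns the N-gram order that answered and the proposed words.
    """
    for n in (4, 3, 2):
        if len(words) >= n - 1:                  # enough words for this order
            context = tuple(words[-(n - 1):])    # last n-1 words of the text
            found = top_words(tables[n], context, k)
            if found:
                return n, found
    return 1, top_words(tables[1], (), k)        # empty text or nothing found

# Toy tables (hypothetical counts, for illustration only).
tables = {
    1: {(): Counter({"the": 5, "to": 4, "and": 3, "a": 2})},
    2: {("thank",): Counter({"you": 3})},
    3: {("thank", "you"): Counter({"for": 2, "so": 1})},
    4: {("i", "love", "you"): Counter({"too": 2})},
}

print(predict_next([], tables))                        # (1, ['the', 'to', 'and'])
print(predict_next(["well", "thank", "you"], tables))  # (3, ['for', 'so'])
```

Returning the N-gram order together with the words makes it easy to report which table produced the prediction.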
4.- Training & Predicting Database
- Initially we created several random samples of the data (5%, 10%, 20%, 30%, 40% and 50%) and built 4 N-gram tables for each sample. The response times and memory limits of shinyapps.io allow us to use only one: the 5% sample.
- We create an initial Corpus for the sample.
- We clean the corpus by removing punctuation marks, symbols, separators, Twitter-specific tokens, scripts, etc. We also apply a profanity filter.
- We then tokenize the corpus and build the 4 N-gram tables.
- We apply the same transformations to the text provided every time, before running the prediction algorithm.
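The cleaning and tokenizing steps above might look like the following Python sketch. The exact rules are assumptions: lowercasing, the two regular expressions, the placeholder `PROFANITY` set and the names `clean` and `build_tables` are hypothetical stand-ins for whatever the app really uses.

```python
import re
from collections import Counter, defaultdict

PROFANITY = {"darn"}  # hypothetical placeholder for a real profanity list

def clean(text):
    """Normalize one line of text and return its list of word tokens."""
    text = text.lower()
    text = re.sub(r"[@#]\w+", " ", text)      # drop Twitter handles and hashtags
    text = re.sub(r"[^a-z'\s]", " ", text)    # drop punctuation, symbols, digits
    words = (w.strip("'") for w in text.split())
    return [w for w in words if w and w not in PROFANITY]

def build_tables(lines, max_n=4):
    """Count N-grams of order 1..max_n.

    tables[n] maps a tuple of n-1 context words to a Counter of next words.
    """
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for line in lines:
        words = clean(line)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[n][context][words[i + n - 1]] += 1
    return tables

lines = ["Thank you for the follow, @someone!", "Thank you so much #blessed"]
tables = build_tables(lines)
print(clean(lines[0]))          # ['thank', 'you', 'for', 'the', 'follow']
print(tables[2][("thank",)])    # Counter({'you': 2})
```

Running `clean` on the user's text before each lookup guarantees that the input is tokenized exactly like the training sample.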
5.- App NextWord
- We load the 4 N-gram tables for the 5% sample, the only one we can use on shinyapps.io.
- We can paste a text or type it directly into the text field. The prediction is made only when the text ends with a blank space, which tells us that the last word has been written completely.
- The 3 most frequent words are offered through 3 dynamically built buttons. When you press any of them, the app automatically adds the word to the text and makes the next prediction.
- On the right we show information about the whole process: the sample, the prediction, the N-gram used, etc. At the end appear the transformed text that we use in the algorithm and the full text with the best prediction.
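The input handling described above can be sketched in Python, independently of any user interface code. The helper names `ready_to_predict` and `append_word` are hypothetical. The sketch also assumes that an empty text counts as ready (so the 1-gram proposal can be shown) and that a pressed button appends the word followed by a space.

```python
def ready_to_predict(text):
    """True when the text is empty or ends in whitespace (last word complete)."""
    return text == "" or text[-1].isspace()

def append_word(text, word):
    """Simulate pressing a suggestion button: add the word and a trailing space."""
    return text + word + " "

text = "thank you "
print(ready_to_predict("thank yo"))   # False
print(ready_to_predict(text))         # True
text = append_word(text, "for")
print(repr(text))                     # 'thank you for '
```

Because the appended word ends with a space, the new text immediately satisfies `ready_to_predict`, which is what lets each button press trigger the next prediction.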