Data Science Capstone - Slide Deck

Juan Carlos Carmona Calvo
May 28th, 2019

App: http://juanky201271.shinyapps.io/DSC-CP-JCCC-ENG/.

Repo: https://github.com/juanky201271/cp-data-science-capstone/.

2.- Introduction - App NextWord

  • NextWord is an app that predicts the next word from the text entered so far.
  • The prediction uses a very simple back-off algorithm.
  • The data provided consists of 3 files: news, blogs and Twitter. They are very large, so we work with a random 5% sample.
  • We thoroughly clean and normalize the data sample.
  • We build an N-gram database for the prediction algorithm to query.
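
The 5% random sample can be sketched as follows. This is a minimal illustration in Python, not the code from the app's repository. The function name sample_lines, the fixed seed, and the choice to keep each line independently with probability 0.05 are all assumptions. The deck does not say how the sample was drawn.

```python
import random

def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line independently with probability `fraction`.

    Hypothetical helper: the seed only makes the sample reproducible.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Usage with made-up data: roughly 5% of the lines survive.
corpus = [f"line {i}" for i in range(10000)]
sample = sample_lines(corpus)
print(len(sample), "of", len(corpus), "lines kept")
```

With this approach the sample size is only approximately 5% of the input. Drawing an exact count with random.sample would be an equally reasonable choice.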

3.- Simple Back-off Algorithm

  • If the text provided is empty, we use the 1-gram table to propose the 3 most frequent words.
  • Otherwise, with the last 3 words of the text provided, we use the 4-gram table to propose the 3 most frequent next words.
  • If that gives no result, with the last 2 words of the text provided, we use the 3-gram table to propose the 3 most frequent next words.
  • If that also gives no result, with the last word of the text provided, we use the 2-gram table to propose the 3 most frequent next words.
  • If none of the previous searches returns a result, we fall back to the 1-gram table and propose the 3 most frequent words.
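
The steps above can be sketched in Python as shown here. This is an illustration under stated assumptions, not the app's actual code. It assumes the search stops at the first table that returns a match, and that each table maps a context (the preceding words) to counts of the words that follow it. The names predict_next, top_words and tables, and the counts themselves, are made up for the example.

```python
from collections import Counter

def top_words(counter, k=3):
    """Return the k most frequent words in a Counter."""
    return [word for word, _ in counter.most_common(k)]

def predict_next(tokens, tables, k=3):
    """Simple back-off: try the 4-gram, 3-gram and 2-gram tables in turn.

    tables[n] maps a tuple of n-1 context words to a Counter of next words.
    Returns the proposed words and the order n of the table that was used.
    """
    for n in (4, 3, 2):
        context_len = n - 1
        if len(tokens) >= context_len:
            context = tuple(tokens[-context_len:])
            candidates = tables[n].get(context)
            if candidates:
                return top_words(candidates, k), n
    # Empty text, or no match in any higher-order table: use the 1-gram table.
    return top_words(tables[1][()], k), 1

# Tiny made-up tables, only to show the behaviour.
tables = {
    1: {(): Counter({"the": 5, "to": 4, "and": 3, "a": 2})},
    2: {("new",): Counter({"york": 3, "year": 2, "car": 1})},
    3: {("happy", "new"): Counter({"year": 4, "home": 1})},
    4: {},
}

print(predict_next(["happy", "new"], tables))  # (['year', 'home'], 3)
print(predict_next(["brand", "new"], tables))  # (['york', 'year', 'car'], 2)
print(predict_next([], tables))                # (['the', 'to', 'and'], 1)
```

Note that a matching table may hold fewer than 3 candidates, as in the first call. This sketch returns what it finds rather than topping up from a lower-order table.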

4.- Training & Predicting Database

  • Initially we created several random samples of the data (5%, 10%, 20%, 30%, 40% and 50%) and built the 4 N-gram tables for each one. The response times and memory limits of shinyapps.io allow us to use only one: 5%.
  • We create an initial corpus from the sample.
  • We clean the corpus by removing punctuation marks, symbols, separators, Twitter-specific elements, scripts, etc. We also apply a profanity filter.
  • We then tokenize the corpus, creating 4 N-gram tables (1-grams to 4-grams).
  • We apply these same transformations to the text provided every time, before running the prediction algorithm.
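
A rough Python sketch of the cleaning and tokenizing steps is shown below. It is not the pipeline used by the app. The regular expressions, the lowercasing step, the treatment of URLs, handles and hashtags as the Twitter-specific elements, and the placeholder PROFANITY list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical placeholder list; a real filter would load a full word list.
PROFANITY = {"darn", "heck"}

def clean(text, banned=PROFANITY):
    """Normalize a text and return its list of tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)       # drop handles and hashtags
    text = re.sub(r"[^a-z' ]+", " ", text)     # drop punctuation, digits, symbols
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in banned]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = clean("Check THIS out!! http://t.co/abc @user #fun it's darn good.")
print(tokens)                   # ['check', 'this', 'out', "it's", 'good']
print(ngram_counts(tokens, 2))  # one count per adjacent word pair
```

Running the same clean function on the user's input before prediction keeps the query tokens consistent with the tokens stored in the tables.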

5.- App NextWord

  • We load the 4 N-gram tables for the 5% sample, the only one we can use on shinyapps.io.
  • You can paste a text or type it directly into the text field. The prediction is made only when the text ends with a blank space, which tells us that the last word has been written completely.
  • The 3 most frequent words are offered through 3 dynamically built buttons. When you press one of them, the app adds the word to the text automatically and makes the next prediction.
  • On the right we show information about the whole process: sample, prediction, N-gram used, etc. At the end appear the transformed text that we use in the algorithm and the full text with the best prediction.
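
The input rule and the button behaviour can be sketched as two small Python functions. These are hypothetical helpers written for illustration, not the app's code. The sketch assumes that an empty text field also triggers a prediction (from the 1-gram table), and that pressing a button appends the word followed by a blank space so that the next prediction fires immediately.

```python
def ready_to_predict(text):
    """Predict only for empty input or when the last word is complete."""
    return text == "" or text.endswith(" ")

def accept_suggestion(text, word):
    """Append the chosen word plus a blank space to the current text."""
    return text + word + " "

text = "I love "
print(ready_to_predict("I lov"))   # False: the last word is still being typed
print(ready_to_predict(text))      # True
text = accept_suggestion(text, "you")
print(repr(text))                  # 'I love you '
print(ready_to_predict(text))      # True: the next prediction can run
```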