Coursera's Data Science Specialization Capstone Milestone Report

Arturo Cardenas
August 23rd, 2015

About

Throughout Coursera's Data Science Specialization we built up our data science toolkit, from learning how to install R to creating a Natural Language Processing model. This app is the cherry on top: it is the Specialization's Capstone project, and it emulates the behavior of predictive text tools such as SwiftKey.

Link to Shiny App

This app is the result of intensive data processing with KNIME, model development in R, and the conveniences of RStudio and shinyapps.

Under the hood

At the back end of the app there are three frequency tables, one each for 2-grams, 3-grams and 4-grams.
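To make the idea of a frequency table concrete, here is a minimal sketch in Python (not the app's own R code, and not the KNIME workflow that produced the real tables). The function name build_table, the toy sentence and the nested "context, then next word, then count" layout are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_table(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        *context, next_word = tokens[i:i + n]
        table[tuple(context)][next_word] += 1
    return table

# Toy corpus, invented for illustration only.
tokens = "i love data and i love r and i like data".split()

# One table per N-gram size, as in the app: 2-grams, 3-grams and 4-grams.
tables = {n: build_table(tokens, n) for n in (2, 3, 4)}

print(tables[2][("i",)].most_common(1))   # [('love', 2)]
print(dict(tables[3][("and", "i")]))      # {'love': 1, 'like': 1}
```

Storing counts per context means that finding the most frequent next word is a single lookup followed by a maximum, which is roughly what a keyed table lookup gives you in R as well.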

  1. The app converts your input to a vector of length 3, padding it with NAs when needed, e.g. c(NA, NA, "I").
  2. Using the data.table package, it finds the next word in the N-gram tables using a hierarchical approach:
  • First it uses the 4-gram table to look for the next word
  • If there are no results, it falls back to the 3-gram table
  • If there are no results, it falls back to the 2-gram table
  • Extra step: when there are still no results, it moves to the 2nd (or 3rd) to last word to find a match in the tables
  • Finally, it returns the most frequent next word
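
The lookup described above can be sketched as follows. This is a Python illustration under stated assumptions, not the app's R/data.table implementation: the names pad_context and predict_next, the nested dictionary layout and the toy counts are all hypothetical, and None stands in for R's NA. The "extra step" is left out because the text does not describe it precisely enough to reproduce.

```python
def pad_context(words, size=3):
    """Keep the last `size` words, left-padding with None (standing in for NA)."""
    tail = list(words)[-size:]
    return [None] * (size - len(tail)) + tail

def predict_next(words, tables):
    """Try the longest context first, then back off to shorter ones."""
    context = pad_context(words)
    for k in (3, 2, 1):                  # 4-gram, 3-gram, then 2-gram table
        key = tuple(context[-k:])
        if None in key:
            continue                     # too few typed words for this table
        candidates = tables[k + 1].get(key)
        if candidates:
            # Return the most frequent next word for this context.
            return max(candidates, key=candidates.get)
    return None                          # no table had a match

# Toy frequency tables, invented for illustration: context -> {next word: count}.
toy_tables = {
    4: {("one", "of", "the"): {"most": 5, "best": 3}},
    3: {("of", "the"): {"world": 4, "year": 2}},
    2: {("the",): {"first": 7, "same": 6}},
}

print(pad_context(["I"]))                                   # [None, None, 'I']
print(predict_next(["one", "of", "the"], toy_tables))       # most
print(predict_next(["part", "of", "the"], toy_tables))      # world
print(predict_next(["the"], toy_tables))                    # first
```

In this sketch a context that contains padding is simply skipped, so a one-word input goes straight to the 2-gram table.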

How to use this app

  1. Wait a couple of seconds until the app is fully loaded
  2. Once NA is displayed, the app is ready to use
  3. Start typing in the text box and see how the tool predicts the next word
  4. Please note that this tool only works in English

How KNIME saved my life

I used KNIME to pre-process the corpus. Once I understood the problem, I realized that the core work was going to be creating the best N-grams possible. For this task I created the following workflow, which helped me process the entire corpus in just a couple of hours.

KNIME is “VBA macros on steroids”!