Data Science Capstone - Slide Deck

TizVic
2018-02-04

Motivation

This presentation describes the app created for the final course of the Coursera Data Science specialization.

The app, Word Oracle, predicts the next word based on the previous ones.

The app displays only the best match for the word prediction, as required by the assignment specification. If the predicted word is correct, it can be added to the typed text with a simple double click of the mouse.

The app is minimal: it is designed for use on smartphones and keeps the user interface and user experience as simple as possible. You can see a screenshot on page 5.

Making of

To create the app's final algorithm, I compared two very different approaches: a recurrent neural network (RNN) and an n-gram Backoff model.

On a sample of \( 1000 \) sentences taken from the dataset, the RNN approach achieved greater precision (\( 27.3\% \)) than the Backoff approach (\( 18.1\% \)), but it requires much more computing power. In this specific application the user experience matters more than precision, so the Backoff approach was chosen.
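A comparison of this kind can be set up as a top-1 evaluation: hide the last word of each sampled sentence, ask the model for its single best guess, and count the hits. The Python sketch below is only an assumption about how such a measurement might look, not the code behind the figures above; `top1_precision` and `toy_predict` are hypothetical names, and the toy model exists only to make the example runnable.

```python
from typing import Callable, Iterable, List, Optional


def top1_precision(sentences: Iterable[str],
                   predict: Callable[[List[str]], Optional[str]]) -> float:
    """Fraction of sentences whose last word equals the model's best guess."""
    hits = 0
    total = 0
    for sentence in sentences:
        words = sentence.lower().split()
        if len(words) < 2:
            continue  # need at least one context word and one target word
        context, target = words[:-1], words[-1]
        total += 1
        if predict(context) == target:
            hits += 1
    return hits / total if total else 0.0


# Hypothetical stand-in for a real model: guesses "you" after "thank".
def toy_predict(context: List[str]) -> Optional[str]:
    return "you" if context[-1] == "thank" else None


sample = ["Thank you", "thank them", "see you", "hello"]
score = top1_precision(sample, toy_predict)
print(f"{score:.3f}")
```

Because the function takes any `predict` callable, the same sample could be passed to both an RNN wrapper and a Backoff wrapper to compare them on equal terms.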

Algorithm description

When text is entered, the algorithm performs the following steps:

  • Clean the text: remove stopwords, expand contractions, remove extra spaces.
  • Calculate the probability for 2-grams and, where applicable, for 3-grams and 4-grams.
  • Weight the probabilities using the Good-Turing frequency-of-frequencies table.
  • Return the best match, which is then displayed.
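The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the app's actual code: the stopword and contraction lists are tiny placeholders, the backoff simply tries the longest context first, and the Good-Turing adjustment uses the raw formula c* = (c + 1) N_{c+1} / N_c, falling back to the raw count when N_{c+1} is zero. All function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Tiny placeholder lists (assumption); a real app would use much larger ones.
STOPWORDS = {"the", "a", "an"}
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}


def clean(text):
    """Lowercase, strip punctuation, expand contractions, drop stopwords."""
    words = []
    for token in text.lower().split():           # split() also removes extra spaces
        token = re.sub(r"[^a-z']", "", token)    # strip punctuation and digits
        token = CONTRACTIONS.get(token, token)   # e.g. "don't" -> "do not"
        words.extend(w for w in token.split() if w not in STOPWORDS)
    return words


def build_counts(sentences, max_n=4):
    """For n = 2..max_n, map each (n-1)-word context to a Counter of next words."""
    counts = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = clean(sentence)
        for n in counts:
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                counts[n][context][words[i + n - 1]] += 1
    return counts


def good_turing(count, freq_of_freq):
    """Adjusted count c* = (c + 1) * N_{c+1} / N_c, or c if N_{c+1} is 0."""
    n_c = freq_of_freq.get(count, 0)
    n_c1 = freq_of_freq.get(count + 1, 0)
    return (count + 1) * n_c1 / n_c if n_c and n_c1 else float(count)


def predict(text, counts):
    """Return the most probable next word, or None if no context matches."""
    words = clean(text)
    for n in sorted(counts, reverse=True):       # 4-gram, then 3-gram, then 2-gram
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1 or context not in counts[n]:
            continue                             # back off to a shorter context
        # Frequency-of-frequencies table for this n-gram order.
        freq_of_freq = Counter(
            c for followers in counts[n].values() for c in followers.values()
        )
        adjusted = {w: good_turing(c, freq_of_freq)
                    for w, c in counts[n][context].items()}
        total = sum(adjusted.values())
        probs = {w: a / total for w, a in adjusted.items()}
        return max(probs, key=probs.get)
    return None


corpus = [
    "I want to go home",
    "I want to go out",
    "I want to go home now",
    "You want to eat",
]
counts = build_counts(corpus)
print(predict("I want to go", counts))
```

On a corpus this small the raw Good-Turing estimate is noisy; implementations usually smooth the frequency-of-frequencies table first (for example with Simple Good-Turing) and precompute it once instead of on every call.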

App description

The Word Oracle app interface is very simple:

Screenshot of the Word Oracle app
In the first text box you can type or paste the text that will be used for word prediction. The second text box displays the most probable next word. The interface uses JavaScript to calculate the prediction while you type.