MOOC Data Science Predictor! - This application is all about my genius crystal ball!

Velladurai Balakrishnan
22 April 2016

Introduction

This peer-assessed presentation deck was created as part of the Coursera Data Science Capstone Project. The project has two parts. First, we need to create a Shiny application and deploy it on RStudio's servers. Second, we use Slidify or RStudio Presenter to prepare a reproducible pitch presentation about the application. This presentation addresses the second part of the course project.

  • The app developed for the first part of the assignment is available at:

https://durai.shinyapps.io/Capstone/

MOOC Data Science Predictor!

This app is designed to predict the most likely next word based on the frequency of occurrence in the n-grams.

The app uses 2-grams to 5-grams, drawn from approximately 12k lines of English words and sentences from Twitter, news articles, and blogs.
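As a rough illustration of how such n-grams can be counted from a tokenized line, here is a minimal sketch; make_ngrams is a hypothetical helper, not the app's actual code:

```r
# Illustrative n-gram counting from a tokenized line; make_ngrams is a
# hypothetical helper, not the app's actual code.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

tokens <- c("the", "cat", "in", "the", "hat")
table(make_ngrams(tokens, 2))  # frequency table of 2-grams
```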

Upon loading the datasets, a series of cleaning steps was applied to the data: all punctuation, numbers, and profanity were removed; the only symbol spared was the apostrophe (').
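A minimal sketch of that kind of cleaning, assuming base-R string handling; clean_text is illustrative, not the app's actual pipeline (profanity filtering is omitted):

```r
# Minimal cleaning sketch, assuming base-R string handling; clean_text is
# illustrative, not the app's pipeline (profanity filtering omitted).
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[0-9]+", " ", x)      # drop numbers
  x <- gsub("[^a-z' ]", " ", x)    # keep only letters, apostrophes, spaces
  x <- gsub("\\s+", " ", x)        # collapse repeated whitespace
  trimws(x)
}

clean_text("The CAT, in the ?")  # "the cat in the"
```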

The app requires loading approximately 0.5 GB of data, which posed a huge challenge for loading, cleaning, and analysing during the initial stage. Creating the n-grams and the corpus was also a challenge. I hope the app meets the requirements of the module.

How to use this app?

The app is easy to use. The steps are as follows (a UI sketch appears after the list):

1. Choose the English language from a dropdown menu.

2. Choose either 'Traditional' or 'Improved Version' mode.

a: Traditional mode requires the user to key in a phrase and click the 'Click for next word!' button. The app then suggests the most suitable word based on frequencies from the datasets.

b: Improved Version mode requires the user to key in a phrase; the app automatically suggests the most suitable word based on frequencies from the datasets.
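For illustration, a hypothetical Shiny UI with the controls described above might look like the sketch below; the widget ids, labels, and the placeholder server are assumptions, not the deployed app's source:

```r
# Hypothetical Shiny sketch mirroring the controls described above; widget
# ids, labels, and the placeholder server are assumptions, not the app's code.
library(shiny)

ui <- fluidPage(
  selectInput("lang", "Language", choices = c("English")),
  radioButtons("mode", "Mode", choices = c("Traditional", "Improved Version")),
  textInput("phrase", "Enter a phrase"),
  actionButton("go", "Click for next word!"),   # used in Traditional mode
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (input$mode == "Traditional") req(input$go)  # wait for a click
    # A real server would call the n-gram predictor here; this is a stub.
    paste("Input received:", input$phrase)
  })
}

# shinyApp(ui, server)
```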

Algorithm used behind the scenes

The Stupid Backoff algorithm by Brants et al. (2007) was used because it loads fast and performs nearly as well as Kneser-Ney smoothing. Given the user input "The CAT, in the ?", the algorithm does the following (a code sketch follows the steps):

* Cleans up and standardizes the input, turning it into: "the cat in the".

* Checks the 5-gram data for all occurrences of "the cat in the *", where * denotes any word. Similarly, checks the 4-gram data for "cat in the *", the 3-gram data for "in the *", and the 2-gram data for "the *". Makes a list of all the * candidate words.

* For each candidate word, finds the maximum likelihood estimate (MLE) in the corresponding n-gram and computes an overall score (with \( \alpha \) = 0.4; each backoff step from the 5-gram multiplies the MLE by \( \alpha \)), producing a score table. Below are the first 3 rows of an example score table, which show that hat followed 100% of the "the cat in the" instances in the 5-gram data.

| nextword | n5.MLE | n4.MLE | n3.MLE | n2.MLE | score |
|----------|--------|--------|--------|--------|-------|
| hat      | 100    | 100    | 0      | 0      | 100.0 |
| unk      | 0      | 0      | 3      | 4      | 0.5   |
| first    | 0      | 0      | 2      | 1      | 0.3   |

* Removes any unk rows (unk was a placeholder for words that appeared only once).

* Outputs the word with the top score. If multiple words share the top score, one is picked at random. If the user turns on safe mode and the output word is a profanity, the output is censored.
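A minimal sketch of this scoring logic, assuming the n-gram counts live in per-order data frames with prefix/nextword/count columns; this layout and the function are assumptions, not the app's code:

```r
# Stupid Backoff sketch (after Brants et al., 2007). Assumes ngram_counts
# is a list indexed by n-gram order (2..5), each element a data frame with
# columns: prefix, nextword, count. This layout is an assumption.
stupid_backoff <- function(prefix_tokens, ngram_counts, alpha = 0.4) {
  scores <- list()
  n_max <- length(prefix_tokens) + 1            # highest n-gram order to try
  for (n in seq(n_max, 2)) {                    # back off towards 2-grams
    prefix <- paste(tail(prefix_tokens, n - 1), collapse = " ")
    hits <- ngram_counts[[n]][ngram_counts[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      mle <- hits$count / sum(hits$count)       # MLE within this order
      penalty <- alpha ^ (n_max - n)            # one alpha per backoff step
      for (i in seq_len(nrow(hits))) {
        w <- hits$nextword[i]
        # keep the score from the highest order where the word was seen
        if (is.null(scores[[w]])) scores[[w]] <- penalty * mle[i]
      }
    }
  }
  sort(unlist(scores), decreasing = TRUE)       # best candidate first
}

# e.g. stupid_backoff(c("the", "cat", "in", "the"), ngram_counts)
# would score "hat" highest if it dominates the 5-gram matches.
```

Keeping each word's score from the highest order where it appears, and discounting by one factor of \( \alpha \) per backoff step, reproduces the score pattern in the example table above.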