Coursera Data Science Capstone

Adham
February 17, 2017

The Project's Introduction

The goal is to build an app around a prediction algorithm I built. The algorithm uses an n-gram NLP model (unigrams, bigrams and trigrams) to predict the next word from the preceding words, and the app gives it an interface that others can use. The model was trained on samples of Twitter feeds, blogs and news.

It has the following features:

Speed
Accuracy
Efficiency

The Project App Interface

The interface is simple: the user enters text in a text box, the sidebar panel contains a brief description of the app, and the main panel displays the predicted word.


The Algorithm (1/2)

Calculating unigram probabilities:

\[ P(w_i) = \frac{\text{count}(w_i)}{\text{total number of words}} \]

Probability of word i = [frequency of word i in our corpus] / [total number of words in our corpus]

Calculating bigram probabilities:

\[ P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})} \]

Probability that word i-1 is followed by word i = [number of times we saw word i-1 followed by word i] / [number of times we saw word i-1]
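
The unigram and bigram estimates can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual code: the toy corpus and the function names `unigram_prob` and `bigram_prob` are assumptions made for this example.

```python
from collections import Counter

# Toy corpus, made up for illustration only.
tokens = "i like tea and i like coffee".split()

unigram_counts = Counter(tokens)                   # count(w_i)
bigram_counts = Counter(zip(tokens, tokens[1:]))   # count(w_{i-1}, w_i)
total_words = len(tokens)


def unigram_prob(word):
    """P(w_i) = count(w_i) / total number of words."""
    return unigram_counts[word] / total_words


def bigram_prob(prev_word, word):
    """P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    if unigram_counts[prev_word] == 0:
        return 0.0  # unseen context: no estimate is available
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]


print(unigram_prob("like"))        # "like" is 2 of the 7 tokens
print(bigram_prob("i", "like"))    # "i" is always followed by "like" here
print(bigram_prob("like", "tea"))  # "like" is followed by "tea" half the time
```

Both estimates are plain relative frequencies, so any word pair that never occurs in the corpus gets probability zero.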

The Algorithm (2/2)

Calculating trigram probabilities:

Building off the logic in bigram probabilities,

\[ P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{count}(w_{i-2}, w_{i-1}, w_i)}{\text{count}(w_{i-2}, w_{i-1})} \]

Probability that word i-2 followed by word i-1 is followed by word i = [number of times we saw the three words in order] / [number of times we saw word i-2 followed by word i-1]
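
The trigram estimate follows the same counting pattern. The sketch below is only an illustration under assumptions: the toy corpus and the name `trigram_prob` are hypothetical and do not come from the app.

```python
from collections import Counter

# Toy corpus, made up for illustration only.
tokens = "i like tea and i like coffee and i like tea".split()

bigram_counts = Counter(zip(tokens, tokens[1:]))
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))


def trigram_prob(w1, w2, w3):
    """P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)."""
    context_count = bigram_counts[(w1, w2)]
    if context_count == 0:
        return 0.0  # the two-word context was never seen
    return trigram_counts[(w1, w2, w3)] / context_count


# "i like" occurs three times: twice before "tea", once before "coffee".
print(trigram_prob("i", "like", "tea"))
print(trigram_prob("i", "like", "coffee"))
```

Longer contexts give sharper predictions but are seen less often, which is why an n-gram model needs a fallback for unseen contexts.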

Stupid backoff explanation reference: https://gist.github.com/ttezel/4138642
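
As a rough sketch of how stupid backoff can combine the three n-gram levels: use the trigram relative frequency when the trigram was seen, otherwise fall back to the bigram and then the unigram, multiplying by a fixed factor at each step. The deck does not show the app's implementation, so everything here is an assumption for illustration: the toy corpus, the names `score` and `predict_next`, and the factor `ALPHA = 0.4` (a value commonly used for stupid backoff). The scores are not normalized probabilities.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the app's actual value is not stated

# Toy corpus, made up for illustration only.
tokens = "i like tea and i like coffee and you like tea".split()

uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)


def score(w1, w2, word):
    """Stupid backoff score of `word` following the context (w1, w2)."""
    if tri[(w1, w2, word)] > 0:
        # Trigram seen: plain relative frequency.
        return tri[(w1, w2, word)] / bi[(w1, w2)]
    if bi[(w2, word)] > 0:
        # Back off to the bigram, discounted once.
        return ALPHA * bi[(w2, word)] / uni[w2]
    # Back off to the unigram, discounted twice.
    return ALPHA * ALPHA * uni[word] / total


def predict_next(w1, w2):
    """Return the vocabulary word with the highest score for the context."""
    return max(sorted(uni), key=lambda word: score(w1, w2, word))


print(predict_next("you", "like"))  # trigram "you like tea" was seen
print(predict_next("we", "like"))   # "we" is unseen, so the bigram level decides
```

Because no normalization is needed, scoring is a handful of dictionary lookups per candidate word, which suits an interactive app.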