Data Science Capstone (Final Project Submission)

Mukesh Saxena
July 24,2018

Coursera Data Science Capstone Project
Source code :My Github
Shiny Apps :Prediction Words
R Pubs :Slide Deck

Project Assignment

Instructions

The goal of this exercise is to create a product to highlight the prediction algorithm that you have built and to provide an interface that can be accessed by others. For this project you must submit:

[1] A Shiny app that takes as input a phrase (multiple words) in a text box input and outputs a prediction of the next word.

[2] A slide deck consisting of no more than 5 slides created with R Studio Presenter pitching your algorithm and app as if you were presenting to your boss or an investor.

Algorithm Function - Flowchart

Word Prediction Flowchart

Flow

Functionality of the text prediction algorithm:
Step 1: Remove punctuation, foreign characters, and profanity.
Step 2: Search for match, if matches are found, skip to Step 4.
Step 3: Shorten input, calculate a penalty value, search again.
Step 4: Calculate score for each match found, sort results.

Algorithm Function - Probability Model

Markov Assumption

The algorithm built for this project is based on the n-grams model with the Markov assumption. This is due to the speed returning predicted words and memory restrictions of the shinyApps website. In other words, we seek to find the word (Wn) which maximizes the conditional probability of (Wn) given its history.

Flow

If we use the “Stupid Backoff” to find more matches we apply a penalty to the log probability score.

User Guideline & Conclusion

How to use the application?

[1] Enter the beginning of a sentence in the text area.
[2] Choose the predicted choices you wish to be display.
[3] Hit the “Analyze Text” button to display the sentence completion predicted.

Conclusions

The prediction accuracy of the model evaluated on an out-of-sample validation set is approximately 20%. This is quite good for a model based only on n-grams, but some improvements may be possible in the near future.