Datascience_Capstone

Amit Shinde
December 2016

Overview

The purpose of this Capstone project was to create a Shiny app that takes a word or phrase as input and predicts the next word.

The predictive text model is built on the HC Corpora corpus, which can be downloaded from this website.

Predicting the next word from an input text or phrase is a common application of Natural Language Processing (NLP). For example, it can be seen in action when typing on a phone keyboard or searching the web.

Prediction Model

Reads data from the blogs, Twitter and news feed datasets.

Cleans the data by removing special characters, extra whitespace, numbers, punctuation and stopwords, converting the text to lowercase, etc.

Generates unigram, bigram, trigram and quadgram files based on how often each word or word sequence occurs.

Creates smaller datasets to speed up prediction.
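The cleaning and n-gram counting steps above can be sketched roughly as follows. This is a minimal Python illustration, not the app's actual code: the function names, the tiny stopword list and the two sample lines are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real list would be much longer.
STOPWORDS = {"a", "an", "the", "of", "to", "and"}

def clean_text(text, remove_stopwords=True):
    """Lowercase the text and drop digits, punctuation, special characters and stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    tokens = text.split()                  # split() also collapses repeated whitespace
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

def ngram_counts(token_lists, n):
    """Count every run of n consecutive tokens across all cleaned lines."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Made-up sample lines standing in for the blogs, Twitter and news data.
sample_lines = [
    "Thanks for the follow!!",
    "Thanks for the 2 great tips.",
]
cleaned = [clean_text(line) for line in sample_lines]

unigrams = ngram_counts(cleaned, 1)
bigrams = ngram_counts(cleaned, 2)
trigrams = ngram_counts(cleaned, 3)
quadgrams = ngram_counts(cleaned, 4)
```

Each table maps a tuple of words to its frequency, for example `bigrams[("thanks", "for")]`. Keeping only the most frequent entries of each table is one plausible way to obtain the smaller datasets mentioned above, though the source does not say how they were reduced.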

How Does This App Work?

It is as simple as typing a word or phrase and then clicking the “Submit” button.

The input text is cleaned.

The algorithm predicts the next word by consulting the quadgram, trigram, bigram and unigram data frames, in that order.

The execution time is calculated in milliseconds.

The predicted next words are listed as choices, ranked by prediction score. At most 5 words or phrases are shown.
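The lookup order described above amounts to a simple back-off: try the longest context first and fall back to shorter ones. Below is a minimal Python sketch of that idea, not the app's actual implementation. The toy tables, their counts and the function names are hypothetical, the input is assumed to be already cleaned into tokens, and the raw count is used as the prediction score, which is an assumption since the source does not define the score.

```python
import time

# Hypothetical toy tables: each maps a context (the preceding words) to
# candidate next words and their counts. Real tables would come from the corpus.
QUADGRAMS = {("new", "york", "city"): {"council": 5, "marathon": 2}}
TRIGRAMS = {("york", "city"): {"police": 9, "council": 5}}
BIGRAMS = {("city",): {"hall": 12, "police": 9}}
UNIGRAMS = {"time": 100, "day": 80, "good": 70, "love": 60, "people": 50, "year": 40}

def predict_next(tokens, max_results=5):
    """Try the longest context first, then back off to shorter ones."""
    for table, size in ((QUADGRAMS, 3), (TRIGRAMS, 2), (BIGRAMS, 1)):
        if len(tokens) >= size:
            candidates = table.get(tuple(tokens[-size:]))
            if candidates:
                # Rank candidates by count (the assumed prediction score).
                ranked = sorted(candidates, key=candidates.get, reverse=True)
                return ranked[:max_results]
    # No context matched: fall back to the most frequent single words.
    return sorted(UNIGRAMS, key=UNIGRAMS.get, reverse=True)[:max_results]

def timed_predict(tokens):
    """Return the predictions together with the elapsed time in milliseconds."""
    start = time.perf_counter()
    words = predict_next(tokens)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return words, elapsed_ms
```

For instance, `predict_next(["old", "york", "city"])` finds no quadgram context and falls through to the trigram table, while an unseen word such as `["zebra"]` falls all the way back to the most frequent unigrams.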

Conclusion

Shiny app snap

Click on this Shiny App Link
https://amitshinde.shinyapps.io/DataScience_Capstone/

The course was long but totally worth it; the credit goes to Roger D. Peng, Brian Caffo and Jeff Leek from JHU (Johns Hopkins University).