Coursera Data Science Capstone Project

Nikhil Prakash
May-19

This application uses NLP (Natural Language Processing) to predict the next word.
The capstone is a collaboration between Coursera and SwiftKey.

Introduction:

The goal of this project is to create an application that predicts the next word in a phrase or sentence. The project demonstrates the ability to process and analyze large volumes of unstructured text using text mining techniques such as cleaning, sampling, and tokenization. As the final deliverable, we develop an algorithm that predicts the next word in a given text, similar to the predictive text features found on modern smartphones.

The following slides cover these topics:

Overview
Architecture: PredictNextWord
Application User Interface
Future Possibilities & Conclusion

Overview

The data came from the HC Corpora collection and consists of three files (blogs, news, and Twitter). It was provided by SwiftKey.
The major tasks in this project were:
– Obtain the data, understand the problem, and clean the data accordingly.
– Perform exploratory analysis.
– Tokenize the text and apply a predictive algorithm.
– Create an interactive application using Shiny.
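The cleaning, sampling, and tokenization tasks listed above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the function names `sample_lines` and `clean_and_tokenize`, the sampling fraction, and the specific cleaning rules (lowercasing, dropping URLs, digits, and punctuation) are all assumptions made for the example.

```python
import random
import re


def sample_lines(lines, fraction, seed=42):
    """Keep roughly `fraction` of the lines; the seed makes the sample reproducible."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean_and_tokenize(line):
    """Lowercase a line, strip URLs and non-letter characters, and split into words."""
    text = line.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z' ]+", " ", text)     # keep only letters, apostrophes, spaces
    return text.split()


if __name__ == "__main__":
    # Hypothetical input lines standing in for the blogs/news/Twitter files.
    corpus = [
        "Hello, World! Visit http://example.com now 123",
        "Can't wait for the weekend!!!",
    ]
    for line in sample_lines(corpus, fraction=1.0):
        print(clean_and_tokenize(line))
```

Sampling before cleaning keeps the expensive text processing limited to the subset that will actually be used to build the model.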
NLP (N-gram dictionary)
– For the initial exploration, the analyst needs to construct a dictionary of unigrams, bigrams, trigrams, and four-grams, collectively called n-grams.
– Unigrams are one-word phrases, bigrams are two-word phrases, trigrams are three-word phrases, and four-grams are four-word phrases.
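As an illustration of the n-gram dictionary described above, the sketch below counts unigrams through four-grams from already tokenized sentences. It is a Python example under assumed names (`ngrams`, `build_ngram_counts`) and toy data, not the implementation used in the project.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all consecutive n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_counts(sentences, max_n=4):
    """Build one frequency table per n-gram order, from 1 up to max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    # Toy tokenized sentences (illustrative only).
    sentences = [
        ["i", "love", "new", "york"],
        ["i", "love", "pizza"],
    ]
    counts = build_ngram_counts(sentences)
    print(counts[2].most_common(3))  # most frequent bigrams
```

Sentences shorter than n simply contribute no n-grams of that order, so no special-casing is needed for short inputs.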

Architecture: PredictNextWord

The application uses text documents collected from blogs, news articles, and Twitter to statistically model language patterns. N-grams are used to predict the next word.

The 'PredictNextWord' Shiny app is a basic application that demonstrates how the prediction model works. It supports English only.

The user enters a word, phrase, or sentence in the input box and presses the space bar to get the most probable next word.
The next word predicted by the model is displayed on the right side of the application, along with the type of n-gram (Bigram, Trigram, Quadgram) used in the search.
The n-gram type is obtained from the n-gram matrices by comparing the input with the tokenized frequencies of 2-, 3-, and 4-gram sequences.
As text is entered, the field with the predicted next word refreshes instantaneously, and the predicted word is offered to the user as a choice.
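One plausible way to realize the lookup described above is a simple back-off: try the longest available context first (quadgram), then fall back to trigram and bigram. The slides do not spell out the exact strategy, so the Python sketch below is an assumption; the names `build_model`, `predict_next_word`, and `NGRAM_NAMES` are hypothetical.

```python
from collections import Counter, defaultdict

# Labels shown alongside the prediction, matching the types named in the slides.
NGRAM_NAMES = {2: "Bigram", 3: "Trigram", 4: "Quadgram"}


def build_model(sentences, max_n=4):
    """Map each context (the first n-1 words of an n-gram) to counts of the word that follows."""
    model = defaultdict(Counter)
    for tokens in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model


def predict_next_word(model, text, max_n=4):
    """Return (word, ngram_type), backing off from the longest context to the shortest."""
    tokens = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue  # not enough input words for this n-gram order
        context = tuple(tokens[-(n - 1):])
        followers = model.get(context)
        if followers:
            word, _count = followers.most_common(1)[0]
            return word, NGRAM_NAMES.get(n, f"{n}-gram")
    return None, None  # no matching context in any table


if __name__ == "__main__":
    # Toy training data (illustrative only).
    sentences = [
        ["i", "love", "new", "york"],
        ["i", "love", "new", "york"],
        ["i", "love", "pizza"],
    ]
    model = build_model(sentences)
    print(predict_next_word(model, "I love new"))
    print(predict_next_word(model, "we love"))
```

Returning the n-gram order together with the word mirrors the app's display of which table produced the prediction. An unseen input yields `(None, None)`, which a user interface would need to handle, for example by showing no suggestion.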

Application User Interface

Screenshot of the PredictNextWord application user interface.

Future Possibilities & Conclusion

Areas of improvement:
– UI design of the app.
– Input data validation.
– Increase the sample size for more relevant predictions.
– Add a feedback loop so the model can learn from earlier predictions.
Conclusion:
– This project involved a lot of research in data pre-processing, text modeling, and NLP.
– All the skills gained throughout the specialization were used in this project.
– The entire specialization was fun to learn and required a great deal of research, which definitely increased my knowledge.