Coursera Data Science Capstone Project

Nikhil Prakash
May-19

  • This application uses Natural Language Processing (NLP) to predict the next word.
  • The capstone is a collaboration between Coursera and SwiftKey.

Introduction:

The goal of this project is to create an application that predicts the next word in a phrase or sentence. The project demonstrates the ability to process and analyze large volumes of unstructured text using text mining techniques such as cleaning, sampling, and tokenization. As the final deliverable, we develop an algorithm that predicts the next word in a given text, similar to the predictive text features found on modern smartphones.

The following slides cover these topics:

  • Overview
  • Architecture: PredictNextWord
  • Application User Interface
  • Future Possibilities & Conclusion

Overview

  • The data came from the HC Corpora and consists of three files (Blogs, News, and Twitter). It was provided by SwiftKey.

  • The major tasks in this project were:
    – Obtain the data, understand the problem, and clean the data accordingly.
    – Perform exploratory analysis.
    – Tokenize the text and apply a predictive algorithm.
    – Create an interactive application using Shiny.

  • NLP (N-gram dictionary)
    – For initial exploration, the data analyst needs to construct a dictionary of unigrams, bigrams, trigrams, and four-grams, collectively called n-grams.
    – Unigrams are one-word phrases, bigrams are two-word phrases, trigrams are three-word phrases, and four-grams are four-word phrases.
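As an illustration of the n-gram dictionary idea, here is a minimal sketch in Python. It is not the code used in the app (which was built with Shiny); the function names `tokenize` and `ngram_counts`, the cleaning rule, and the three sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes.

    This is a deliberately simple cleaning rule chosen for the sketch.
    """
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(lines, n):
    """Count every n-word sequence without crossing line boundaries."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Made-up sample text standing in for the blogs, news, and Twitter files.
sample = [
    "Thanks for the follow",
    "thanks for the great day",
    "Looking forward to the weekend",
]

unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
fourgrams = ngram_counts(sample, 4)

print(bigrams.most_common(3))
```

Each dictionary maps a tuple of words to the number of times that sequence occurs, which is the frequency information a next-word model can draw on.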

Architecture: PredictNextWord

The application uses text documents collected from blogs, news articles, and Twitter to statistically model language patterns. N-grams are used to predict the next word.

The 'PredictNextWord' Shiny app is a basic application that demonstrates how the prediction model works. It supports English only.

  • The user enters a word, phrase, or sentence in the input box and presses the space bar to get the most probable next word.
  • The word predicted by the model is displayed on the right side of the application, along with the type of n-gram (Bigram, Trigram, Quadgram) used in the search.
  • The n-gram type is determined by comparing the tokenized input against the frequency matrices of 2-, 3-, and 4-gram sequences.
  • As text is entered, the predicted-word field refreshes instantly and offers the predicted word for the user to choose.
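The lookup described above can be sketched as a simple back-off search: try the longest context first (quadgram), then fall back to the trigram and bigram tables. This is a Python illustration only. The slides do not spell out the app's exact algorithm, so the back-off order, the function names `build_tables` and `predict_next_word`, and the toy corpus are assumptions.

```python
import re
from collections import Counter, defaultdict

NGRAM_NAMES = {2: "Bigram", 3: "Trigram", 4: "Quadgram"}

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def build_tables(lines, max_n=4):
    """For each n, map an (n-1)-word context to a Counter of following words."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next_word(text, tables):
    """Return (word, n-gram type), trying the longest context first."""
    tokens = tokenize(text)
    for n in sorted(tables, reverse=True):
        if len(tokens) < n - 1:
            continue  # not enough typed words for this n-gram size
        context = tuple(tokens[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            word, _count = followers.most_common(1)[0]
            return word, NGRAM_NAMES.get(n, f"{n}-gram")
    return None, None  # no matching context in any table

# Toy corpus standing in for the sampled training data.
corpus = [
    "thanks for the follow",
    "thanks for the follow back",
    "thanks for the great day",
    "see you at the beach",
]
tables = build_tables(corpus)

print(predict_next_word("Thanks for the", tables))
print(predict_next_word("enjoy the", tables))
```

With this toy corpus, the first call finds a match in the quadgram table, while the second has no trigram match and falls back to the bigram table. Returning the n-gram type alongside the word mirrors what the app shows next to its prediction.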

Application User Interface

Future Possibilities & Conclusion

  • Areas of improvement:
    – UI design of the app.
    – Input data validation.
    – Increase the sample size for more relevant predictions.
    – Add a feedback loop so the model learns from earlier predictions.

  • Conclusion:
    – This project involved a lot of research in data pre-processing, text modeling, and NLP.
    – All the skills gained throughout the specialization were used in this project.
    – The entire specialization was fun to learn and required a great deal of research, which definitely increased my level of knowledge.