Capstone Project - Data Science Specialization

Klever Mera

This app was created as the Capstone Project for the Data Science Specialization offered by Johns Hopkins University on Coursera.

This presentation explains how the app was designed and how to use it to predict the next word using natural language processing (NLP).

Objective

This project builds a data product that predicts the next word with a predictive algorithm and delivers it as a Shiny web application.

There were 8 tasks defined:

  • Understanding the problem
  • Data acquisition and cleaning
  • Exploratory analysis
  • Statistical modeling
  • Predictive modeling
  • Creative exploration
  • Creating a data product
  • Creating a short slide deck pitching your product

This slide deck presents the model and the instructions for using the app.

Modeling

Modeling starts with some preparatory tasks:

  • Tokenization: segmenting the text into individual words, done with the quanteda library. Cleaning tasks were also applied: removing numbers, punctuation, symbols, and hyphens; lemmatization; stemming; and making sure the language is English.
  • Word frequencies: three n-gram models (1-gram, 2-gram, and 3-gram).
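
The preparation steps above can be sketched in a few lines. This is only an illustration in Python, not the quanteda code used for the app. The function names `tokenize` and `ngram_counts` are hypothetical, and the sketch skips lemmatization, stemming, and the English-language check.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only alphabetic words (apostrophes allowed)."""
    # Numbers, punctuation, symbols, and hyphens never match the pattern,
    # so they are dropped; a hyphenated word is split into its parts.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = tokenize("I'd like 2 cups of well-made tea, and 1 cup of coffee!")
unigrams = ngram_counts(tokens, 1)  # 1-gram frequencies
bigrams = ngram_counts(tokens, 2)   # 2-gram frequencies
trigrams = ngram_counts(tokens, 3)  # 3-gram frequencies

print(tokens)
print(unigrams.most_common(3))
```

Each n-gram is stored as a tuple of words, so the same counting function serves all three models.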

For Modeling:

  • Markov assumption: the next word depends only on the most recent words (the previous two for the 3-gram model, the current word for the 2-gram model), not on the whole sentence.
  • Back-off model: start with the 3-gram; if there isn't enough evidence, back off to the 2-gram and then to the 1-gram. Only the blog file was used, and it was sampled using a binomial distribution. Words with a frequency of 1 were removed.
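
The modeling choices above can be illustrated with a minimal Python sketch. It is not the app's actual code, and it rests on several assumptions: the names `sample_lines`, `build_model`, and `predict` are hypothetical, binomial sampling is read as one independent coin flip per line, the frequency-1 filter is applied to every n-gram order, and "not enough evidence" is read as "fewer than 3 candidates found so far".

```python
import random
from collections import Counter, defaultdict


def sample_lines(lines, keep_prob, seed=0):
    """Keep each line independently with probability keep_prob (one coin flip per line)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < keep_prob]


def build_model(lines, max_n=3, min_count=2):
    """Map each context (a tuple of 0 to max_n - 1 words) to a Counter of next words."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    model = defaultdict(Counter)
    for gram, count in counts.items():
        if count >= min_count:  # drop n-grams that were seen only once
            model[gram[:-1]][gram[-1]] = count
    return model


def predict(model, text, k=3, max_n=3):
    """Return up to k next-word candidates, longest context first, then backing off."""
    words = text.lower().split()
    suggestions = []
    for n in range(max_n, 0, -1):
        context = tuple(words[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # not enough typed words for this n-gram order
        for word, _ in model.get(context, Counter()).most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions


corpus = [
    "i love green tea",
    "i love green tea",
    "i love green apples",
    "i love black coffee",
    "we love green tea",
]
model = build_model(sample_lines(corpus, keep_prob=1.0))
print(predict(model, "i love green"))   # first suggestion is "tea" (3-gram match)
print(predict(model, "we love black"))  # no 3-gram or 2-gram match, falls back to 1-grams
```

With a real corpus, `keep_prob` would be well below 1 so that only a fraction of the blog file is loaded, which keeps the n-gram tables small enough for a web app.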

Instructions to use the app

  1. Wait until the error message appears (10 to 20 seconds).
  2. Write only English words.
  3. The app will present 3 predicted words, as a mobile app would.

Notes:

The app starts predicting as soon as the first word is entered.

You can keep writing as long as you want.

The error message appears to tell you that the app is ready and to remind you to write only English words.

Additional Information