Coursera Data Science Specialization Capstone Project

Ghazal Pasha
7/20/2018

The goal of this project is to build a model that predicts the next word.

Predicting the Next Word

In this project, corpora collected from Twitter, blogs, and news sources are used to build a predictive text model.

A Shiny application is built to demonstrate this prediction model.

You can find the application here:

https://ghazalp.shinyapps.io/PredictNextWord/

Steps

The model is built with natural language processing methods, in the following steps:

Get and clean the data
Tokenize the text
Explore the data and understand its features
Build the N-grams
Build the predictive model
Test the model
Optimize the model to balance run time and memory usage
Build the application to demonstrate the model
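
As an illustration of the cleaning and tokenization steps, here is a minimal sketch in Python. It is not the project's own code: the function name `tokenize` and the cleaning rules (lowercasing, keeping only alphabetic words and contractions) are assumptions chosen for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens.

    Assumed cleaning rule for this sketch: keep runs of letters,
    optionally joined by apostrophes (e.g. "don't"); drop digits,
    punctuation, and other symbols.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

tokens = tokenize("Hello, World! Don't stop.")
```

A real pipeline would likely also handle URLs, profanity filtering, and sentence boundaries; those are left out here to keep the sketch short.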

Algorithm

First, a sample of the dataset is taken. The data is cleaned and tokenized, and N-grams are built and ranked by how frequently they occur.
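
Counting N-grams by frequency can be sketched as follows. This is an illustrative Python example under assumptions, not the project's implementation; the helper name `ngram_counts` is hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the cat".split()
bigrams = ngram_counts(tokens, 2)
# Most frequent bigrams first
top = bigrams.most_common(1)
```

Each N-gram is stored as a tuple so that it can serve as a dictionary key, and `most_common` returns the entries sorted by count.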

Next, the input phrase is read and tokenized. The input is compared with the N-grams according to the length of the phrase. For example, a two-word phrase is compared with trigrams first and then with bigrams at the next level.

The most frequent next words are added to the list of suggestions. If the phrase has not been seen before, the most common words are suggested.

The suggestion list is then sorted by probability, and the top three suggestions for the next word are shown in the output.
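
The lookup described above can be sketched as a simple back-off search. The Python code below is a hedged illustration, not the application's actual code: the names `build_model` and `predict` are hypothetical, and the sketch ranks suggestions by back-off level and then by raw count, which is a simplification of sorting by probability.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=3):
    """Map each context (0 to max_n - 1 preceding words) to counts of the next word."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, next_word = tokens[i:i + n]
            model[tuple(context)][next_word] += 1
    return model

def predict(model, phrase, k=3, max_n=3):
    """Return up to k suggestions, trying the longest context first."""
    words = phrase.lower().split()
    suggestions = []
    # Back off from the longest usable context down to the empty context,
    # whose counts are simply the most common words overall.
    for size in range(min(len(words), max_n - 1), -1, -1):
        context = tuple(words[len(words) - size:])
        for word, _count in model.get(context, Counter()).most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions

tokens = "i love you i love you i love cake you are here".split()
model = build_model(tokens)
seen = predict(model, "I love")
unseen = predict(model, "zzz qqq")
```

For a phrase that appears in the sample, the trigram level supplies the first suggestions; for an unseen phrase, every context lookup fails until the empty context, so the most common words are returned.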

Application Instructions and Preview

To use the app, start typing in English, and the top three suggestions for the next word will appear.

You can find the application here:

https://ghazalp.shinyapps.io/PredictNextWord/