Data Science Capstone Project: Predict the Next Word

Arind
15/10/2018

Introduction

The aim of the project is to create a Shiny application that can predict, to a certain degree, the next word when a string of input is provided to it.

We are provided data from 3 sources viz. news, blogs and tweets.

Since the source data may contain many irregularities, we clean and process the data before creating a prediction model. A presentation covering the cleaning and interpretation of sample data is available here.
After cleaning and processing the data, a model is built around the backoff algorithm to predict the next word based on the most frequently occurring word.
Once the model can predict words with some relevance, we build a Shiny application to serve as the UI.
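The cleaning step is not spelled out above, so the following is only a minimal sketch of what such a step might look like. It is written in Python for illustration rather than in the language of the actual app, and the function name clean_text and the specific rules (lowercasing, dropping URLs, digits and punctuation) are assumptions, not the project's actual pipeline.

```python
import re

def clean_text(line: str) -> list[str]:
    """Lowercase a line and reduce it to a list of plain word tokens.

    The rules below are assumed for illustration; the real project
    may clean the data differently.
    """
    line = line.lower()
    line = re.sub(r"https?://\S+|www\.\S+", " ", line)  # drop URLs
    line = re.sub(r"[^a-z' ]+", " ", line)              # drop digits, punctuation, symbols
    tokens = [t.strip("'") for t in line.split()]       # trim stray quotes around words
    return [t for t in tokens if t]                     # discard empty leftovers

sample = "Check THIS out: https://example.com -- it's 100% free!!!"
tokens = clean_text(sample)
print(tokens)
```

Keeping the apostrophe inside words preserves contractions such as "it's", which matter for tweets and blog text.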

Basics of the Algorithm

An n-gram approach is taken to create the prediction algorithm: all the possible combinations of 2, 3, 4 and 5 words in a row (n-grams) are evaluated. The n-grams are then sorted by frequency, with the most frequent at the top.
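As a rough illustration of the n-gram counting described above, the sketch below builds frequency tables for 2- to 5-word sequences from a tiny token list. It is Python written for this example; the helper name count_ngrams and the sample sentence are hypothetical and not taken from the project.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy input standing in for the cleaned news, blog and tweet text.
tokens = "i love new york and i love new music".split()

# One frequency table per n-gram size, as in the text (n = 2, 3, 4, 5).
tables = {n: count_ngrams(tokens, n) for n in range(2, 6)}

# most_common() lists the n-grams from most to least frequent.
for ngram, freq in tables[2].most_common(3):
    print(" ".join(ngram), freq)
```

On a real corpus these tables grow large, which is one likely reason for the loading delay an app like this has at start-up.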

The algorithm takes the following steps to predict the next word:

Look for the last 4-gram in the sentence and its following word.
Look for the last 3-gram in the sentence and its following word.
Look for the last 2-gram in the sentence and its following word.

Get the most frequent prediction from all the n-grams or get a weighted prediction from all the n-grams.
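The steps above can be sketched as a simple backoff lookup. This is a minimal Python illustration under stated assumptions, not the app's actual code: build_model, predict_next and the toy corpus are hypothetical, and the sketch implements only the "most frequent prediction" variant, returning the candidates from the longest context that matches. The weighted variant is not shown.

```python
from collections import Counter, defaultdict

def build_model(tokens, context_sizes=(4, 3, 2)):
    """Map each context (tuple of words) to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for size in context_sizes:
        for i in range(len(tokens) - size):
            context = tuple(tokens[i:i + size])
            model[context][tokens[i + size]] += 1
    return model

def predict_next(model, words, k=3, context_sizes=(4, 3, 2)):
    """Try the last 4 words, then 3, then 2; return up to k most frequent followers."""
    for size in sorted(context_sizes, reverse=True):
        if len(words) < size:
            continue  # input too short for this context size
        followers = model.get(tuple(words[-size:]))
        if followers:
            return [word for word, _ in followers.most_common(k)]
    return []  # no context matched at any size

# Toy corpus standing in for the cleaned training text.
corpus = ("i want to go to the park and i want to go to the beach "
          "and i want to go home").split()
model = build_model(corpus)

# "we want to go" never occurs as a 4-word context, so the lookup
# backs off to the 3-word context "want to go".
suggestions = predict_next(model, "we want to go".split())
print(suggestions)
```

The parameter k plays the role of the slider in the app: it caps how many candidate words are returned.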

Where can I find it?

The Shiny application can be found at the link below. Feel free to test it out.

https://arindm.shinyapps.io/Capstone-Project/

The app takes a few seconds to load as it is loading the dataset in the background.

Shiny App walkthrough

There is an input field where a string can be entered to get the predicted words. In the left panel, a slider with a range of 1-10 controls the number of predicted words.

So set the slider and have a go!
