Data Science Capstone Project: Predict the Next Word

Arind
15/10/2018

Introduction

The aim of the project is to create a Shiny application that can predict, to a certain degree, the next word when a string of input text is provided to it.

The data provided come from three sources: news, blogs and tweets.

  • Since the source data may contain many irregularities, we clean and process the data before creating a prediction model. A presentation on cleaning and interpreting the sample data is available here.

  • After cleaning and processing the data, a model is built around the backoff algorithm to predict the next word based on the most frequently occurring word.

  • Once the model can predict words with some relevance, we build a Shiny application to provide a UI.
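The cleaning step in the first bullet can be sketched as below. This is a minimal illustration in Python, not the project's actual pipeline (the app itself is built with Shiny). The function name `clean_text` and the specific rules (lowercasing, dropping URLs, keeping only letters and apostrophes) are assumptions; the source does not say which irregularities were removed.

```python
import re

def clean_text(line):
    """Lowercase a line, drop URLs and non-letter characters, return a list of words.

    The rules here are illustrative assumptions, not the project's actual ones.
    """
    line = line.lower()
    line = re.sub(r"https?://\S+|www\.\S+", " ", line)  # drop URLs
    line = re.sub(r"[^a-z' ]+", " ", line)              # keep letters, apostrophes, spaces
    words = (w.strip("'") for w in line.split())        # trim stray quote marks
    return [w for w in words if w]

print(clean_text("Check THIS out: http://example.com -- it's GREAT!!! 100%"))
```

Digits and punctuation are discarded here so that tokens such as "great" and "GREAT!!!" count as the same word when frequencies are computed later.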

Basics of the Algorithm

An n-gram approach is used to build the prediction algorithm: all sequences of 2, 3, 4 and 5 consecutive words (n-grams) found in the data are counted. The n-grams are then sorted by frequency, from most to least frequent.
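As a rough illustration of the counting described above, the Python sketch below builds frequency tables for 2- to 5-grams from a toy sentence. The helper name `ngram_counts` and the sample text are hypothetical, and the real model is built from the full news, blogs and tweets data rather than a single line.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (an n-gram)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "i like green tea and i like green apples".split()

# Frequency tables for 2-, 3-, 4- and 5-grams, most frequent first.
for n in range(2, 6):
    print(n, ngram_counts(tokens, n).most_common(2))
```

`Counter.most_common` returns the n-grams ordered from most to least frequent, which matches the ordering described above.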

The algorithm takes the following steps to predict the next word:

  • Look for the last 4-gram in the sentence and its following word.
  • Look for the last 3-gram in the sentence and its following word.
  • Look for the last 2-gram in the sentence and its following word.

Then take the most frequent prediction across all the n-grams, or a weighted prediction across all the n-grams.
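The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the app's actual code: the names `build_model`, `predict_backoff` and `predict_weighted` are hypothetical, the weighting scheme (context length times relative frequency) is an arbitrary choice, and a one-word context is included as a last fallback because 2-grams are among the n-grams evaluated.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_context=4):
    """Map each context of 1..max_context words to a Counter of following words."""
    model = defaultdict(Counter)
    for k in range(1, max_context + 1):
        for i in range(len(tokens) - k):
            model[tuple(tokens[i:i + k])][tokens[i + k]] += 1
    return model

def predict_backoff(model, words, top_k=3, max_context=4):
    """Try the longest context first and back off to shorter ones until a match is found."""
    for k in range(min(max_context, len(words)), 0, -1):
        followers = model.get(tuple(words[-k:]))
        if followers:
            return [w for w, _ in followers.most_common(top_k)]
    return []

def predict_weighted(model, words, top_k=3, max_context=4):
    """Score candidates from every matching context, favouring longer contexts."""
    scores = Counter()
    for k in range(1, min(max_context, len(words)) + 1):
        followers = model.get(tuple(words[-k:]))
        if followers:
            total = sum(followers.values())
            for word, count in followers.items():
                scores[word] += k * count / total  # weight k is an assumption
    return [w for w, _ in scores.most_common(top_k)]

corpus = "i like green tea and i like green apples and i like green tea".split()
model = build_model(corpus)
print(predict_backoff(model, "we like green".split(), top_k=2))
print(predict_weighted(model, "like green".split(), top_k=2))
```

In the first call the three-word context "we like green" never occurs in the toy corpus, so the function backs off to "like green". The `top_k` argument plays the role of the slider that sets how many predicted words are shown.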

Where can I find it?

The Shiny application can be found at the link below. Feel free to test it out.

https://arindm.shinyapps.io/Capstone-Project/

The app takes a few seconds to load because it loads the dataset in the background.

Shiny App walkthrough

There is an input field where a string can be entered to get predicted words. In the left panel, a slider with a range of 1-10 controls the number of predicted words.

So set the slider and have a go!

Resources: