Next Word Prediction Using a Natural Language Processing Model

Arouna Mopa
April 16, 2016

Data Science Capstone Project - Coursera/JHU

Project background

This project used an N-gram model to build an algorithm that suggests the next word, given text entered by the user as input. The data source used in this project contains three types of text: Twitter posts, news articles, and blogs. After cleaning and subsetting the data to build the training set, an N-gram model was created and a predictive algorithm (Katz back-off) was applied to predict the next word.
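The cleaning and counting steps are not shown here, but a minimal R sketch of what they might look like is below. The helper names clean_tokens and count_ngrams are hypothetical illustrations, not the project's actual code:

```r
# Hypothetical sketch of the preprocessing step: lowercase the raw lines
# (Twitter, news, blogs), strip everything except letters, apostrophes
# and spaces, and split the result into tokens.
clean_tokens <- function(lines) {
  text <- tolower(paste(lines, collapse = " "))
  text <- gsub("[^a-z' ]+", " ", text)   # drop punctuation, digits, symbols
  strsplit(trimws(text), "\\s+")[[1]]
}

# Count the n-grams of order n in a token vector.
count_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(tokens) - n + 1),
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  table(grams)
}
```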

The final data product was published as a Shiny application. This application can be found online using the links below:

  • Data product on Shiny.io

  • Project Software

How does the model predict the next word?

In an N-gram model, the length of the history is N-1. For example, in a 2-gram model the length of the history is 1, and in a 3-gram model the length of the history is 2.

The prediction of the next word is based on the conditional probability of the word given the history, which is approximated by the number of times the full phrase occurred divided by the number of times the history occurred:

P(everybody | Good morning) = #(Good morning everybody) / #(Good morning)
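As a rough illustration, the sketch below estimates this ratio using the hypothetical count_ngrams helper from earlier, backing off to a shorter history when the full history was never seen. Note that this is a simplified back-off, not the full Katz back-off used in the project, which additionally discounts observed counts and redistributes the leftover probability mass:

```r
# Hypothetical sketch: estimate P(word | history) as
# count(history + word) / count(history). When the history was never
# seen, back off by dropping its oldest word; fall back to the
# word's relative frequency if no history remains.
predict_prob <- function(word, history, tokens) {
  hist_words <- strsplit(history, " ")[[1]]
  while (length(hist_words) > 0) {
    n      <- length(hist_words) + 1
    ngrams <- count_ngrams(tokens, n)
    hists  <- count_ngrams(tokens, n - 1)
    phrase <- paste(c(hist_words, word), collapse = " ")
    h      <- paste(hist_words, collapse = " ")
    if (!is.na(hists[h]) && hists[h] > 0 && !is.na(ngrams[phrase])) {
      return(as.numeric(ngrams[phrase] / hists[h]))
    }
    hist_words <- hist_words[-1]   # back off: drop the oldest word
  }
  # Unigram fallback: relative frequency of the word itself.
  uni <- count_ngrams(tokens, 1)
  as.numeric(ifelse(is.na(uni[word]), 0, uni[word] / sum(uni)))
}

tokens <- clean_tokens(c("good morning everybody", "good morning team",
                         "good evening everybody"))
predict_prob("everybody", "good morning", tokens)
# 0.5: "good morning everybody" occurs once, "good morning" twice
```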

What does the application look like?

  • The left panel on the home page has an input box where the user can enter text.

  • As the content of the input box changes, the suggested next words are displayed in the right panel.

  • The application can suggest up to five next words, ranked in decreasing order of their probability of occurrence (a minimal sketch of this layout follows the screenshot below).

[Screenshot of the application's home page]
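A minimal Shiny sketch of this layout is below; the widget names and the predict_next stand-in are illustrative assumptions, not the published application's code:

```r
library(shiny)

# Hypothetical stand-in for the real prediction routine: given the
# text typed so far, return up to five candidate next words.
predict_next <- function(text) {
  head(c("the", "a", "to", "and", "of"), 5)
}

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("user_text", "Enter a text:", value = "")  # left-panel input box
    ),
    mainPanel(
      h4("Suggested next words"),
      verbatimTextOutput("suggestions")                    # right-panel suggestions
    )
  )
)

server <- function(input, output) {
  # Re-runs automatically whenever the content of the input box changes.
  output$suggestions <- renderText({
    paste(predict_next(input$user_text), collapse = "\n")
  })
}

shinyApp(ui = ui, server = server)
```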

Resources