Next Word Prediction Using a Natural Language Processing Model

Arouna Mopa
April 16, 2016

Data Science Capstone Project - Coursera/JHU

Project background

This project used an N-gram model to build an algorithm that suggests the next word, given text entered by the user as input. The data source used in this project contains three types of text: Twitter posts, news articles, and blogs. After cleaning and subsetting the data to build the training set, an N-gram model was created and a predictive algorithm (Katz back-off) was applied to predict the next word.
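The cleaning and counting steps are not shown here, but a minimal R sketch of what they might look like is below. The helper names clean_tokens and count_ngrams are hypothetical illustrations, not the project's actual code:

```r
# Hypothetical sketch of the preprocessing step: lowercase the raw lines
# (Twitter, news, blogs), strip everything except letters, apostrophes
# and spaces, and split the result into tokens.
clean_tokens <- function(lines) {
  text <- tolower(paste(lines, collapse = " "))
  text <- gsub("[^a-z' ]+", " ", text)   # drop punctuation, digits, symbols
  strsplit(trimws(text), "\\s+")[[1]]
}

# Count the n-grams of order n in a token vector.
count_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(tokens) - n + 1),
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  table(grams)
}
```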

The final data product was published as a Shiny application. This application can be found online using the links below:

  • Data product on Shiny.io

  • Project Software

How does the model predict the next word?

In an N-gram model, the length of the history is N-1. For example, in a 2-gram model the length of the history is 1, and in a 3-gram model the length of the history is 2.

The prediction of the next word is based on the conditional probability of the word given the history, which is approximated by the number of times the full phrase occurred divided by the number of times the history occurred:

P(everybody | Good morning) = #(Good morning everybody) / #(Good morning)
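As a rough illustration, the sketch below estimates this ratio using the hypothetical count_ngrams helper from earlier, backing off to a shorter history when the full history was never seen. Note that this is a simplified back-off, not the full Katz back-off used in the project, which additionally discounts observed counts and redistributes the leftover probability mass:

```r
# Hypothetical sketch: estimate P(word | history) as
# count(history + word) / count(history). When the history was never
# seen, back off by dropping its oldest word; fall back to the
# word's relative frequency if no history remains.
predict_prob <- function(word, history, tokens) {
  hist_words <- strsplit(history, " ")[[1]]
  while (length(hist_words) > 0) {
    n      <- length(hist_words) + 1
    ngrams <- count_ngrams(tokens, n)
    hists  <- count_ngrams(tokens, n - 1)
    phrase <- paste(c(hist_words, word), collapse = " ")
    h      <- paste(hist_words, collapse = " ")
    if (!is.na(hists[h]) && hists[h] > 0 && !is.na(ngrams[phrase])) {
      return(as.numeric(ngrams[phrase] / hists[h]))
    }
    hist_words <- hist_words[-1]   # back off: drop the oldest word
  }
  # Unigram fallback: relative frequency of the word itself.
  uni <- count_ngrams(tokens, 1)
  as.numeric(ifelse(is.na(uni[word]), 0, uni[word] / sum(uni)))
}

tokens <- clean_tokens(c("good morning everybody", "good morning team",
                         "good evening everybody"))
predict_prob("everybody", "good morning", tokens)
# 0.5: "good morning everybody" occurs once, "good morning" twice
```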

What does the application look like?

  • The left panel on the home page has an input box where the user can enter text.

  • As the content of the input box changes, the suggested next words are displayed in the right panel.

  • The application can suggest up to five next words, ranked in decreasing order of their probability of occurrence (a minimal sketch of this layout follows the screenshot below).

[Screenshot of the application's home page]
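A minimal Shiny sketch of this layout is below; the widget names and the predict_next stand-in are illustrative assumptions, not the published application's code:

```r
library(shiny)

# Hypothetical stand-in for the real prediction routine: given the
# text typed so far, return up to five candidate next words.
predict_next <- function(text) {
  head(c("the", "a", "to", "and", "of"), 5)
}

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("user_text", "Enter a text:", value = "")  # left-panel input box
    ),
    mainPanel(
      h4("Suggested next words"),
      verbatimTextOutput("suggestions")                    # right-panel suggestions
    )
  )
)

server <- function(input, output) {
  # Re-runs automatically whenever the content of the input box changes.
  output$suggestions <- renderText({
    paste(predict_next(input$user_text), collapse = "\n")
  })
}

shinyApp(ui = ui, server = server)
```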

Resources