Next Word Prediction

Vimal Simha
12-Aug-2020

Introduction

  • Data Science Capstone Project offered by Johns Hopkins University through Coursera

  • Aim is to build a model to predict the next word in a sentence using Natural Language Processing (NLP) techniques

  • Develop a Shiny App as a user interface for the model

Data

  • Training data are collected by a web crawler from publicly available sources (blogs, news articles and Twitter feeds) and can be downloaded here.

  • Data are cleaned to remove punctuation, extra whitespace, numbers and profanities, then tokenised into words.

  • Contiguous sequences of words (n-grams) are extracted, their frequencies are calculated, and frequently occurring n-grams are indexed and saved.
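
The cleaning and n-gram steps above can be sketched roughly as follows. This is a minimal Python illustration, not the project's own code: the function names, the `min_count` cut-off and the placeholder profanity list are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the list used in the project is not given here.
PROFANITIES = {"darn", "heck"}

def tokenise(text, banned=PROFANITIES):
    """Lower-case the text, strip numbers and punctuation, and split into words."""
    text = text.lower()
    # Replace every character that is not a letter or whitespace with a space.
    # Note that this also splits contractions such as "don't" into "don" and "t".
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no argument also collapses runs of extra whitespace.
    return [word for word in text.split() if word not in banned]

def count_ngrams(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def frequent_ngrams(counts, min_count=2):
    """Keep only n-grams seen at least min_count times (assumed cut-off)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

tokens = tokenise("The cat sat. The cat ran 42 times, darn!")
bigrams = count_ngrams(tokens, 2)
kept = frequent_ngrams(bigrams)
```

Storing only the frequent n-grams keeps the lookup table small, which matters for an interactive app, at the price of losing rare continuations.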

Prediction Model

  • The next word prediction is based on the Katz back-off model.

  • The last three words are used to predict the next word.

  • If there is no match above a likelihood threshold, the number of words considered is progressively reduced.

  • If no match is found, the algorithm returns the most commonly used single word.

  • The model can be extended to include more words, correlations and sentiment analysis, but at the cost of slower predictions and higher computational expense.
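
The back-off lookup described above can be illustrated with a short Python sketch. It is a simplified stand-in rather than the project's implementation: it uses raw relative frequencies and omits the discounting and back-off weights of a full Katz model, and the names `build_model`, `predict_next` and `threshold` are assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_context=3):
    """Map each context (a tuple of 0 to max_context words) to counts of the next word."""
    model = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for k in range(max_context + 1):
            if i - k < 0:
                break
            # The k words immediately before position i form the context.
            model[tuple(tokens[i - k:i])][word] += 1
    return model

def predict_next(model, words, max_context=3, threshold=0.0):
    """Try the longest context first, then drop the oldest word and retry."""
    recent = words[-max_context:] if max_context else []
    context = tuple(w.lower() for w in recent)
    while True:
        followers = model.get(context)
        if followers:
            word, count = followers.most_common(1)[0]
            likelihood = count / sum(followers.values())
            # The empty context is the final fallback: the most common single word.
            if likelihood > threshold or not context:
                return word
        if not context:
            return None  # only reached when the model is empty
        context = context[1:]

tokens = "the cat sat on the mat and the cat sat on the sofa".split()
model = build_model(tokens)
```

For instance, `predict_next(model, ["a", "purple", "cat"])` finds no match for the three-word or two-word context and falls back to the single word "cat", while an unseen word such as "zebra" falls all the way back to the most frequent word in the training text.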

App And Code