Predicting Next Word

Innuganti
Nov 29, 2018

Introduction

The objective of this capstone project is to build an application that predicts the next word a user is likely to type and to deploy it as a Shiny app. The word prediction app is built from a transformed corpus of text files.

This was the final project of the Data Science Specialization offered by Johns Hopkins University on Coursera, run as an industry partnership with SwiftKey.

The data can be obtained from here.

Key Features

The project involved the following steps:

Acquisition of data from Coursera
Sampling: only 0.2% of the blogs, news and Twitter data was used, which improves prediction speed and avoids memory limitations
Data wrangling for transforming and mapping (build unigram, bigram, trigram and quadgram models from the sampled data)
Use of Katz Backoff Algorithm
Application Development
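
The sampling and n-gram steps above can be sketched roughly as follows. This is a minimal Python illustration and not the project's actual code (the app itself is a Shiny app); the function names `sample_lines`, `tokenize` and `ngram_counts`, the fixed seed and the simple regex tokenizer are all assumptions made for the example.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction, seed=0):
    """Keep each line with probability `fraction` (e.g. 0.002 for 0.2%)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (assumed cleaning rule)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of words) found within each line."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    corpus = ["The cat sat on the mat.", "The cat ran away."]
    sampled = sample_lines(corpus, fraction=1.0)  # keep everything in this toy run
    for n in (1, 2, 3, 4):  # unigram .. quadgram tables
        print(n, ngram_counts(sampled, n).most_common(2))
```

Counting within each line keeps n-grams from spanning unrelated sentences, at the cost of ignoring context across line breaks.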

Prediction Algorithm

Katz Backoff Model:

This is a non-linear method for estimating the conditional probability of a word given its history. It relies on Good-Turing discounting, which takes some probability mass away from the observed higher-order N-grams and redistributes it to lower-order N-grams.
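
For reference, the usual textbook statement of the model is given here. The symbols are the standard ones rather than anything taken from the project's code: $C(\cdot)$ is an n-gram count, $k$ is a count threshold (often 0), $d$ is the Good-Turing discount ratio and $\alpha$ is the back-off weight.

$$
P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) =
\begin{cases}
d_{w_{i-n+1} \cdots w_i} \, \dfrac{C(w_{i-n+1} \cdots w_i)}{C(w_{i-n+1} \cdots w_{i-1})} & \text{if } C(w_{i-n+1} \cdots w_i) > k \\[2ex]
\alpha_{w_{i-n+1} \cdots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1}) & \text{otherwise}
\end{cases}
$$

The discount comes from the Good-Turing adjusted count $C^{*} = (C + 1)\,\dfrac{N_{C+1}}{N_C}$, with $d = C^{*} / C$, where $N_C$ is the number of distinct n-grams seen exactly $C$ times. The weight $\alpha$ is chosen so that the probability mass removed by discounting is exactly what gets shared among the words reached by backing off.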

The algorithm uses the quadgram model if there is sufficient evidence; otherwise it backs off to the trigram model, then the bigram model, and finally the unigram model. We continue backing off until we reach a history that has some counts.
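
The back-off order described above can be illustrated with a small Python sketch. It is deliberately simplified: it only shows the quadgram to unigram fallback and returns the most frequent continuation, and it leaves out the Good-Turing discounting and back-off weights, so it is not a full Katz implementation and not the app's actual code. The names `build_models` and `predict_next` are hypothetical.

```python
from collections import Counter


def build_models(sentences, max_n=4):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    models = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                models[n][tuple(tokens[i:i + n])] += 1
    return models


def predict_next(models, phrase, max_n=4):
    """Try the longest history first and back off until some n-gram matches."""
    tokens = phrase.lower().split()
    for n in range(max_n, 1, -1):
        history = tuple(tokens[-(n - 1):])
        if len(history) < n - 1:
            continue  # the phrase is too short for this order
        candidates = {
            gram[-1]: count
            for gram, count in models[n].items()
            if gram[:-1] == history
        }
        if candidates:
            return max(candidates, key=candidates.get)
    # No history matched at any order: fall back to the most common single word.
    if models[1]:
        return models[1].most_common(1)[0][0][0]
    return None


if __name__ == "__main__":
    models = build_models(["i love new york", "i love new york city", "i love you"])
    print(predict_next(models, "i love new"))  # matched at the quadgram level
    print(predict_next(models, "we love"))     # backs off to the bigram level
```

Scanning every n-gram for each lookup keeps the sketch short; a real app would index the tables by history for speed.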

How to use App

The user enters a word or partial phrase and presses the “Predict Next Word” button
The predicted word appears in the “Predicted Next Word” field
Link to Word Prediction App
Link to Word Prediction Code (https://github.com/innuganti/Text_Prediction)

References

Katz's Back-off Model: Wikipedia

N-Grams: Stanford

Good-Turing Smoothing