Predict Next Word App - Using SwiftKey data

khsarma
21-Oct-2018

Data Science Capstone

This project applies natural language processing (NLP) to the SwiftKey data set, which consists of text feeds from Twitter, news and blogs. The task is to take the input data, build a corpus, create n-grams and output next-word predictions.

Input:

A corpus built from the Twitter, News and Blogs data.
A set of words entered by the user (used to predict the next word).

Output:

A Shiny app takes the user's input words and returns the next-word prediction in the form of a set of probable words.

Algorithm

After importing each of the files (Twitter, Blogs and News), take a 10% sample of the total data and build corpus files using the tm library.
Using RWeka, create n-grams (bigrams, trigrams and quadgrams).
Transform the corpus, for example by removing URLs and profane words and converting text to lowercase.
Calculate the frequency of each n-gram and sort the n-grams by frequency.
Apply the prediction function (explained in the next slide) to obtain the next-word prediction.
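The steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the tm/RWeka code used in the project: the function names, the PROFANE word list and the demo sentences are all hypothetical, and cleaning is applied before the n-grams are counted.

```python
import random
import re
from collections import Counter

# Hypothetical stand-in for a profanity list; the project's list is not given.
PROFANE = {"darn"}

def clean(line):
    """Lowercase, strip URLs and non-letter characters, drop profane words."""
    line = line.lower()
    line = re.sub(r"(https?://|www\.)\S+", " ", line)
    line = re.sub(r"[^a-z' ]+", " ", line)
    return [w for w in line.split() if w not in PROFANE]

def sample_lines(lines, fraction=0.10, seed=42):
    """Keep roughly `fraction` of the lines, chosen at random."""
    rng = random.Random(seed)
    return [ln for ln in lines if rng.random() < fraction]

def ngram_counts(token_lists, n):
    """Count n-grams within each line (n-grams never span two lines)."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Tiny demo corpus; it is too small to sample, so every line is used.
lines = [
    "Hello how are you today? http://example.com",
    "How are you doing",
    "hello how are things",
]
tokens = [clean(ln) for ln in lines]
bigrams = ngram_counts(tokens, 2)

# most_common() returns the n-grams sorted by descending frequency.
print(bigrams.most_common(3))
```

On real data, sample_lines would be applied to each source file before cleaning, and ngram_counts would be called with n = 2, 3 and 4.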

Prediction Model

Katz's Backoff model (Reference: Wiki) with Good-Turing discounting is used for prediction. This model estimates the conditional probability of a word given the words that precede it.
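For reference, the textbook form of the model can be written as follows. The notation is an assumption, not taken from the slides: C(.) is an n-gram count, k is a count threshold (often 0), d is the Good-Turing discount for the observed count and alpha is the back-off weight that redistributes the discounted probability mass.

```latex
P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) =
\begin{cases}
  d_{w_{i-n+1} \cdots w_i} \,
  \dfrac{C(w_{i-n+1} \cdots w_i)}{C(w_{i-n+1} \cdots w_{i-1})}
    & \text{if } C(w_{i-n+1} \cdots w_i) > k \\[2ex]
  \alpha_{w_{i-n+1} \cdots w_{i-1}} \,
  P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1})
    & \text{otherwise}
\end{cases}
```

In words: if the full n-gram has been seen often enough, its discounted relative frequency is used; otherwise the model backs off to the shorter context and scales the result by alpha.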

After the n-grams are generated, the Good-Turing algorithm is applied to the final corpus to obtain a discount coefficient for each n-gram.

Good-Turing frequency estimation
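One common way to turn Good-Turing estimates into discount coefficients is sketched below in Python. It is an assumption about the general technique, not the project's code: the function name, the cut-off k = 5 and the choice to fall back to 1.0 (no discount) when a frequency-of-frequency is missing are all illustrative.

```python
from collections import Counter

def good_turing_discounts(ngram_counts, k=5):
    """Return {r: d_r} for counts r = 1..k.

    d_r = r* / r with r* = (r + 1) * N_{r+1} / N_r, where N_r is the
    number of distinct n-grams seen exactly r times. Counts above k are
    treated as reliable and are not discounted by the caller.
    """
    freq_of_freq = Counter(ngram_counts.values())
    discounts = {}
    for r in range(1, k + 1):
        n_r = freq_of_freq.get(r, 0)
        n_r1 = freq_of_freq.get(r + 1, 0)
        if n_r == 0 or n_r1 == 0:
            # Assumption: without both N_r and N_{r+1}, apply no discount.
            discounts[r] = 1.0
        else:
            # Assumption: clamp at 1.0 so a discount never inflates a count.
            discounts[r] = min(1.0, (r + 1) * n_r1 / (n_r * r))
    return discounts

# Toy counts: six n-grams seen once, two seen twice, one seen three times.
toy = {("once", str(i)): 1 for i in range(6)}
toy.update({("twice", "a"): 2, ("twice", "b"): 2, ("thrice", "a"): 3})
discounts = good_turing_discounts(toy)
print(discounts)
```

With these toy counts, n-grams seen once are discounted the most (d_1 = 2/3), those seen twice slightly less (d_2 = 0.75), and the remaining counts are left unchanged.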

The prediction function is built on Katz's Backoff algorithm. For each set of input words passed to it, the function uses the n-gram frequencies and discounts to calculate the probability of each candidate word appearing at the end of the input text.

[Figure: Katz's Backoff (KBO)]
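The back-off step can be illustrated with a reduced Python sketch that uses only bigrams and unigrams; the project itself works with n-grams up to quadgrams. The function name, the toy counts and the discount values are hypothetical, and the context total is taken as the sum of the bigram counts that start with the preceding word so that the probabilities sum to 1.

```python
from collections import Counter

def katz_bigram_probs(prev, unigrams, bigrams, discount):
    """Katz back-off distribution over next words given one preceding word.

    discount(r) returns the discount coefficient d_r in (0, 1] for count r.
    """
    total_prev = sum(c for (w1, _), c in bigrams.items() if w1 == prev)
    if total_prev == 0:
        # Context never seen: fall back to unigram relative frequencies.
        total = sum(unigrams.values())
        return {w: c / total for w, c in unigrams.items()}

    # Seen bigrams keep their discounted relative frequency.
    probs = {}
    for (w1, w2), c in bigrams.items():
        if w1 == prev:
            probs[w2] = discount(c) * c / total_prev

    # The mass freed by discounting is shared among unseen words
    # in proportion to their unigram counts.
    leftover = 1.0 - sum(probs.values())
    unseen = {w: c for w, c in unigrams.items() if w not in probs}
    unseen_total = sum(unseen.values())
    if unseen_total > 0:
        for w, c in unseen.items():
            probs[w] = leftover * c / unseen_total
    return probs

# Toy counts and hypothetical discounts, for illustration only.
unigrams = Counter({"how": 3, "are": 3, "you": 2, "hello": 2, "things": 1})
bigrams = Counter({("hello", "how"): 2, ("how", "are"): 3,
                   ("are", "you"): 2, ("are", "things"): 1})
discount = lambda r: {1: 0.5, 2: 0.75}.get(r, 1.0)

probs = katz_bigram_probs("are", unigrams, bigrams, discount)
for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(word, round(p, 3))
```

Given the preceding word "are", the seen continuations "you" and "things" receive discounted probabilities, and the remaining mass is spread over the words never observed after "are".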

Shiny App and Demo

The Shiny app contains:

A text box for user input
A Submit button
Prediction results and a plot, shown in separate tabs

Example: Step-1: The user enters a set of words, “hello how are”, and clicks the Submit button.

[Screenshot: App]

Step-2: The predicted words appear under the “Prediction” tab.

[Screenshot: App]

The word “you” has a higher probability of appearing next than the other candidates.

Step-3: The user can also view a plot of the words under the “Plot” tab.