Coursera Data Science Specialization Capstone Project

Predict Next Word App - Using SwiftKey data

khsarma
21-Oct-2018

Data Science Capstone

This is a natural language processing (NLP) project. It analyses the SwiftKey data set: text feeds from Twitter, news and blogs. The task is to take the input data, build a corpus, generate n-grams and output next-word predictions.

Input:

  • Corpus created based on Twitter, News and Blogs data.
  • A set of words entered by the user (used to predict the next word)

Output:

  • A Shiny app takes the user's input words and returns the next-word prediction in the form of a set of probable words.

Algorithm

  • After importing each file (Twitter, Blogs and News), take a 10% sample of the total data and build the corpus using the tm library.
  • Using RWeka, create n-grams (bi-, tri- and quad-grams).
  • Transform the corpus: remove URLs and profane words, convert to lowercase, etc.
  • Calculate the frequency of each n-gram and sort by frequency.
  • Apply the prediction function (explained on the next slide) to obtain the next-word predictions.
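The steps above can be sketched in a few lines. This is a minimal, language-agnostic illustration in plain Python, not the project's code (which used R with tm and RWeka). The names `clean`, `sample_lines`, `ngram_counts` and the `PROFANITY` set are hypothetical, and cleaning is applied before tokenising here, which is an assumption about the order.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder; a real run would load a full profanity list.
PROFANITY = {"darn"}

def clean(line, banned=PROFANITY):
    """Lower-case a line, strip URLs and non-letters, drop banned words."""
    line = line.lower()
    line = re.sub(r"(https?://|www\.)\S+", " ", line)  # remove URLs
    line = re.sub(r"[^a-z' ]+", " ", line)             # keep letters and apostrophes
    return [tok for tok in line.split() if tok not in banned]

def sample_lines(lines, fraction=0.10, seed=42):
    """Keep each line with probability `fraction` (roughly a 10% sample)."""
    rng = random.Random(seed)
    return [ln for ln in lines if rng.random() < fraction]

def ngram_counts(token_lists, n):
    """Count every run of n consecutive tokens within each line."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus standing in for the Twitter, Blogs and News files.
lines = [
    "Hello how are you",
    "Hello how are YOU doing http://example.com",
    "how are things",
]
tokens = [clean(ln) for ln in lines]
trigrams = ngram_counts(tokens, 3)
ranked = trigrams.most_common()  # n-grams sorted by frequency, highest first
```

On the real data, `sample_lines` would be applied to each file before cleaning; the toy corpus is too small to sample.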

Prediction Model

Katz's Backoff model (Reference: Wiki) with Good-Turing discounting is used for prediction. The model estimates the conditional probability of a word given the words that precede it.

  • After generating the n-grams, the Good-Turing algorithm is applied to the final corpus to obtain a discount coefficient for each n-gram.

Good–Turing frequency estimation
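As a rough illustration of how a Good-Turing discount coefficient can be derived from n-gram counts, the sketch below uses the simple form d_r = r*/r with r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct n-grams seen exactly r times. The function name, the cutoff `k=5`, and the cap at 1.0 are assumptions made for this example; the slides do not say which variant the project used.

```python
from collections import Counter

def good_turing_discounts(ngram_counts, k=5):
    """Return {r: d_r} for every count r that occurs in ngram_counts.

    d_r = r*/r with r* = (r + 1) * N_{r+1} / N_r for r <= k.
    Counts above k, or counts with N_{r+1} == 0, are left undiscounted (d_r = 1).
    """
    freq_of_freq = Counter(ngram_counts.values())  # N_r for each r
    discounts = {}
    for r, n_r in freq_of_freq.items():
        n_r1 = freq_of_freq.get(r + 1, 0)
        if r <= k and n_r1 > 0:
            adjusted = (r + 1) * n_r1 / n_r        # r*
            # Guard (an assumption): never inflate a count on sparse data.
            discounts[r] = min(1.0, adjusted / r)
        else:
            discounts[r] = 1.0
    return discounts

# Toy bigram counts: three bigrams seen once, one seen twice, one seen three times.
toy_counts = {"a b": 1, "a c": 1, "a d": 1, "b c": 2, "c d": 3}
toy_discounts = good_turing_discounts(toy_counts)
```

Here N_1 = 3 and N_2 = 1, so the discount for singletons is (2 × 1 / 3) / 1 ≈ 0.67; the probability mass removed this way is what back-off later hands to unseen words.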

  • The prediction function is built on Katz's Backoff algorithm. For each set of input words, it uses the word frequencies and discounts to calculate the probability of each candidate word appearing at the end of the input text.

[Figure: KBO]
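To make the back-off idea concrete, here is a simplified two-level sketch (bigram backing off to unigram) in plain Python. The project works with n-grams up to quad-grams in R, and the same logic would be applied level by level. All names (`katz_bigram_probs`, `predict_next`) and the toy counts and discounts are hypothetical, chosen only for illustration.

```python
from collections import Counter

def katz_bigram_probs(prev, unigrams, bigrams, discounts):
    """Return {word: P(word | prev)} under a two-level Katz back-off."""
    total_uni = sum(unigrams.values())
    # Words observed after `prev`, with their bigram counts.
    seen = {w2: c for (w1, w2), c in bigrams.items() if w1 == prev}
    if not seen:
        # Context never observed: fall back entirely to unigram frequencies.
        return {w: c / total_uni for w, c in unigrams.items()}
    # Context count taken as the sum of bigram counts that start with `prev`.
    context_count = sum(seen.values())
    # Discounted probabilities for observed continuations.
    probs = {w: discounts.get(c, 1.0) * c / context_count for w, c in seen.items()}
    # Mass freed by discounting is shared among unseen words by unigram count.
    leftover = 1.0 - sum(probs.values())
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen)
    for w, c in unigrams.items():
        if w not in seen and unseen_mass > 0:
            probs[w] = leftover * c / unseen_mass
    return probs

def predict_next(text, unigrams, bigrams, discounts, top_n=3):
    """Rank candidate next words for the last word of `text`."""
    words = text.lower().split()
    prev = words[-1] if words else ""
    probs = katz_bigram_probs(prev, unigrams, bigrams, discounts)
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy counts and discounts, invented for this example.
unigrams = Counter({"how": 3, "are": 3, "you": 2, "things": 1, "hello": 2})
bigrams = Counter({("hello", "how"): 2, ("how", "are"): 3,
                   ("are", "you"): 2, ("are", "things"): 1})
discounts = {1: 0.5, 2: 0.75}

top = predict_next("hello how are", unigrams, bigrams, discounts)
```

With these toy numbers, “you” ranks first after “hello how are”, and the probabilities over the whole vocabulary sum to 1 because the discounted mass is redistributed to words never seen after “are”.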

Shiny App and Demo

The Shiny app contains:

  • A text box for user input
  • A Submit button
  • Prediction results and a plot, shown in separate tabs

Example:

Step-1: The user enters a set of words, “hello how are”, and clicks the Submit button.

[Figure: App]

Step-2: The predicted words appear under the “Prediction” tab.

[Figure: App]

Here the word “you” has a higher probability of appearing next than the other candidates.

Step-3: The user can also view a plot of the words under the “Plot” tab.