Data Science Capstone Final Project Submission

1st August 2020

Word prediction using Katz backoff algorithm


This project explores next-word prediction using natural language processing, including the choice of a suitable algorithm for the prediction.

Link to the app for the prediction model:

Getting and cleaning data

  • The data used in the model is the data provided by Johns Hopkins University. Since the dataset was too large and took a long time to process, we subset it to 10% using the rbinom function.

  • The data has been cleaned and tokenized using tm_map from the tm package, and profane words have been removed to improve the predictions.
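
The sampling and cleaning steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's R code: `sample_lines`, `clean_and_tokenize`, and the one-word `BANNED` set are hypothetical names and placeholders, and the per-line coin flip only mirrors the idea of selecting lines with rbinom.

```python
import random
import re


def sample_lines(lines, rate=0.1, seed=42):
    """Keep each line independently with probability `rate` (one Bernoulli draw per line)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]


def clean_and_tokenize(line, banned_words):
    """Lowercase the line, keep alphabetic tokens (apostrophes allowed), drop banned words."""
    tokens = re.findall(r"[a-z']+", line.lower())
    return [t for t in tokens if t not in banned_words]


# Placeholder standing in for a real profanity list (assumption for illustration).
BANNED = {"darn"}

corpus = ["This is a Darn good day!", "I can't wait to see you."]
cleaned = [clean_and_tokenize(line, BANNED) for line in corpus]
```

Because each line is kept independently, the sample size is only approximately 10% of the input; fixing the seed makes the subset reproducible.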

Prediction model

The data was first split into n-grams; the bigrams and trigrams were then processed and smoothed for use in the predictive algorithm.
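
As a rough illustration of the n-gram step, the Python sketch below counts bigrams and trigrams from tokenized sentences. The helper names `ngrams` and `count_ngrams` and the toy sentences are assumptions for illustration; the smoothing applied in the project is not specified here and is not reproduced.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts


# Toy input (hypothetical); a real run would use the cleaned corpus.
sentences = [
    ["i", "love", "new", "york"],
    ["i", "love", "new", "shoes"],
]
bigram_counts = count_ngrams(sentences, 2)
trigram_counts = count_ngrams(sentences, 3)
```

Counting within each sentence separately avoids creating n-grams that span a sentence boundary.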

The predictive algorithm used in the model is the Katz backoff model.

  • To predict the next word, the trigrams are searched first (the first two words of the trigram must match the last two words of the user's sentence).
  • If no trigram is found, back off to the bigrams (the first word of the bigram must match the last word of the sentence).
  • If no bigram is found, back off to the word with the highest frequency in the corpus, so 'the' is returned.
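
The lookup order above can be sketched as follows. This is a minimal Python illustration of the trigram, bigram, default-word fallback only: it omits the discounting and probability redistribution of a full Katz backoff model, and `predict_next` and the toy counts are hypothetical.

```python
from collections import Counter


def predict_next(text, trigram_counts, bigram_counts, default_word="the"):
    """Return the most frequent continuation, backing off trigram -> bigram -> default."""
    words = text.lower().split()

    # Step 1: match the last two words against the first two words of each trigram.
    if len(words) >= 2:
        context = (words[-2], words[-1])
        candidates = {k[2]: c for k, c in trigram_counts.items() if k[:2] == context}
        if candidates:
            return max(candidates, key=candidates.get)

    # Step 2: match the last word against the first word of each bigram.
    if words:
        last = words[-1]
        candidates = {k[1]: c for k, c in bigram_counts.items() if k[0] == last}
        if candidates:
            return max(candidates, key=candidates.get)

    # Step 3: nothing matched, so return the most common word.
    return default_word


# Toy counts (hypothetical) standing in for the tables built from the corpus.
toy_trigrams = Counter({("i", "love", "new"): 2, ("love", "new", "york"): 1})
toy_bigrams = Counter({("new", "york"): 3, ("new", "shoes"): 1, ("i", "love"): 2})

first = predict_next("I love", toy_trigrams, toy_bigrams)
second = predict_next("brand new", toy_trigrams, toy_bigrams)
third = predict_next("hello world", toy_trigrams, toy_bigrams)
```

With these toy counts, "I love" is resolved by a trigram, "brand new" falls through to a bigram, and "hello world" falls through to the default word.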

Shiny application

[Image: Shiny application]

Thank You