Text Prediction

Ken Mwai
Data Science Specialization: Capstone Project

Introduction

  • The main objective of this project is to create an algorithm that predicts the next word based on the previous words typed by a user.

  • We utilise data sets from a corpus called HC Corpora.

  • We use Natural Language Processing algorithms for the prediction.

Model

We use the Katz back-off model for word prediction: when the phrase formed by the last few words has not been observed at a given n-gram order, the model backs off to the next shorter order.

The first step in training and building the model is extracting the n-grams in the data set. We focused on:

  • 1-grams (unigrams): single words
  • 2-grams (bigrams): word pairs
  • 3-grams (trigrams): word triplets
  • 4-grams (quadgrams): word quadruplets
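As a rough illustration of this step, the base-R sketch below extracts n-grams from a line of text. The ngrams() helper and the sample input are placeholders, not the project's actual tokeniser:

    # Hypothetical helper: split a line into lower-case words and paste
    # consecutive runs of n words into n-gram strings.
    ngrams <- function(text, n) {
      words <- unlist(strsplit(tolower(text), "\\s+"))
      if (length(words) < n) return(character(0))
      vapply(seq_len(length(words) - n + 1),
             function(i) paste(words[i:(i + n - 1)], collapse = " "),
             character(1))
    }

    ngrams("This is a test", 2)
    #> [1] "this is" "is a"    "a test"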

Each set of n-grams was then collected into a data frame, and the conditional a-posteriori probabilities were computed using the naiveBayes() function.
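To show the quantity being computed, the sketch below builds a toy bigram table with the ngrams() helper above and derives the conditional probability of each word given its prefix as a plain relative frequency. This is only an illustration; the project itself used naiveBayes(), which is provided by the e1071 package:

    # Count 2-grams from a toy sample into a data frame (the real tables
    # were built from the full HC Corpora text).
    lines     <- c("this is a test", "this is another test", "is a test run")
    bigrams   <- unlist(lapply(lines, ngrams, n = 2))
    bigram_df <- as.data.frame(table(ngram = bigrams), stringsAsFactors = FALSE)

    # Split each n-gram into its prefix (all but the last word) and final word.
    split_ngram <- function(ng) {
      words <- strsplit(ng, " ")[[1]]
      c(prefix = paste(head(words, -1), collapse = " "),
        word   = tail(words, 1))
    }
    parts            <- t(vapply(bigram_df$ngram, split_ngram, c(prefix = "", word = "")))
    bigram_df$prefix <- parts[, "prefix"]
    bigram_df$word   <- parts[, "word"]

    # P(word | prefix) = count(prefix word) / count(prefix)
    prefix_totals  <- tapply(bigram_df$Freq, bigram_df$prefix, sum)
    bigram_df$prob <- bigram_df$Freq / prefix_totals[bigram_df$prefix]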

Application and prediction

A Shiny application is created where the user inputs a phrase and the application makes either one or two predictions, according to the user's chosen option. The prediction algorithm:

  • Examines the final words entered by the user (up to a maximum of 3). If that phrase is present in the training data, it predicts the most common next word; if not, it continues.
  • If no match is found, it predicts the most common word in the training data.
  • If the user selected the two-word option, the app predicts with both the 2-gram and the 3-gram model and outputs both words (a sketch of this lookup follows the list).
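A minimal sketch of this back-off lookup, assuming hypothetical tables like the bigram_df above for each n-gram order (all names here are illustrative, not the app's actual code):

    # Back-off next-word lookup (illustrative). `tables` is assumed to be a
    # list of data frames indexed by n-gram order, e.g. tables[[2]] holding
    # the bigram table, each with columns prefix, word, and prob.
    predict_next <- function(phrase, tables, fallback = "the") {
      words <- unlist(strsplit(tolower(phrase), "\\s+"))
      # Try the longest prefix first (up to 3 words), then back off.
      for (n in rev(seq_len(min(3, length(words))))) {
        prefix <- paste(tail(words, n), collapse = " ")
        tbl    <- tables[[n + 1]]  # an n-word prefix is looked up in the (n+1)-gram table
        hits   <- tbl[tbl$prefix == prefix, ]
        if (nrow(hits) > 0) return(hits$word[which.max(hits$prob)])
      }
      fallback  # placeholder for the most common word in the training data
    }

For the two-word option, the same lookup would simply be run against the 2-gram and 3-gram tables separately and both results shown.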

Strengths and improvements

The word predictions are pre-calculated and stored, so prediction at run time is quick. The application shows predictions from both the bi-gram and the tri-gram model, giving the user two choices to select from.

Improvements

  • Check the whole sentence, not just the last two or three words.
  • Offer a score alongside the most common word predicted.
  • Predict more than two words.