Capstone Project: Prediction of Next Word(s)

MAI
22 April 2016

This application (Prediction of Next Word(s)) was built using a simple back-off prediction algorithm.

About the Application

  • This presentation is part of the Data Science Capstone project.

  • The application provides an interface that can be accessed by others.

  • It takes a phrase (one or more words typed into a text box) as input and outputs a prediction of the next word.

  • For processing, the data is cleaned (by removing unwanted characters, extra whitespace, etc.) and tokenized into n-grams.

  • These n-grams are then stored in term frequency matrices.

  • The matrices are then used for word prediction.
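The cleaning, tokenizing, and counting steps above can be sketched as follows. This is a minimal illustration in Python, not the application's actual code: the function names (`clean`, `ngrams`, `build_counts`), the exact cleaning rules, and the use of plain counters in place of term frequency matrices are all assumptions made for the example.

```python
import re
from collections import Counter

def clean(text):
    """Assumed cleaning rules: lowercase, keep letters and apostrophes, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)    # replace anything else with a space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_counts(lines, max_n=3):
    """Count n-gram frequencies for n = 1..max_n (a stand-in for the frequency matrices)."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line).split()
        for n in range(1, max_n + 1):
            counts[n].update(ngrams(tokens, n))
    return counts

# Tiny hypothetical corpus, for illustration only
corpus = ["I love data science!!", "I love   data."]
counts = build_counts(corpus)
print(counts[2].most_common(2))
```

Each counter maps an n-gram (as a tuple of words) to the number of times it occurs, which is the information the prediction step needs.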

Simple Back-Off Algorithm

  • The customized Simple Back-off algorithm looks at the highest-order n-grams matching the end of the entered phrase and, if needed, backs off with a discount to lower-order n-grams until the highest-scoring match is found.

  • The following diagram illustrates the processing done using Simple Back-off:

[Diagram: Simple Back-off processing]
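A back-off scheme of this kind can be sketched as follows. This is a hedged illustration in Python rather than the application's implementation: the function names, the trigram limit (`max_n=3`), and the discount value of 0.4 are assumptions. Each candidate word is scored by its relative frequency after the longest matching context, and every step down to a shorter context multiplies the score by the discount.

```python
from collections import Counter

def build_counts(sentences, max_n=3):
    """Count all n-grams (n = 1..max_n) in one Counter keyed by word tuples."""
    counts = Counter()
    for s in sentences:
        tokens = s.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(phrase, counts, max_n=3, discount=0.4, top_k=3):
    """Rank next-word candidates, backing off from longer to shorter contexts."""
    tokens = phrase.lower().split()
    total_unigrams = sum(c for g, c in counts.items() if len(g) == 1)
    scores = {}
    weight = 1.0  # multiplied by `discount` at every back-off step
    for ctx_len in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - ctx_len:])
        denom = counts[context] if ctx_len else total_unigrams
        if denom:
            for gram, c in counts.items():
                if len(gram) == ctx_len + 1 and gram[:ctx_len] == context:
                    # Keep the score from the highest order where the word was seen
                    scores.setdefault(gram[-1], weight * c / denom)
        weight *= discount
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]

# Tiny hypothetical corpus, for illustration only
counts = build_counts([
    "i love data science",
    "i love data analysis",
    "i love data science",
    "we love music",
])
print(predict("i love data", counts))  # ['science', 'analysis', 'love']
```

In this toy corpus, "science" and "analysis" are found at the trigram level, while "love" only appears through the discounted unigram fallback, so it ranks last. A phrase that was never seen falls all the way back to the most frequent single words.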

Prediction of Next Word(s) Application

  • On the left-hand side of the screen, users enter the word or words for which they want the system to predict the next word.

  • Next, they select the number of predicted words and press the SUBMIT button. The entered phrase will appear on the PROCESS tab, and the predicted word(s) will be shown.

[Figure: application screenshot]

Conclusion

  • Throughout this project, I learned about Natural Language Processing and prediction algorithms.

  • It gave me first-hand experience in dealing with “Big Data”.

  • The app's prediction performance is limited by the small sample size, which was partly dictated by the processing capacity of the machine used.