WordPred - Next Word Prediction App

ZOG
June 2017

Overview

  • WordPred was produced as a deliverable for the Johns Hopkins Data Science Specialization capstone project.
  • The project was an exercise in Natural Language Processing and Text Mining with the ultimate goal of creating an application to predict the next word in a sentence.
  • A corpus of unstructured text from news articles, blog posts, and Twitter messages was made available through the course website.
  • The project tasks involved data cleaning, exploratory analysis, predictive model building, application development, and deployment.

Solution approach

  • A subset of the data from blog posts and Twitter messages was used to construct the model. The news articles were excluded from the final solution because the n-grams generated from them were not everyday, conversational phrases.
  • The R package quanteda was used to extract n-grams and to select features based on word frequency.
  • The n-grams and their frequencies were then stored in data tables, which provide fast indexed lookups.
  • The final solution was implemented using a four-gram language model with the Stupid Backoff smoothing method.
  • The algorithm searches the 4-grams for the last three words of the user input. If no match is found, it backs off to the trigrams (using the last two words), then to the bigrams (using the last word). If all lookups fail, it returns the most frequent unigram as the next word.
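
The back-off lookup described above can be sketched in a few lines. This is a minimal illustration in Python, not the R code behind WordPred: the function names (tokenize, build_tables, predict), the whitespace tokenizer, and the toy corpus are all assumptions made for the example. It ranks candidates by raw count within the first n-gram order that matches. The fixed discount that Stupid Backoff applies to lower-order scores is omitted, since it does not change the ranking when only one order is consulted.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real text cleaning)."""
    return text.lower().split()

def build_tables(sentences, max_n=4):
    """For each order n, map an (n-1)-word context to counts of the next word."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        words = tokenize(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])  # empty tuple for unigrams
                tables[n][context][words[i + n - 1]] += 1
    return tables

def predict(tables, text, k=3, max_n=4):
    """Return up to k candidates from the highest-order table that matches."""
    words = tokenize(text)
    for n in range(max_n, 1, -1):          # 4-grams, then trigrams, then bigrams
        if len(words) < n - 1:
            continue                       # input too short for this order
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return [word for word, _ in followers.most_common(k)]
    # No context matched: fall back to the single most frequent unigram.
    return [word for word, _ in tables[1][()].most_common(1)]

# Toy corpus, invented for this example only.
corpus = [
    "thank you for the follow",
    "thank you for the follow",
    "thank you for the support",
    "looking forward to the weekend",
]
tables = build_tables(corpus)

print(predict(tables, "Thank you for the"))   # 4-gram match: ['follow', 'support']
print(predict(tables, "see you in the"))      # backs off to the bigram table
print(predict(tables, "completely unknown"))  # unigram fallback: ['the']
```

Keying each table by its context tuple makes every prediction a handful of dictionary lookups, which plays the same role as the indexed data tables mentioned above.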

Application features

  • WordPred was developed using Shiny and is hosted on shinyapps.io: https://max2016.shinyapps.io/WordPred/
  • Loads in 10 seconds and returns predictions instantaneously.
  • Returns up to 3 possible next words.
  • Features a simple, easy-to-navigate user interface.

Conclusion

Due to resource limitations on the hosting service, trade-offs needed to be made between speed and accuracy. This app was optimized for speed at the expense of some prediction accuracy.

Planned Enhancements:

  • Coming soon: WordPred 2.0
  • Featuring context and part-of-speech modeling for improved prediction accuracy.

Thank you for using the App.