Back-off n-gram text prediction

Hung Dinh
July 2, 2018

Project summary

This project builds a text prediction app, inspired by the SwiftKey app, as the final project of the Data Science Specialization by Johns Hopkins University.

The app suggests the next word given text typed by the user.

In this presentation, I will summarize the key points of:

  • data processing
  • model prediction and validation
  • the final app

Data summary and processing

The data comes from 3 sources: English news articles, blog posts, and Twitter posts, totaling more than 4 million lines and more than 100 million words.

With the training set (90% of the total data), I:

  • break each line into single words (tokenization)
  • remove unnecessary components (punctuation, profanity…)
  • create n-grams (sequences of n consecutive words)
  • count the frequency of each n-gram: n-grams that appear more often rank higher as predictions (see the sketch below)
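As an illustration, here is a minimal sketch of the n-gram counting step in R. The variable `lines` and the function `make_ngrams` are hypothetical names for this example; the actual processing uses its own cleaning rules and may rely on different packages.

```r
# Sketch only: build n-grams from cleaned text and count their frequency.
make_ngrams <- function(lines, n) {
  tokens <- strsplit(tolower(gsub("[[:punct:]]", "", lines)), "\\s+")
  unlist(lapply(tokens, function(w) {
    w <- w[w != ""]                          # drop empty tokens
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

# Toy example: higher-count n-grams rank higher as predictions
lines <- c("i want to go home", "i want to go out")
freq4 <- sort(table(make_ngrams(lines, 4)), decreasing = TRUE)
head(freq4)
```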

For more details, please visit: http://rpubs.com/nhohung/NLP_processing

Prediction model and validation

I use a back-off 4-gram prediction model, described below:

  • Keep the last 3 words of the user input.
  • Look up the next word in the 4-gram table. If there is no match, back off to lower-order n-grams; NA is returned if even the 2-gram table has no match (see the sketch after this list).
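A minimal sketch of this back-off lookup in R, assuming data frames `ngram4`, `ngram3`, and `ngram2` (hypothetical names), each with columns `prefix` (the preceding n−1 words), `word` (the next word), and rows already sorted by decreasing frequency:

```r
predict_next <- function(input, ngram4, ngram3, ngram2, k = 5) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 3)  # keep last 3 words
  tables <- list(ngram4, ngram3, ngram2)                   # highest order first
  for (n in 3:1) {                                         # prefix length: 3, 2, 1
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = " ")
    tbl <- tables[[4 - n]]
    hits <- tbl$word[tbl$prefix == prefix]
    if (length(hits) > 0) return(head(hits, k))            # match found: stop backing off
  }
  NA                                                       # no match, even in the 2-gram table
}
```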

Validation: from the test dataset (10% of the total data), select random 3-word sequences (the input) and the word that follows each (the ground truth). The model predicts 5 candidate words from each input; if any of them matches the ground truth, the prediction counts as correct. My model achieves 25% accuracy under this measure.
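The evaluation loop might look like the following sketch, assuming `test_lines` (hypothetical name) holds the test sentences and `predict_next` is the back-off function sketched above:

```r
set.seed(1)
correct <- 0; total <- 0
for (line in sample(test_lines, min(1000, length(test_lines)))) {
  w <- strsplit(tolower(line), "\\s+")[[1]]
  if (length(w) < 4) next                         # need 3 input words + 1 truth word
  i <- sample(seq_len(length(w) - 3), 1)          # random start position
  input <- paste(w[i:(i + 2)], collapse = " ")    # 3-word input
  truth <- w[i + 3]                               # the word that follows
  preds <- predict_next(input, ngram4, ngram3, ngram2, k = 5)
  correct <- correct + (truth %in% preds)
  total <- total + 1
}
correct / total   # top-5 accuracy; 25% for my model
```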

For more details, please visit: http://rpubs.com/nhohung/NLP_prediction

Introduction to the final app

The final model is stored as 3.7 MB of data, which is loaded once when the app starts; after loading, the response time is almost instant.
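The overall structure resembles the minimal Shiny sketch below. The file name `model.rds` and the table names are hypothetical; the real app's code may differ.

```r
library(shiny)

model <- readRDS("model.rds")  # ~3.7 MB of n-gram tables, loaded at startup

ui <- fluidPage(
  textInput("text", "Type something:"),
  textOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderText({
    preds <- predict_next(input$text, model$ngram4, model$ngram3, model$ngram2)
    paste(preds, collapse = " | ")  # show up to 5 suggestions
  })
}

shinyApp(ui, server)
```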

You can see the app here: https://nhohung.shinyapps.io/TextPrediction/.

Once there, just start typing into the top-left box; 5 suggestions will be displayed immediately.