Data Science Capstone Project: Natural Language Processing - Next Word Prediction App

P.Wang
December 10, 2014

App Introduction

Using natural language processing techniques, this application performs text mining and predicts the next word when a phrase is entered.

  • Data Source: The app draws on three sets of writing samples: US Twitter (~2.36 M tweets), US Blogs (~0.9 M blog entries), and US News (~1 M news items).

  • Data Processing: Text from the Twitter, blog, and news samples is cleaned by removing numbers, punctuation, extra whitespace, and profanity, and by converting everything to lowercase, among other steps. The cleaned text is then used to build 3-, 4-, and 5-gram models.
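The cleaning and n-gram counting steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: the function names `clean_text` and `count_ngrams`, the placeholder `PROFANITY` set, and the choice to keep apostrophes inside words are all assumptions made for the example.

```python
import re
from collections import Counter

# Placeholder list (assumption); the profanity list used by the app is not given.
PROFANITY = {"darn", "heck"}

def clean_text(text, banned=PROFANITY):
    """Lowercase the text, drop numbers and punctuation, collapse whitespace,
    and remove banned words. Returns a list of tokens."""
    text = text.lower()
    # Replace anything that is not a letter, apostrophe, or space with a space.
    # This removes digits and punctuation in one pass.
    text = re.sub(r"[^a-z' ]+", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in banned]

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = clean_text("Hello, World! 123 darn  it")
trigrams = count_ngrams(["a", "b", "c", "a", "b", "c"], 3)
```

Here `tokens` holds the cleaned words of the sample sentence, and `trigrams` maps each 3-word tuple to the number of times it occurs.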

Algorithm

  • The algorithm is based on the N-gram method.
  • 3-, 4-, and 5-gram models are built from the SwiftKey project data (word frequency > 2).
  • Each 3-, 4-, and 5-gram is split into its first 2, 3, or 4 words and its last word; the number of input words determines which model is searched.
  • If the input has more than 4 words, only the last 4 are considered, and the prediction comes from the 5-gram model.
  • The top 3 most frequent predictions for the next word are returned, in order of frequency.
  • If nothing can be predicted, a “no prediction” message is displayed.
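The lookup described in the list above might look something like this in Python. It is a sketch under stated assumptions rather than the app's implementation: `build_model` and `predict` are hypothetical names, "frequency > 2" is read as "the n-gram occurs at least 3 times", and inputs shorter than 2 words simply return no prediction because the slides do not say how they are handled.

```python
from collections import Counter, defaultdict

def build_model(tokens, n, min_count=3):
    """Map each (n-1)-word prefix to a Counter of the words that follow it,
    keeping only n-grams seen at least min_count times (assumed threshold)."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    model = defaultdict(Counter)
    for gram, c in counts.items():
        if c >= min_count:
            # Split the n-gram into its first n-1 words and its last word.
            model[gram[:-1]][gram[-1]] = c
    return dict(model)

def predict(models, phrase, top_k=3):
    """Return up to top_k next-word candidates, most frequent first.
    models maps n (3, 4, or 5) to the output of build_model."""
    # Only the last 4 words are considered, which selects the 5-gram model.
    words = phrase.lower().split()[-4:]
    model = models.get(len(words) + 1)
    if model is None:
        return []  # e.g. a 1-word input: no matching model in this sketch
    candidates = model.get(tuple(words))
    if not candidates:
        return []  # caller shows the "no prediction" message
    return [word for word, _ in candidates.most_common(top_k)]

corpus = ("i love you " * 3 + "i love cake " * 5).split()
models = {3: build_model(corpus, 3)}
suggestions = predict(models, "I love")
```

An empty list stands in for the "no prediction" case, so the user-interface layer can decide what message to show.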

How to Use the App

  • On the app page, type your words in the “Enter your words” box. Wait for the entered words to appear in the “You entered” box.
  • The top 3 most frequent predictions for the next word are shown, in order, in the prediction box.

How to Use the App - continued

  • If nothing can be predicted, “–Sorry no prediction for your word–” is displayed in the prediction box.