Data Science Capstone Word Prediction

JMFlin

Introduction

The purpose of this project is to build a natural language model that predicts the next word of a user-specified sentence. Text from three sources (Twitter, news and blogs) was used to train the model.

Data Preprocessing

The following steps were performed to clean the data files:

  • A subset of the original data was randomly selected from the three sources and merged into one.
  • Data cleaning involved converting to lower case and removing punctuation, numbers and unnecessary white space.
  • N-grams of one to four words were then created from the Twitter, blog and news data.
  • For each of the four n-gram orders, cumulative frequencies were calculated and low-frequency observations removed; the results were then sorted and saved.
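The preprocessing steps above can be sketched as follows. This is a minimal illustration in Python, not the project's actual code: the function names, the `fraction` and `min_count` parameters, and the choice to keep only the letters a to z are all assumptions made for the example, and plain counts stand in for the cumulative frequencies mentioned above.

```python
import random
import re
from collections import Counter


def clean_text(text):
    """Lower-case the text and strip punctuation, digits and extra spaces."""
    text = text.lower().replace("'", "")   # "don't" -> "dont" (assumed choice)
    text = re.sub(r"[^a-z\s]", " ", text)  # assumption: keep letters a-z only
    return " ".join(text.split())          # collapse runs of white space


def sample_lines(lines, fraction, seed=0):
    """Keep each line with probability `fraction` (hypothetical sampler)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def count_ngrams(lines, n):
    """Count every run of n consecutive words across the cleaned lines."""
    counts = Counter()
    for line in lines:
        words = clean_text(line).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def build_tables(lines, max_n=4, min_count=2):
    """Build one frequency table per n-gram order, most frequent first."""
    tables = {}
    for n in range(1, max_n + 1):
        counts = count_ngrams(lines, n)
        # Drop low-frequency observations (threshold is an assumed value).
        kept = {gram: c for gram, c in counts.items() if c >= min_count}
        tables[n] = dict(sorted(kept.items(), key=lambda kv: -kv[1]))
    return tables
```

In this sketch, `build_tables(sample_lines(all_lines, 0.1))` would produce the four tables from a roughly 10% random subset, where `all_lines` is a hypothetical list holding the merged lines of the three sources.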

Prediction Model

The next-word prediction model is based on the Katz back-off algorithm.

  • Clean the user-specified sequence of words by converting to lower case and removing punctuation, numbers and unnecessary white space.
  • Depending on the number of words entered, extract the last one to three words to use as the context for the n-gram lookup.
  • Look for a match in the 4-grams first; if none is found, back off to the 3-grams, and then to the 2-grams.
  • If no match is found in any n-gram with n > 1, use the most frequent word in the corpus.
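The back-off lookup described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: it picks the candidate with the highest raw count and does not apply the discounting used in full Katz back-off. The `TABLES` data, `clean_text` and `predict_next` are hypothetical names introduced for the example.

```python
import re

# Hypothetical toy tables: TABLES[n] maps an n-word tuple to its count.
TABLES = {
    1: {("the",): 5, ("cat",): 2, ("sat",): 2},
    2: {("the", "cat"): 2, ("cat", "sat"): 2},
    3: {("the", "cat", "sat"): 2},
    4: {},
}


def clean_text(text):
    """Lower-case the text and strip punctuation, digits and extra spaces."""
    text = text.lower().replace("'", "")
    text = re.sub(r"[^a-z\s]", " ", text)  # assumption: keep letters a-z only
    return " ".join(text.split())


def predict_next(sentence, tables, max_n=4):
    """Return (predicted word, order of the n-gram that produced it)."""
    words = clean_text(sentence).split()
    # Start with the longest usable n-gram and back off one order at a time.
    for n in range(min(max_n, len(words) + 1), 1, -1):
        context = tuple(words[-(n - 1):])  # last n-1 words typed by the user
        candidates = {
            gram: count
            for gram, count in tables.get(n, {}).items()
            if gram[:-1] == context
        }
        if candidates:
            best = max(candidates, key=candidates.get)
            return best[-1], n
    # No n-gram with n > 1 matched: fall back to the most frequent word.
    unigrams = tables[1]
    return max(unigrams, key=unigrams.get)[0], 1
```

With the toy tables, `predict_next("The cat", TABLES)` matches the 3-gram and returns `("sat", 3)`, while an unseen input such as `"dog"` falls through to the most frequent word, `("the", 1)`. Returning the n-gram order alongside the word makes it possible to report which n-gram produced the prediction.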

Shiny Application

A Shiny application was developed based on the next word prediction model described previously.

  • The user enters a sequence of words in the text box, then presses the button.
  • The predicted next word is displayed, along with text showing which n-gram was used for the prediction.
  • The sentence entered by the user is also displayed in the app.