Introducing the Next Word Prediction App

Roger Hu
11/3/2019

Introduction

The main goal of this project is to build a Shiny application that predicts the next word from the immediately preceding words.

  • Training data: the original corpus was provided by the company SwiftKey and can be accessed here
  • The original corpus consists of over 4.2 million lines of English text from three sources: Twitter, blogs, and news articles. For practical reasons, 10% of the lines were randomly sampled from the original corpus and used to build the prediction model for this project.
  • The Next Word Prediction App is hosted on the Shiny.IO server
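
The 10% random sampling step can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the R code behind the app; the function name `sample_lines`, the seed value, and the stand-in corpus are all assumptions.

```python
import random

def sample_lines(lines, fraction=0.10, seed=42):
    """Return a random subset holding roughly `fraction` of `lines`."""
    rng = random.Random(seed)            # fixed seed keeps the sample reproducible
    k = round(len(lines) * fraction)     # number of lines to keep
    return rng.sample(lines, k)          # sampling without replacement

# Stand-in for the real corpus lines (hypothetical data)
corpus = [f"line {i}" for i in range(1000)]
subset = sample_lines(corpus)
print(len(subset))
```

Sampling without replacement keeps each retained line unique, and fixing the seed makes the training subset repeatable between runs.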

Algorithm

After the original corpus is sampled and processed (text cleaning and stemming):

  • the 'quanteda' package is used to create the N-gram model; a tri-gram model (N = 3) is built for this application
  • 'Modified Kneser-Ney Smoothing' is applied to the tri-gram model
  • the 'data.table' package is used for performing calculations and for retrieving data/making predictions based on the smoothed tri-gram model
  • the model uses the two immediately preceding words as the “base” for predicting the next word
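
The idea of using the two preceding words as the base for a tri-gram lookup can be sketched as follows. This is a simplified Python illustration, not the app's implementation: it ranks candidates by raw tri-gram counts rather than by Modified Kneser-Ney probabilities, and the function names `build_trigram_counts` and `predict_next` and the toy sentences are assumptions.

```python
from collections import Counter, defaultdict

def build_trigram_counts(sentences):
    """Map each two-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Slide a window of three tokens over the sentence
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(w1, w2)][w3] += 1
    return counts

def predict_next(counts, text, top_n=5):
    """Return up to `top_n` candidate next words for the last two words of `text`."""
    tokens = text.lower().split()
    if len(tokens) < 2:
        return []                        # not enough context for a tri-gram lookup
    context = tuple(tokens[-2:])         # the two-word "base"
    followers = counts.get(context, Counter())
    # Raw counts stand in for smoothed probabilities in this sketch
    return [word for word, _ in followers.most_common(top_n)]

# Toy training text (hypothetical)
sentences = [
    "i want to go home",
    "i want to eat now",
    "i want to go out",
    "you want to go home",
]
counts = build_trigram_counts(sentences)
print(predict_next(counts, "I want to"))   # ['go', 'eat']
```

With raw counts, a base that never appeared in the training text yields no prediction at all. Smoothing methods such as Kneser-Ney address this by redistributing probability mass toward lower-order N-grams.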

App Interface

[Screenshot of the app interface]

How to Use the App

  • Start typing into the input box located on the left side of the app
  • The words used for making the prediction and the top 5 predicted next words are displayed on the right side
  • Please note that, for both the user's input and the training text data:
    • English stop words are removed
    • Words are stemmed
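
The effect of these two preprocessing steps can be shown with a small sketch. It is written in Python for illustration only: the stop-word set is a tiny illustrative subset, and `toy_stem` is a deliberately crude suffix stripper standing in for a real stemmer, so neither reflects the exact lists or rules the app uses.

```python
import re

# Illustrative subset only; real stop-word lists are much longer
STOP_WORDS = {"a", "an", "the", "of", "is", "are", "to", "and", "in"}
SUFFIXES = ("ing", "ed", "s")            # toy rules, not a real stemming algorithm

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The dogs are walking in the park"))   # ['dog', 'walk', 'park']
```

Because stop words are dropped before prediction, the two words the app displays as the basis for its prediction may differ from the last two words actually typed.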
