Johns Hopkins Data Science Capstone Project Report: Next Word Prediction Shiny Application

Enock Lukhetfo Dube
22 APRIL 2015

INTRODUCTION

  • Given an incomplete sentence, this Shiny application uses an N-gram language model of US English to predict the next word in the sentence.

  • The language model consists of 1-gram, 2-gram, and 3-gram English tokens built from sample text drawn from news articles, blogs, and Twitter.

  • The algorithm is based on Andrey Markov's assumption and, where necessary, applies a simple Katz back-off algorithm.

  • Generally, the Markov assumption states that a future event (the next word) can be predicted from a relatively short history (context), such as the 1 or 2 previous words.
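
Concretely, under a second-order Markov assumption the probability of the next word given all preceding words is approximated using only the two most recent words, with each conditional probability estimated by maximum likelihood from n-gram counts:

    P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-2}, w_{n-1}) = count(w_{n-2} w_{n-1} w_n) / count(w_{n-2} w_{n-1})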

NEXT WORD PREDICTION ALGORITHM

  1. Extract the last 2 words of the incomplete sentence, hereafter referred to as the context.

  2. Find the 3-gram token whose first two words match the context and that has the highest maximum likelihood probability (MLEprob), and return the last word of that token as the predicted next word.

  3. If no matching 3-gram is found, back off and use only the last word of the context to find the matching 2-gram token with the highest MLEprob, and return its last word as the predicted next word.

  4. If no matching 2-gram is found either, the algorithm returns the keyword UNKNOWN (the full procedure is sketched in R after this list).
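
The following is a minimal R sketch of this back-off lookup, not the application's actual code. It assumes the n-gram tables are data frames with the columns token, count, word_i_1, and MLEprob shown in the sample further below; the function and variable names are illustrative.

# Minimal sketch of the back-off lookup described in steps 1-4 above.
# Table layout follows the bigramMatrix.csv sample shown later; the
# function and variable names here are illustrative assumptions.
predict_next_word <- function(context, trigrams, bigrams) {
  # Step 1: keep only the last two words of the input sentence
  words <- tail(strsplit(tolower(trimws(context)), "\\s+")[[1]], 2)

  # Step 2: 3-grams whose first two words match the context
  if (length(words) == 2) {
    prefix <- paste0(paste(words, collapse = " "), " ")
    hits <- trigrams[startsWith(trigrams$token, prefix), ]
    if (nrow(hits) > 0) {
      best <- hits$token[which.max(hits$MLEprob)]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }

  # Step 3: back off to 2-grams keyed on the last word only
  hits <- bigrams[bigrams$word_i_1 == tail(words, 1), ]
  if (nrow(hits) > 0) {
    best <- hits$token[which.max(hits$MLEprob)]
    return(tail(strsplit(best, " ")[[1]], 1))
  }

  # Step 4: no match at either level
  "UNKNOWN"
}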

THE APPLICATION INTERFACE
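
A minimal sketch of how such an interface can be wired up in Shiny, assuming a single text input and a text output for the predicted word (all identifiers here are hypothetical, not the application's actual code):

# Minimal sketch of a Shiny interface of the kind described: one text box
# for the incomplete sentence and a text output showing the predicted word.
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("sentence", "Type an incomplete sentence:"),
  h4("Predicted next word:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$sentence)
    predict_next_word(input$sentence, trigrams, bigrams)  # see sketch above
  })
}

shinyApp(ui, server)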

THE N-GRAM LANGUAGE MODEL

  • After cleaning the input text, N-grams were extracted and stored in matrix .csv files containing the token counts and their maximum likelihood probabilities.

  • The table below shows a sample of five 2-grams stored in the bigramMatrix.csv file:

                 token count word_i_1     MLEprob
1:     science adviser     1  science 0.004219409
2:         science and    25  science 0.105485232
3:         science are     1  science 0.004219409
4:    science articles     1  science 0.004219409
5: science association     3  science 0.012658228
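
The five rows above are consistent with MLEprob being each 2-gram's count divided by the total count of all 2-grams sharing the same first word (word_i_1): for example, 25 / 237 ≈ 0.1055 for "science and" and 1 / 237 ≈ 0.0042 for "science adviser". A minimal data.table sketch of how that column can be derived from the counts, assuming bigramMatrix.csv has the columns shown:

# Recompute each 2-gram's MLE probability as its count divided by the
# total count of all 2-grams that begin with the same word (word_i_1).
library(data.table)

bigrams <- fread("bigramMatrix.csv")  # columns: token, count, word_i_1, ...
bigrams[, MLEprob := count / sum(count), by = word_i_1]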