NextWordPrediction

Mahendra Kumar lal
20-May-2019

Coursera Data Science Specialization Capstone Project

Introduction

  • Approach

The data come from the HC Corpora collection and consist of three files (blogs, news and Twitter).

After loading the data, a sample was drawn, cleaned and prepared for use as a text corpus. The text was converted to lower case, and punctuation, links, extra whitespace, numbers and profanity were removed.
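The cleaning steps above can be sketched as follows. This is a minimal illustration in Python, not the author's actual code (the app itself is built with Shiny). The `clean_text` function, the tiny `PROFANITY` set and the choice to keep apostrophes inside contractions are all assumptions made for the example.

```python
import re

# Hypothetical stand-in for a real profanity word list.
PROFANITY = {"darn", "heck"}

def clean_text(text, banned=PROFANITY):
    """Lowercase text and strip links, punctuation, numbers, profanity and extra whitespace."""
    text = text.lower()
    # Remove links before punctuation, otherwise URL fragments would survive as words.
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    # Replace digits and punctuation with spaces (assumption: apostrophes are kept).
    text = re.sub(r"[^a-z\s']", " ", text)
    # Splitting on whitespace also collapses repeated spaces.
    words = [w.strip("'") for w in text.split()]
    words = [w for w in words if w and w not in banned]
    return " ".join(words)

print(clean_text("Visit http://example.com NOW, it's 100% darn good!  "))
```

Removing links first matters: once punctuation is stripped, a URL can no longer be recognized as one.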

The sample text was “tokenized” into n-grams to construct the predictive models. (Tokenization is the process of breaking a stream of text into words or phrases. An n-gram is a contiguous sequence of n items from a given sequence of text.)
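As a small illustration of these two definitions, the sketch below splits a cleaned sentence into word tokens and slides a window of size n over them. It is written in Python for illustration only; the helper names `tokenize` and `ngrams` are hypothetical, and whitespace splitting is an assumed tokenization rule.

```python
def tokenize(text):
    """Split cleaned text into word tokens (assumption: whitespace-delimited words)."""
    return text.split()

def ngrams(tokens, n):
    """Return every contiguous run of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("the cat sat on the mat")
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 4))  # quadgrams
```

A text shorter than n tokens simply yields no n-grams.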

The n-gram files, stored as data.frames (unigram, bigram, trigram and quadgram), are frequency tables of word sequences. The algorithm uses them to predict the next word based on the text entered by the user.
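A frequency table of this kind can be sketched with a counter per n-gram order. The example below is an illustrative Python sketch, not the author's data.frame code; `build_frequency_tables` and its `max_n` parameter are hypothetical names, and the input is assumed to be already cleaned sentences.

```python
from collections import Counter

def build_frequency_tables(sentences, max_n=4):
    """Count how often each 1- to max_n-gram occurs across the sentences."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            # Slide a window of n tokens over the sentence.
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables

tables = build_frequency_tables(["the cat sat", "the cat ran"])
print(tables[2].most_common(1))
```

Counting within each sentence separately keeps n-grams from spanning sentence boundaries, which is one possible design choice among several.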

Description of the Algorithm

Capture input text, including all preceding words in the phrase

Iteratively traverse n-grams (longest to shortest) for matches

On match(es), use the longest, most common, n-gram

Last word in the matching n-gram is the predicted next word

If no match is found in the {4, 3, 2}-grams, fall back to a randomly selected 1-gram from among the most frequent ones (i.e. a common word)
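The steps above amount to a simple back-off lookup, which might be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the app's actual code: `predict_next`, `top_unigrams` and the layout of `tables` (a dict mapping n to a counter of n-gram tuples) are hypothetical, and the longest order is assumed to be the quadgram.

```python
import random
from collections import Counter

def predict_next(text, tables, max_n=4, top_unigrams=3, rng=random):
    """Back off from the longest n-gram order to the shortest until a match is found."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough preceding words for this order
        prefix = tuple(words[-(n - 1):])
        # Linear scan for clarity; a real table would be indexed by prefix.
        candidates = {g: c for g, c in tables.get(n, {}).items() if g[:-1] == prefix}
        if candidates:
            # The last word of the most frequent matching n-gram is the prediction.
            return max(candidates, key=candidates.get)[-1]
    # No match at any order: pick at random among the most frequent unigrams.
    common = [g[0] for g, _ in Counter(tables[1]).most_common(top_unigrams)]
    return rng.choice(common)

tables = {
    1: Counter({("the",): 5, ("cat",): 2}),
    2: Counter({("the", "cat"): 3, ("the", "dog"): 1}),
    3: Counter({("saw", "the", "dog"): 2}),
    4: Counter(),
}
print(predict_next("I saw the", tables))
```

In the usage example the trigram match wins over the more frequent bigram because longer matches are tried first.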

The App - NextWordPredictor

  • Description of App

The Shiny application predicts the next likely word in a sentence.

The user enters text in an input box, and the application returns the most probable next word in a second box.

The predicted word is obtained from the n-gram tables by comparing the tokenized input against the stored frequencies of 2-, 3- and 4-gram sequences.

As the user types, the field showing the predicted next word refreshes instantaneously, and the prediction is offered for the user to choose.

text-predictor interactively performs word/phrase completion!

The App

  • Performance
  • 17% Accuracy (using only first, top-ranked response)
  • 24% Accuracy (selecting from top-5 ranked responses)
  • Mean Response Time: 280 ms
  • Memory: 7.5 MB compressed, 95 MB in-memory
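
For readers unfamiliar with the two accuracy figures, the sketch below shows one common way top-1 and top-k accuracy can be computed. It is an illustrative Python sketch only: the source does not describe its evaluation procedure, and `top_k_accuracy`, `rank_candidates` and the toy data are hypothetical.

```python
def top_k_accuracy(rank_candidates, test_cases, k):
    """Fraction of cases whose true next word is among the first k ranked predictions."""
    if not test_cases:
        return 0.0
    hits = 0
    for context, actual in test_cases:
        if actual in rank_candidates(context)[:k]:
            hits += 1
    return hits / len(test_cases)

# Toy ranker standing in for a real model (hypothetical data).
lookup = {"a": ["x", "y"], "b": ["y", "z"]}
def toy_ranker(context):
    return lookup.get(context, [])

cases = [("a", "x"), ("b", "z"), ("a", "q")]
print(top_k_accuracy(toy_ranker, cases, 1))
print(top_k_accuracy(toy_ranker, cases, 2))
```

Top-k accuracy can never be lower than top-1 accuracy on the same test cases, since the top-ranked response is always among the first k.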