Data Science Capstone Product Pitch

Developing a model that predicts the next word from user input

View the application

Chan

18/01/2016

Introduction

The objective of the project is to develop an application that accepts multiple text inputs from the user and generates a prediction of the next word.

The text data used in this application was obtained from the HC Corpora text corpus. The corpus is over 500 MB in size and contains over 100 million words sourced from blogs, news and Twitter.

Techniques such as natural language processing (NLP), n-grams and Markov chains have been used to produce the prediction model.

The following features are supported in this application:

  • Next word prediction
  • Description of high level approach
  • Links to documentation

High Level Approach

  • Preprocessing - Remove all numbers, punctuation, special characters and surplus whitespace, and convert all words to lowercase

  • Tokenization - Truncate the input string to its last 4 words. All words are used if there are fewer than 4

  • Pattern Matching - Match the input against the 4-gram, 3-gram, 2-gram and 1-gram frequency matrices

  • Next Word Prediction - The matched pattern with the highest frequency in the frequency matrices supplies the predicted next word
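
The four steps above can be sketched in a few lines of Python. This is only an illustration, not the application's own code. It assumes that the frequency matrices are stored as lookup tables keyed by the preceding words, and that the tables are tried from the 4-gram down to the 1-gram until one matches. The names `clean`, `build_tables` and `predict_next` and the toy corpus are hypothetical.

```python
import re
from collections import Counter, defaultdict


def clean(text):
    """Preprocessing: lowercase, drop everything except letters, split on whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower()).split()


def build_tables(sentences, max_n=4):
    """Build one frequency table per n: context words -> counts of the following word."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        words = clean(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                tables[n][tuple(gram[:-1])][gram[-1]] += 1
    return tables


def predict_next(text, tables, max_n=4):
    """Try the longest usable context first, then fall back to shorter ones."""
    words = clean(text)[-max_n:]  # Tokenization: keep at most the last 4 words
    for n in range(max_n, 0, -1):
        k = n - 1  # assumption: an n-gram table is keyed by n - 1 context words
        if k > len(words):
            continue
        context = tuple(words[len(words) - k:])
        candidates = tables[n].get(context)
        if candidates:
            # Next Word Prediction: the most frequent continuation wins
            return candidates.most_common(1)[0][0]
    return None


corpus = [
    "I would like to eat pizza",
    "I would like to eat pasta",
    "I would like to eat pizza tonight",
    "She would like to sleep",
]
tables = build_tables(corpus)
print(predict_next("We would like to", tables))
print(predict_next("Like to EAT!", tables))
```

In this sketch an unseen word such as "zebra" finds no match in the longer tables, so the lookup ends at the 1-gram table, which simply returns a most frequent word from the corpus.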

Algorithms

Natural language processing (NLP)
Ability to process text and make the information accessible to computer applications. This approach is used to cleanse the text corpus by stemming and by removing numbers, punctuation and special characters.

n-gram model
Probabilistic model that predicts the next item in a sequence from the preceding n-1 items, where an n-gram is a contiguous sequence of n items taken from a given sequence of words. This model is used to generate unigrams, bigrams, trigrams and quadgrams from the tokenization of the text corpus.
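
As a minimal sketch of how such n-grams might be generated and counted, the Python snippet below slides a window of n words over a token list. The helper name `ngrams` and the sample sentence are hypothetical and are not taken from the application.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox jumps over the quick brown dog".split()

# One frequency table each for unigrams, bigrams, trigrams and quadgrams
counts = {n: Counter(ngrams(tokens, n)) for n in (1, 2, 3, 4)}

print(counts[2][("the", "quick")])
print(counts[4].most_common(1))
```

A sequence of 10 tokens yields 10 - n + 1 n-grams, so the longer the n-gram, the fewer and sparser the entries in its table.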

Markov chain
Stochastic model that describes a chain of linked events, where what happens next depends only on the current state of the system. This model is used to compute the probabilities of each n-gram token and store them in term frequency matrices.
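
To illustrate the idea, the Python sketch below treats each word as a state and estimates the probability of the next word as its relative frequency after the current word. This first-order, maximum-likelihood estimate is an assumption made for the example; the application's actual probability calculation may differ. The function name `transition_probabilities` is hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(tokens):
    """Estimate P(next word | current word) from adjacent word pairs."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return {
        state: {word: c / sum(freq.values()) for word, c in freq.items()}
        for state, freq in counts.items()
    }


tokens = "i am here i am there i was here".split()
probs = transition_probabilities(tokens)

# "i" is followed by "am" twice and by "was" once
print(probs["i"])
```

Each row of the resulting table sums to 1, which is what allows it to be read as a set of transition probabilities out of one state.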

Running the App

The user begins by typing a word or phrase in the input box. The application then refreshes and displays the list of entered words and the next word prediction.

The application can be accessed online on RStudio's Shinyapp Server.