Data Science Capstone Next Word Prediction App

Leandro Guerra
August 2015

Executive Summary

  • The main idea behind text prediction is to estimate the next character or word given the input history. This can help reduce mistyped words and suggest the word most likely to come next.

  • The objective of this project is to develop a text prediction algorithm derived from large data sets drawn from different sources: blogs, Twitter and news data.

Technical background

  • Based on Claude Shannon's landmark 1948 paper “A Mathematical Theory of Communication”.
  • Uses a Markov chain to build a statistical model of word sequences.
  • Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering.
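
A minimal sketch of the Markov chain idea, assuming a first-order model over whitespace-separated words. The function name `train_markov_chain` and the toy corpus are illustrative assumptions, not part of the app itself.

```python
from collections import Counter, defaultdict

def train_markov_chain(words):
    """Estimate P(next word | current word) from adjacent word pairs."""
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    # Normalize each word's follower counts into a probability distribution.
    return {
        current: {w: c / sum(followers.values()) for w, c in followers.items()}
        for current, followers in counts.items()
    }

# Toy corpus (an assumption for illustration only).
toy_corpus = "the cat sat on the mat and the cat slept".split()
transitions = train_markov_chain(toy_corpus)
```

Here `transitions["the"]` holds the estimated probability of each word that followed "the" in the toy corpus.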

Algorithm Details

  • The main technique used is the n-gram approach, where an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

  • An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on.

  • These larger sizes are not used in this project.
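
The n-gram definition above can be sketched in a few lines. The helper name `ngrams` and the whitespace tokenization are assumptions for illustration; a real pipeline would typically also clean and normalize the text.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy sentence (an assumption for illustration only).
tokens = "to be or not to be".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

For this six-word sentence the sketch yields six unigrams, five bigrams and four trigrams.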

App Overview

  • This app reads your text input and predicts the next word by searching through the most likely n-grams.
  • It considers at most the last 3 words entered.
  • This first version accepts English input only.
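
A rough sketch of how such a lookup could work, assuming a simple back-off strategy: try the longest available context (up to the last 3 words) and fall back to shorter ones when no match is found. The back-off rule, the lowercasing, the names `build_tables` and `predict_next`, and the toy corpus are all assumptions; the slides do not specify how the app ranks candidates.

```python
from collections import Counter, defaultdict

MAX_CONTEXT = 3  # the app considers at most the last 3 words entered

def build_tables(tokens, max_context=MAX_CONTEXT):
    """Map each context of 1..max_context words to a Counter of next words."""
    tables = defaultdict(Counter)
    for size in range(1, max_context + 1):
        for i in range(len(tokens) - size):
            context = tuple(tokens[i:i + size])
            tables[context][tokens[i + size]] += 1
    return tables

def predict_next(text, tables, max_context=MAX_CONTEXT):
    """Return the most frequent continuation, trying the longest context first."""
    words = text.lower().split()
    for size in range(min(max_context, len(words)), 0, -1):
        followers = tables.get(tuple(words[-size:]))
        if followers:
            return followers.most_common(1)[0][0]
    return None  # no context from the input appears in the corpus

# Toy corpus (an assumption for illustration only).
toy_tokens = "i want to eat . i want to sleep . i want to eat now".split()
tables = build_tables(toy_tokens)
```

With these tables, `predict_next("I want to", tables)` matches the full three-word context, while an input such as "they want to" falls back to the two-word context "want to".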