Data Science Capstone Next Word Prediction App

Leandro Guerra
August 2015

Executive Summary

The main idea behind text prediction is to estimate the next character or word given the history of the input so far. This can help reduce mistyped words and suggest the most likely next word.
The objective of this project is to develop a predictive text algorithm trained on large data sets drawn from different sources: blogs, Twitter and news.

Technical background

The approach is based on Claude Shannon's 1948 landmark paper “A Mathematical Theory of Communication”.
It uses a Markov chain to build a statistical model of word sequences.
Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering.
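
As a rough illustration of the Markov chain idea, the sketch below counts which word follows which in a token list and converts the counts into relative frequencies. It is a minimal first-order example, not the app's actual code. The function names (build_transitions, transition_probabilities) and the toy corpus are assumptions made up for this illustration.

```python
from collections import Counter, defaultdict


def build_transitions(tokens):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def transition_probabilities(counts, word):
    """Turn the follower counts for `word` into relative frequencies."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # word never seen with a follower
    return {w: c / total for w, c in followers.items()}


# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat and the cat slept".split()
counts = build_transitions(tokens)

# "cat" follows "the" in 2 of 3 cases, "mat" in 1 of 3.
probs = transition_probabilities(counts, "the")
print(probs)
```

In this simplified model the next word depends only on the current word. Longer contexts follow the same pattern with tuples of words as keys.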

Algorithm Details

To start, the main technique used is the n-gram approach, where an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on.
These larger sizes are not used in this project.
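
The definition above can be made concrete with a short sketch that slides a window of n tokens over a sentence. The helper name ngrams and the sample sentence are hypothetical, and splitting on whitespace is a simplifying assumption. A real pipeline would also need cleaning and proper tokenization.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Sample sentence, invented for illustration only.
tokens = "thank you very much".split()

unigrams = ngrams(tokens, 1)   # four 1-tuples, one per word
bigrams = ngrams(tokens, 2)    # ("thank", "you"), ("you", "very"), ("very", "much")
trigrams = ngrams(tokens, 3)   # ("thank", "you", "very"), ("you", "very", "much")

print(bigrams)
print(trigrams)
```

A sequence shorter than n yields an empty list, so the same helper can be applied to short inputs without special cases.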

App Overview

This app reads your text input and predicts the next word by searching for the most likely n-grams.
It only considers up to the last 3 words entered.
This first version accepts English input only.
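
One plausible way to implement the lookup described above is sketched here. It assumes a simple backoff strategy: try the longest context available (up to the last 3 words entered) and, if it was never seen, drop the oldest word and retry. The source does not state how the app ranks candidates or how the three-word limit maps to n-gram sizes, so the backoff scheme, the names train, predict and MAX_CONTEXT, and the toy corpus are all assumptions for illustration.

```python
from collections import Counter, defaultdict

MAX_CONTEXT = 3  # assumed from "up to the last 3 words entered"


def train(tokens, max_context=MAX_CONTEXT):
    """Map each context of 1..max_context words to a Counter of next words."""
    model = defaultdict(Counter)
    for size in range(1, max_context + 1):
        for i in range(len(tokens) - size):
            context = tuple(tokens[i:i + size])
            model[context][tokens[i + size]] += 1
    return model


def predict(model, text, max_context=MAX_CONTEXT):
    """Return the most frequent follower of the longest known context, else None."""
    words = text.lower().split()[-max_context:]
    while words:
        followers = model.get(tuple(words))
        if followers:
            return followers.most_common(1)[0][0]
        words = words[1:]  # back off: drop the oldest word and retry
    return None


# Toy corpus, invented for illustration only.
corpus = "i want to go home i want to go out i want to go home now".split()
model = train(corpus)

print(predict(model, "I want to"))    # full three-word context is known
print(predict(model, "we have to"))   # backs off to the single word "to"
print(predict(model, "xyz"))          # unknown word, no prediction
```

Raw counts are used here to keep the sketch short. Smoothing or a more refined backoff scheme could replace the simple most-frequent rule.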