A Text Predictor Application

Pier Lorenzo Paracchini
28.05.2016

The Challenge

Developing a prediction model for the next word

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain.

When someone types, for example, “I went to the”, the application should present at least three options for what the next word might be, and it should run responsively as a mobile or web app.

The Supporting Data

The data used to build the predictive model comes from the HC Corpora, a collection of three corpora (Twitter, news and blogs) built with the aim of providing a varied and comprehensive sample of current language use.

The original corpora, restricted to the English language (en_US), include:

  • 2,360,148 tweets
  • 1,010,242 news
  • 899,288 blogs

The Ingestion Process

The Language Model: "Stupid" backoff

Different models have been implemented: n-grams (n = 1, 2, 3), linear interpolation of n-grams (n = 1, 2, 3) with Good-Turing smoothing, and “Stupid” backoff (with no discount).
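
To make the idea concrete, here is a minimal Python sketch of “Stupid” backoff scoring over 1-, 2- and 3-gram counts. It is an illustration only, not the code behind the application: the function names (build_counts, score, predict) are hypothetical, and the backoff factor of 0.4 is an assumed value, the one commonly quoted for this method. The score is the relative frequency of the longest matching n-gram; each time the context has to be shortened, the score is multiplied by the backoff factor.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor (commonly quoted value, not from this deck)
MAX_N = 3    # highest n-gram order, matching n = 1, 2, 3 in the text


def build_counts(sentences):
    """Count every n-gram of order 1..MAX_N in a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, MAX_N + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts, total_unigrams):
    """Stupid backoff score of `word` after `context` (a tuple of words).

    Uses the relative frequency of the longest n-gram seen in training and
    multiplies by ALPHA for every backoff step. No discounting is applied,
    so the result is a score, not a normalized probability.
    """
    factor = 1.0
    context = tuple(context)
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return factor * counts[ngram] / counts[context]
        context = context[1:]  # drop the oldest word and back off
        factor *= ALPHA
    return factor * counts[(word,)] / total_unigrams


def predict(context, counts, vocabulary, total_unigrams, k=3):
    """Return the k highest-scoring next words (ties broken alphabetically)."""
    context = tuple(context)[-(MAX_N - 1):]  # keep only the last two words
    ranked = sorted(
        vocabulary,
        key=lambda w: (-score(w, context, counts, total_unigrams), w),
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy training data, invented for this example.
    sentences = [
        ["i", "went", "to", "the", "store"],
        ["i", "went", "to", "the", "park"],
        ["i", "went", "to", "the", "store"],
        ["she", "went", "to", "school"],
    ]
    counts = build_counts(sentences)
    total = sum(len(s) for s in sentences)
    vocabulary = {w for s in sentences for w in s}
    print(predict(("i", "went", "to", "the"), counts, vocabulary, total))
```

Scoring every vocabulary word, as predict does here, is only practical for a toy vocabulary; a responsive app would more likely look up precomputed candidates per context.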

Model evaluation was done using perplexity on an ad hoc test dataset (around 40 sentences). The “Stupid” backoff model achieved the lowest perplexity.
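
For reference, perplexity over a test set of N words is exp(-(1/N) Σ log p(w_i | context)), so lower values mean the model is less "surprised" by the test sentences. The Python sketch below shows one way this could be computed. The function name and its prob_fn parameter are hypothetical, and the two-word context window is an assumption that matches a trigram model. Since “Stupid” backoff returns scores rather than normalized probabilities, a value computed this way for that model is best read as a relative measure for comparing models.

```python
import math


def perplexity(sentences, prob_fn, max_context=2):
    """Perplexity of tokenized `sentences` under a conditional model.

    `prob_fn(word, context)` must return the model's probability (or score)
    for `word` given a tuple of up to `max_context` preceding words.
    Returns math.inf if the model assigns zero to any test word.
    """
    log_sum = 0.0
    n_words = 0
    for tokens in sentences:
        for i, word in enumerate(tokens):
            context = tuple(tokens[max(0, i - max_context):i])
            p = prob_fn(word, context)
            if p <= 0.0:
                return math.inf
            log_sum += math.log(p)
            n_words += 1
    if n_words == 0:
        raise ValueError("no tokens to evaluate")
    return math.exp(-log_sum / n_words)


if __name__ == "__main__":
    # A uniform model over a four-word vocabulary, invented for this example.
    test_sentences = [["i", "went", "to", "the"], ["the", "i"]]
    print(perplexity(test_sentences, lambda word, context: 0.25))
```

A uniform model over V words has perplexity V, which is a handy sanity check for an implementation like this one.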

The Application - Basic Usage

The App

Kudos

I would like to express my deepest appreciation to the great professors of Johns Hopkins University for making this specialization available on Coursera. Special kudos to all of the participants of this Capstone project for the valuable discussions, tips and tricks shared in the forums. If you want to keep in touch, please add me to your LinkedIn connections.

It has been a long and challenging journey with ups and downs, worth every single moment. Thank you to you all!!