Word Prediction

Flavio Oliveri
2021/08/12

Johns Hopkins University
Coursera Data Science Specialization

Introduction

The goal of this presentation is to pitch the Word Prediction app and briefly explain the algorithm it uses to predict text.

The user interface is also described.

Word Prediction Application

The Word Prediction application suggests the next word in a phrase using an n-gram algorithm.

The text used to build the model was collected from blogs, news and Twitter data. Bigrams, trigrams and 4-grams were extracted from the corpus and used to build the model.

The Predictive Text Model

To build the model, a sample of 1,000,000 lines from blogs, news and Twitter was used. The sample was tokenized and cleaned by applying these conversions:

  • convert to lowercase
  • remove all non-ASCII characters
  • remove URLs and email addresses
  • remove punctuation and extra whitespace
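
The cleaning steps listed above could be sketched roughly as follows. This is only an illustration in Python, not the app's actual code: the function name `clean_line` and the regular expressions are assumptions, and the real patterns may differ.

```python
import re

# Hypothetical patterns; the app's real rules are not shown in this deck.
URL_RE = re.compile(r"(https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

def clean_line(line):
    """Apply the four conversions and return a list of tokens."""
    text = line.lower()                                    # convert to lowercase
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    text = URL_RE.sub(" ", text)                           # remove URLs
    text = EMAIL_RE.sub(" ", text)                         # remove email addresses
    text = re.sub(r"[^a-z0-9\s]", " ", text)               # remove punctuation
    return text.split()                                    # split on whitespace, dropping extras

tokens = clean_line("Check https://example.com NOW, or mail bob@example.com!!  Thanks ☺")
print(tokens)
```

Splitting on whitespace at the end both tokenizes the line and discards any repeated spaces left behind by the earlier substitutions.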

Bigrams, trigrams and 4-grams were then built from the resulting tokens.
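
As a rough illustration of this step, the sketch below slides a window over each tokenized line and counts every bigram, trigram and 4-gram. The helper names `ngrams` and `count_ngrams` are hypothetical, and Python's `collections.Counter` stands in for whatever frequency table the app actually uses.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(token_lines, orders=(2, 3, 4)):
    """Count n-grams of each order across all tokenized lines."""
    counts = {n: Counter() for n in orders}
    for tokens in token_lines:
        for n in orders:
            counts[n].update(ngrams(tokens, n))
    return counts

lines = [["i", "want", "to", "go"], ["i", "want", "to", "eat"]]
counts = count_ngrams(lines)
print(counts[3].most_common(1))
```

Lines shorter than n simply contribute no n-grams of that order.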

The algorithm searches the n-grams for a match with the last words entered by the user. The prediction comes from the longest matching n-gram with the highest frequency.
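
One way to read this lookup is as a simple back-off: try the last three words against the 4-grams, then the last two against the trigrams, then the last word against the bigrams, and rank the candidates of the first order that matches by frequency. The sketch below assumes that reading. `build_model` and `predict` are hypothetical names, and returning at most `k=3` words is an assumption taken from the number of predictions shown in the user interface.

```python
from collections import Counter, defaultdict

def build_model(token_lines, max_n=4):
    """Map each context (1 to max_n-1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for tokens in token_lines:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def predict(model, text, k=3, max_n=4):
    """Return up to k next words from the longest matching context."""
    tokens = text.lower().split()
    for size in range(max_n - 1, 0, -1):    # longest context first
        if len(tokens) < size:
            continue
        context = tuple(tokens[-size:])
        if context in model:                # membership test adds no keys
            return [word for word, _ in model[context].most_common(k)]
    return []                               # no n-gram matched

lines = [
    ["i", "want", "to", "go", "home"],
    ["i", "want", "to", "go", "home"],
    ["you", "have", "to", "go", "away"],
    ["time", "to", "eat"],
]
model = build_model(lines)
print(predict(model, "need to go"))
```

In this toy corpus, "need to go" has no matching 4-gram, so the lookup falls back to the trigrams that start with "to go".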

User Interface

Once the user finishes typing in the text box, up to 3 predictions appear on the side.