Word Prediction Application

Mauricio Ramos
May 24, 2018

Word prediction application

This is a prototype Shiny application required by the capstone project of Johns Hopkins University's Data Science Specialization.

The application takes a phrase as input and outputs a prediction of the next word using a Natural Language Processing (NLP) machine learning model trained on a large corpus of English texts from blogs, news and Twitter, available here.

The NLP machine learning model

The model is based on the technologies and techniques cited in the last slide. The overall steps are:

  1. Build a corpus from 3 English files containing 4M+ lines and 100M+ words, with a total size of 552 MB;
  2. Segment the 4M+ lines into 8M+ sentences;
  3. Generate 116M+ lowercased unigrams, removing numbers, non-word characters, URLs, and words longer than 20 characters;
  4. Generate n-gram frequency tables for orders 1 to 5;
  5. Remove singletons from the n-gram tables of orders 4 and 5;
  6. Compress the data by storing integers rather than floating-point numbers and factors rather than character strings;
  7. Keep only the single most frequent n-gram for each sequence of n-1 preceding words.
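
Steps 3 to 5 and 7 can be sketched roughly as follows. This is an illustrative Python sketch, not the application's actual code (the application itself is built with Shiny); the names `tokenize`, `ngram_counts` and `build_tables`, the regular expressions, and the tie-breaking rule (the first n-gram seen wins) are all assumptions made for the example. The integer and factor compression of step 6 is not shown.

```python
import re
from collections import Counter

# Assumed patterns: URLs are removed first, then only alphabetic words
# (optionally with internal apostrophes) are kept, which drops numbers
# and other non-word characters.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(sentence, max_len=20):
    """Lowercase a sentence and return its words, skipping overly long ones."""
    text = URL_RE.sub(" ", sentence.lower())
    return [w for w in WORD_RE.findall(text) if len(w) <= max_len]

def ngram_counts(sentences, order):
    """Count n-grams of one order without crossing sentence boundaries."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts

def build_tables(raw_sentences, max_order=5, prune_from=4):
    """Return {order: {prefix_tuple: next_word}} for orders 1..max_order.

    Order 1 is keyed by the empty prefix () and holds the most frequent word.
    """
    sentences = [tokenize(s) for s in raw_sentences]
    tables = {}
    for order in range(1, max_order + 1):
        counts = ngram_counts(sentences, order)
        if order >= prune_from:
            # Step 5: drop n-grams seen only once in the higher-order tables.
            counts = Counter({g: c for g, c in counts.items() if c > 1})
        best = {}
        for gram, c in counts.items():
            prefix, word = gram[:-1], gram[-1]
            # Step 7: keep one continuation per prefix (first seen wins ties).
            if prefix not in best or c > best[prefix][1]:
                best[prefix] = (word, c)
        tables[order] = {p: w for p, (w, _) in best.items()}
    return tables
```

Keeping a single continuation per prefix turns each table into a plain lookup from the preceding words to one predicted word, which is what makes the model small enough for a responsive application.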

The prediction algorithm

Currently it is a basic prediction algorithm that prioritizes responsiveness and low memory usage.

It uses at most 4 predictor words.

It looks up the last k predictor words in the n-gram table of order k+1.

If the last k predictor words are not found, it looks up the last k-1 predictor words in the n-gram table of order k, and so on.
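
The back-off lookup described above can be sketched as follows. This is a minimal Python illustration, not the application's actual code: the `predict_next` function, the toy `TABLES` dictionary and its layout (`{order: {prefix_tuple: next_word}}`), and the final fallback to the most frequent single word when no context matches are all assumptions made for the example.

```python
# Hypothetical toy tables: {order: {tuple of preceding words: predicted word}}.
# Order 1 is keyed by the empty tuple and holds the most frequent word.
TABLES = {
    1: {(): "the"},
    2: {("like",): "green", ("green",): "tea"},
    3: {("like", "green"): "tea"},
}

def predict_next(phrase, tables, max_predictors=4):
    """Try the longest usable context first, then back off to shorter ones."""
    words = phrase.lower().split()
    k = min(max_predictors, len(words))
    while k >= 1:
        context = tuple(words[-k:])           # last k predictor words
        word = tables.get(k + 1, {}).get(context)  # table of order k+1
        if word is not None:
            return word
        k -= 1                                # back off to k-1 words
    # Assumed fallback: most frequent unigram, or None if no table has it.
    return tables.get(1, {}).get(())

print(predict_next("I really like green", TABLES))
print(predict_next("we like", TABLES))
```

With these toy tables, the first call finds no match for 4 or 3 predictor words and succeeds with the last 2 ("like green"), while the second call backs off to the single word "like".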

How to use the application

  1. Go to https://mauriciocramos.shinyapps.io/predictWord/

  2. Type in one or more words

  3. Press the button

  4. See the predicted word just below the button.

References