Final Capstone Project - typeR (Text Prediction)

Daniel Amaral
29/08/2020

Introduction

This is the Capstone project for Coursera's Data Science Specialization. The goal of this exercise is to create an interactive product (a Shiny app) that uses a natural language processing (NLP) model to predict the next words a user will most likely type.

The following slides cover:

  • The dataset used
  • The models and methods used for prediction
  • The Shiny app, including instructions on how to use it

The Dataset

  • The training dataset comes from the Capstone SwiftKey data
  • A 10% sample of the US blogs, news, and Twitter data is used to generate the predictions
  • Basic cleanup removes repeated whitespace, symbols, capitalization, and punctuation
  • The cleaned dataset is tokenized and grouped into 4-grams, 3-grams, and 2-grams
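
The preparation steps above can be sketched in a few lines. The app itself is a Shiny (R) app and its code is not shown here, so this Python sketch is only an illustration: the function names `sample_lines`, `clean`, and `ngram_counts`, the seed, and the exact cleaning rule (keep only the letters a to z) are assumptions, not the author's implementation.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.10, seed=42):
    """Keep each line with probability `fraction` (assumed sampling scheme)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean(text):
    """Lowercase, replace anything that is not a-z or whitespace, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def ngram_counts(lines, n):
    """Count every run of n consecutive tokens in the cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = clean(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


lines = ["The cat sat on the mat.", "The cat  ran!"]
bigrams = ngram_counts(lines, 2)
print(bigrams[("the", "cat")])  # -> 2
```

Calling `ngram_counts` with `n` set to 4, 3, and 2 would give the three tables used in the next slide. Note that this cleaning rule also splits contractions such as "don't" into two tokens; a real app may want to treat apostrophes differently.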

The Algorithm

  • The predicted next word is based on Katz's back-off model
  • First, the last three words of the user's input are looked up in the four-gram table
  • If no match is found, the last two words of the input are looked up in the trigram table
  • If no match is found in the trigram table, the last word is looked up in the bigram table
  • As soon as one or more matches are found at any level, they are returned as the output
  • If no match is found at any level, the most common words are returned
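
The lookup cascade above can be sketched as follows. This is a minimal Python illustration, not the app's R code: `build_tables`, `predict`, and the parameter `k` are hypothetical names. It implements only the back-off order described in the bullets (four-gram, then trigram, then bigram, then most common words) and ranks candidates by raw counts; it does not compute the discounted probabilities of a full Katz back-off model.

```python
from collections import Counter, defaultdict


def build_tables(token_lists, max_n=4):
    """For n = 2..max_n, map each (n-1)-word context to counts of the next word."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    unigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables, unigrams


def predict(text, tables, unigrams, k=3):
    """Try the longest context first and back off to shorter ones."""
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):      # 4, then 3, then 2
        if len(tokens) < n - 1:
            continue                            # input too short for this table
        context = tuple(tokens[-(n - 1):])
        if context in tables[n]:
            return [w for w, _ in tables[n][context].most_common(k)]
    # No match at any level: fall back to the most common words.
    return [w for w, _ in unigrams.most_common(k)]


corpus = [
    "the cat sat on the mat".split(),
    "the cat sat down".split(),
    "a dog sat on the rug".split(),
]
tables, unigrams = build_tables(corpus)
print(predict("quickly sat on", tables, unigrams))  # -> ['the']
```

In the usage example, the context "quickly sat on" is not in the four-gram table, so the lookup backs off to the trigram context "sat on", which is followed by "the" in the toy corpus.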

Shiny App

The application is very simple to use: it has a single text input, and the suggestion (the predicted word) appears below the typed text.