Final Capstone Project - typeR (Text Prediction)

Daniel Amaral
29/08/2020

Introduction

This is the Capstone project for Coursera's Data Science Specialization. The goal is to build an interactive product (a Shiny app) that uses a natural language processing (NLP) model to predict the next word or words a user is most likely to type.

The following slides cover:

The dataset used
The models and methods used for prediction
The Shiny app, including instructions on how to use it

The Dataset

The training dataset comes from the Capstone SwiftKey data
10% of the US blogs, news, and Twitter data is used to generate the predictions
Basic cleanup collapses repeated whitespace, removes symbols and punctuation, and converts the text to lowercase
The cleaned dataset is tokenized and split into 4-gram, 3-gram, and 2-gram tables
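
The cleanup and tokenization steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the app (which was built in R with Shiny). The function names `clean` and `ngram_counts` and the two sample sentences are hypothetical, and the exact cleaning rules are an assumption: here every character that is not a letter or whitespace is replaced by a space.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, drop symbols/punctuation/digits, collapse whitespace."""
    text = text.lower()
    # Assumption: anything that is not a-z or whitespace becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def ngram_counts(lines, n):
    """Count every run of n consecutive tokens across the cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = clean(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical stand-in for the sampled blogs/news/Twitter text.
sample = ["Thanks for the  follow!", "Thanks for the update."]

# One frequency table per n-gram order, as in the slide: 4, 3 and 2.
tables = {n: ngram_counts(sample, n) for n in (4, 3, 2)}
```

Each table maps a tuple of words to how often that sequence appeared, which is the information the prediction step needs in order to rank candidate next words.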

The Algorithm

The next word is predicted using an approach based on Katz's back-off model
First, the last three words of the user's input are looked up in the four-gram table
If no match is found, the last two words are looked up in the trigram table
If no match is found in the trigram table, the last word is looked up in the bigram table
As soon as one or more matches are found at any level, they are returned as the output
If no match is found at any level, the most common words are suggested instead
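
The lookup cascade described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the app's R implementation: `build_tables`, `predict`, the parameter `k`, and the tiny corpus are all hypothetical. It also ranks candidates by raw counts only, so it shows the back-off order of lookups but omits the discounting that a full Katz back-off model applies.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    unigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables, unigrams

def predict(text, tables, unigrams, k=3):
    """Try the longest context first, then back off to shorter ones."""
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):      # 4-gram, 3-gram, 2-gram
        if len(tokens) < n - 1:
            continue                            # input too short for this order
        context = tuple(tokens[-(n - 1):])
        followers = tables[n].get(context)
        if followers:                           # match found: stop backing off
            return [word for word, _ in followers.most_common(k)]
    # No match at any level: fall back to the most common words overall.
    return [word for word, _ in unigrams.most_common(k)]

# Hypothetical, already-cleaned training text.
corpus = [
    "thanks for the follow",
    "thanks for the update",
    "thanks for the follow",
    "see you in the morning",
]
tables, unigrams = build_tables(corpus)

four_gram_hit = predict("thanks for the", tables, unigrams)
backed_off = predict("look at the", tables, unigrams)
fallback = predict("hello", tables, unigrams)
```

In this sketch, "thanks for the" is answered directly from the four-gram table, "look at the" only matches at the bigram level (on "the"), and "hello" matches nothing, so the overall most frequent words are returned.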

Shiny App

The application is simple to use: type into the single text input, and the suggested (predicted) word appears below it.

Link to the App