Data Science Capstone Final Project Presentation

09/11/2021

Introduction (Project Overview)

This presentation introduces my English text prediction model, built as the final project of the Coursera Data Science Capstone. The aim of the project is to create a product that highlights the prediction algorithm I have built and to provide an interface that others can access.

Work completed in this project:

Developed a prediction algorithm that predicts the next word of a phrase entered by the user
Embedded this algorithm in a Shiny Web App
Created a slide deck introducing this product

N-Gram Linguistics Model

An N-Gram Linguistics Model was used to create this prediction algorithm. The steps for building the N-gram library were as follows:

Sub-sampling: Extract ~1% of the words in each text file as the corpus for building the N-gram library.
Cleaning the corpus: Remove stopwords, symbols, punctuation, numbers and profanity, then convert all text to lowercase.
Tokenizing: Split the cleaned text into tokens.
Building the N-gram model: Generate unigrams, bigrams, trigrams and quadgrams.
Sorting: Sort the corpus of each N-gram library into a frequency matrix.
Saving: Convert the N-gram matrices into data tables and save the metadata as .rds files.
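
The steps above can be sketched in a few lines. This is a minimal Python illustration, not the R code used in the project. The stopword and profanity lists, the function names and the toy corpus are all hypothetical, and sampling is done by line rather than by word as a simplifying assumption. Saving the tables to disk is left out.

```python
import random
import re
from collections import Counter

# Hypothetical, tiny stand-ins for real stopword and profanity lists.
STOPWORDS = {"the", "a", "an", "of", "to"}
PROFANITY = {"darn"}


def sample_lines(lines, fraction=0.01, seed=42):
    """Keep roughly `fraction` of the lines (sampling by line is an assumption)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean_and_tokenize(text):
    """Lowercase, replace non-letters with spaces, split, drop listed words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in text.split() if w not in STOPWORDS and w not in PROFANITY]


def ngram_counts(token_lists, n):
    """Count n-grams of order n, returned from most to least frequent."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common()


# Toy corpus; fraction=1.0 keeps every line so the example stays deterministic.
corpus = ["The cat sat on the mat!", "A cat sat on 2 mats.", "The dog sat on the darn mat"]
sampled = sample_lines(corpus, fraction=1.0)
tokens = [clean_and_tokenize(line) for line in sampled]

# One frequency table per order: unigram (1) up to quadgram (4).
tables = {n: ngram_counts(tokens, n) for n in (1, 2, 3, 4)}
```

Each entry of `tables[n]` is a pair of an n-gram tuple and its count, which plays the role of one row in the frequency matrix described above.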

Prediction algorithm

This prediction algorithm was built using the Katz Back-off model.

Load the .rds files containing the metadata
Read the text input
Predict the next word, starting with the quadgram table
If no prediction is found, back off to the trigram, then the bigram and then the unigram table.
If no prediction is found after going through all the N-gram tables, return NA.
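
The back-off order above can be illustrated with a short Python sketch. It is not the project's R code: the table contents and the name `predict_next` are made up for the example, and `None` stands in for NA. It also picks the most frequent match at each level without the discounting and back-off weights that a full Katz back-off model applies, so it should be read as a simplified illustration of the lookup order only.

```python
from collections import Counter

# Toy frequency tables keyed by n-gram order. The counts are invented;
# in the project the tables would be loaded from the saved files.
NGRAMS = {
    4: Counter({("thanks", "so", "much", "for"): 3}),
    3: Counter({("so", "much", "for"): 4, ("so", "much", "fun"): 6}),
    2: Counter({("much", "for"): 5, ("much", "fun"): 7, ("happy", "birthday"): 9}),
    1: Counter({("day",): 10, ("time",): 8}),
}


def predict_next(text, ngrams=NGRAMS):
    """Return the most frequent next word, backing off from quadgram to unigram."""
    words = text.lower().split()
    for n in (4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough input words for this order
        # Keep only n-grams whose first n-1 words match the end of the input.
        candidates = {g: c for g, c in ngrams[n].items() if g[:-1] == context}
        if candidates:
            return max(candidates, key=candidates.get)[-1]
    if ngrams[1]:
        # Last resort: the most frequent single word.
        return ngrams[1].most_common(1)[0][0][0]
    return None  # nothing matched at any level


print(predict_next("thanks so much"))   # matched in the quadgram table
print(predict_next("it was so much"))   # backs off to the trigram table
print(predict_next("happy"))            # only the bigram table applies
```

With a non-empty unigram table the function always returns a word, so the `None` case is only reached when every table fails to produce a match.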

Shiny Web App

Click HERE to access the web app

Follow these steps to run the algorithm:

Wait a few seconds for the app to start
Insert some text in the text box
Click the “Show me what you’ve got!” box to see the prediction!