09/11/2021

Introduction (Project Overview)

This presentation introduces my English text prediction model, built for the final project of the Coursera Data Science Capstone. It highlights the prediction algorithm that I have built and provides an interface that others can access.

Work that I have done in this project:

  • Developed a prediction algorithm that predicts the next word of a phrase entered by the user
  • Embedded this algorithm in a Shiny web app
  • Created a slide deck introducing this product

N-Gram Linguistics Model

An N-gram linguistics model was used to create this prediction algorithm. The N-gram library was built in the following steps:

  1. Sub-sampling: Extract ~1% of the words in each text file as the corpus for building the N-gram library.
  2. Cleaning the corpus: Remove stopwords, symbols, punctuation, numbers and profanity, then convert all text to lowercase.
  3. Tokenizing: Split the cleaned text into word tokens.
  4. Building the N-gram model: Generate unigrams, bigrams, trigrams and quadgrams.
  5. Sorting: Arrange the N-grams of each library into a frequency matrix.
  6. Saving: Convert the N-gram matrices into data tables and save the metadata as .rds files.
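
The steps above can be sketched roughly as follows. This is an illustrative Python outline only, not the code behind the app (which is a Shiny app and was presumably written in R). The function names, the tiny stopword list, and the fixed random seed are all assumptions made for the example; profanity filtering and saving to .rds are left out.

```python
import random
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "of", "to", "and"}

def sample_lines(lines, fraction=0.01, seed=42):
    """Keep a random subset of lines (assumed stand-in for the ~1% sub-sample)."""
    rng = random.Random(seed)
    k = max(1, int(len(lines) * fraction))
    return rng.sample(list(lines), k)

def clean_and_tokenize(text):
    """Lowercase, replace anything that is not a-z or whitespace, drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in text.split() if w not in STOPWORDS]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_tables(lines, max_n=4):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean_and_tokenize(line)
        for n in tables:
            tables[n].update(ngram_counts(tokens, n))
    return tables
```

Calling `build_tables(sample_lines(lines))` would give one frequency table per N-gram order; `tables[2].most_common(10)` then lists the ten most frequent bigrams.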

Prediction algorithm

This prediction algorithm was built using the Katz Back-off model.

  1. Load the .rds files containing the metadata

  2. Read the text input

  3. Predict the next word, starting with the quadgram library

  4. If no prediction is found, back off to the trigram, then the bigram and then the unigram library.

  5. If no prediction is found after going through all N-gram libraries, return NA.
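
The back-off lookup described above might look like the sketch below. It is a simplified Python illustration under stated assumptions: `predict_next` is a hypothetical name, `tables` is assumed to map each order n to a dictionary of N-gram tuples and their counts, and `None` stands in for NA. It always picks the most frequent continuation and does not apply the probability discounting of a full Katz back-off model.

```python
from collections import Counter

def predict_next(text, tables, max_n=4):
    """Return the most frequent next word, backing off from quadgram to unigram."""
    # Assumption: input is only lowercased and split here; a real app would
    # clean it the same way the corpus was cleaned.
    tokens = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue  # not enough words typed to use this order
        context = tuple(tokens[-(n - 1):])
        # Collect every n-gram whose first n-1 words match the context.
        candidates = {
            gram[-1]: count
            for gram, count in tables.get(n, {}).items()
            if gram[:-1] == context
        }
        if candidates:
            return max(candidates, key=candidates.get)
    # Last resort: the single most frequent unigram, if any exist.
    unigrams = tables.get(1)
    if unigrams:
        return max(unigrams, key=unigrams.get)[0]
    return None  # plays the role of NA

# Toy tables, invented purely for illustration.
toy_tables = {
    1: Counter({("data",): 5, ("science",): 3}),
    2: Counter({("data", "science"): 3, ("data", "table"): 1}),
    3: Counter({("love", "data", "science"): 2}),
    4: Counter({("i", "love", "data", "mining"): 1}),
}
```

With these toy tables, `predict_next("I love data", toy_tables)` matches at the quadgram level, while `predict_next("big data", toy_tables)` finds nothing until it backs off to the bigram table.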

Shiny Web App

  • Click HERE to access the web app

Follow these steps to run the algorithm:

  1. Wait a few seconds for the app to start

  2. Insert some text in the text box

  3. Click the “Show me what you’ve got!” button to see the prediction!