3/7/2020

Overview

The goal of this exercise is to create a product to highlight the prediction algorithm that we have built and to provide an interface that can be accessed by others. For this project we must submit:

  • A Shiny app that takes a phrase (multiple words) from a text box and outputs a prediction of the next word.

  • This slide deck, consisting of no more than 5 slides created with RStudio Presenter.

Data Preparation

  • We loaded three datasets containing sentences taken from different sources: Blogs, News and Twitter.

  • We merged them into a single corpus.

  • We cleaned the data by converting it to lowercase, deleting numbers and stopwords, removing punctuation, and stripping extra whitespace.

  • We built the n-grams: bigrams, trigrams, quadgrams.

  • We sorted the n-grams in descending order of frequency.

  • We saved the sorted n-grams so that the app can load them.
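
The preparation steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code used for the app: the function names (`clean`, `sorted_ngrams`), the tiny `STOPWORDS` set, and the three sample sentences are all assumptions made for the example, and sorting by frequency is our reading of "descending order".

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one is much longer.
STOPWORDS = {"a", "an", "the", "of", "to", "and", "in", "is"}


def clean(text):
    """Lowercase, drop numbers and punctuation, remove stopwords,
    and collapse extra whitespace. Returns a list of tokens."""
    text = text.lower()
    text = text.replace("'", "")            # keep contractions in one piece
    text = re.sub(r"[0-9]+", " ", text)     # delete numbers
    text = re.sub(r"[^a-z\s]", " ", text)   # remove punctuation
    # split() with no argument also strips repeated whitespace
    return [w for w in text.split() if w not in STOPWORDS]


def sorted_ngrams(sentences, n):
    """Count every n-gram in the sentences and return (ngram, count)
    pairs sorted by count, highest first."""
    counts = Counter()
    for sentence in sentences:
        tokens = clean(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common()


# Assumed stand-in for the merged Blogs, News and Twitter corpus.
corpus = [
    "I love New York 2020!",
    "I love New Jersey.",
    "They love New York",
]
bigrams = sorted_ngrams(corpus, 2)
trigrams = sorted_ngrams(corpus, 3)
quadgrams = sorted_ngrams(corpus, 4)
```

The same function builds all three tables, only `n` changes. Saving the results is left out here because the storage format is not described in this deck.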

Prediction Algorithm

  • The Shiny app loads the n-grams and uses them to make its predictions.

  • For this prediction, we use a version of Katz’s back-off algorithm: we look for a match in the quadgrams first, then fall back to the trigrams, then to the bigrams, and finally to the common word “the” if nothing matches.
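
The back-off order described above can be sketched as a chain of lookups. This is a minimal Python illustration under stated assumptions, not the app's actual code: `predict_next` and the three toy lookup tables are hypothetical, and each table is assumed to map a context (the last one, two or three words) to its most frequent next word. Full Katz back-off also discounts counts and redistributes probability mass; this sketch covers only the fallback order.

```python
def predict_next(phrase, quadgrams, trigrams, bigrams, default="the"):
    """Return the predicted next word for a phrase, trying the longest
    context first and backing off to shorter ones."""
    words = phrase.lower().split()
    # (context length, table): quadgrams use the last 3 words,
    # trigrams the last 2, bigrams the last 1.
    for n, table in ((3, quadgrams), (2, trigrams), (1, bigrams)):
        if len(words) >= n:
            context = tuple(words[-n:])
            if context in table:
                return table[context]
    # Nothing matched at any level: fall back to the common word.
    return default


# Toy tables, assumed for illustration only.
quad = {("see", "you", "next"): "week"}
tri = {("you", "next"): "time"}
bi = {("next",): "year"}

print(predict_next("See you next", quad, tri, bi))
print(predict_next("thank you next", quad, tri, bi))
print(predict_next("what comes next", quad, tri, bi))
print(predict_next("something unseen", quad, tri, bi))
```

The four calls exercise the quadgram match, the trigram fallback, the bigram fallback, and the default word, in that order.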

Accuracy

The accuracy was calculated to be 73%.