Peer-graded Assignment: Final Project Submission

Binh Nguyen Thanh
May 27, 2022

Introduction

This document was created as part of the Peer-graded Assignment: Final Project Submission in the Data Product Capstone course. The goal of this exercise is to create a product that highlights the Natural Language Processing algorithm I built in previous assignments and to provide an interface that others can access. For this project, I have submitted:

  1. A Shiny app that takes a phrase (multiple words) as input in a text box and outputs a prediction of the next word.
  2. The slide deck you are reading right now, consisting of no more than 5 slides. This presentation pitches the algorithm and the app.

Shiny App

The app consists of two panels. In the first panel, you can enter some text (one or more words) and click the “Submit” button. The predicted word then appears in the second panel.

You can try the Shiny app at the following link: https://hinbearth.shinyapps.io/WordPrediction/

Predictive Algorithm

The algorithm works in the following way.

  1. Because of limited computational capacity and the range of environments where the app may run, the algorithm uses a reduced sample of the US Twitter and news datasets.
  2. Then, the sample is cleaned and tokenized into pairs of words.
  3. Next, the entries are stored in a data frame, sorted in descending order of frequency.
  4. Afterwards, the algorithm takes the last 4 words typed and compares them to the first 4 words of the entries in the data frame. If there is no match, it retries with the last 3 words, then 2, then 1, until it finds a match.
  5. In this way, the algorithm predicts the next word from the most common entry in the sample that matches the end of the text typed. This is a basic prediction for the text typed.
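
The steps above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the code behind the Shiny app. The function names (tokenize, build_table, predict_next), the cleaning rule, and the toy sample are all assumptions. The sketch also assumes the table stores contexts of 1 to 4 words together with the word that follows them, which is one way to read steps 2 to 4.

```python
from collections import Counter
import re

MAX_CONTEXT = 4  # the steps above match on up to the last 4 words typed


def tokenize(text):
    """Assumed cleaning rule: lowercase, keep letters and apostrophes only."""
    return re.findall(r"[a-z']+", text.lower())


def build_table(lines, max_context=MAX_CONTEXT):
    """Count every (context, next_word) pair for contexts of 1..max_context words."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for n in range(1, max_context + 1):
            for i in range(len(words) - n):
                context = tuple(words[i:i + n])
                counts[(context, words[i + n])] += 1
    # Sort once in descending order of frequency, as in step 3.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


def predict_next(table, phrase, max_context=MAX_CONTEXT):
    """Try the longest context first, then shorter ones, until one matches."""
    words = tokenize(phrase)
    for n in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-n:])
        for (entry_context, next_word), _count in table:
            if entry_context == context:
                # The table is sorted, so the first hit is the most frequent.
                return next_word
    return None  # no context of any length matched


# Toy sample standing in for the reduced Twitter and news sample.
sample = [
    "thanks for the follow",
    "thanks for the follow back",
    "thanks for the support",
    "looking forward to the weekend",
]
table = build_table(sample)
print(predict_next(table, "Thanks for the"))  # follow
```

Scanning the sorted table from the top keeps the lookup simple, at the cost of a linear search per context length. A dictionary keyed by context would be faster for a larger sample.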

Final considerations

As noted earlier, the app has advantages but certainly some limitations, too.

The prediction depends entirely on the sample taken. In this exercise, the US Twitter and news data were used, and only a reduced sample of them.

Future exercises could improve the prediction by using a larger sample or a different dataset. However, because the sample taken is small, the algorithm is efficient enough to run without problems on any device.

Accuracy is sacrificed for efficiency. Future work could explore other algorithms that increase accuracy without reducing efficiency.

Thanks a lot for reading and grading!