Peer-graded Assignment: Final Project Submission
By Raklami Anas, January 8, 2025

Introduction

This document was created as part of the Peer-graded Assignment: Final Project Submission in the Data Product Capstone course. The goal of this exercise is to create a product that showcases the Natural Language Processing algorithm I built in previous assignments and to provide an interface that others can access. For this project, I have submitted:

  1. A Shiny app that takes a phrase (multiple words) typed into a text box and outputs a prediction of the next word.

  2. The slide deck you are reading right now, consisting of no more than 5 slides. This presentation pitches the algorithm and the app.

Shiny App

The app consists of two panels. In the first panel, you can enter text (one or more words) and click the “Submit” button. The predicted word then appears in the second panel.

The Shiny app is available at: https://xtmoes-raklami-anas.shinyapps.io/Cap_final/

Predictive Algorithm

The algorithm operates as follows:

Due to constraints in computational capacity and the diverse environments where the app may function, the algorithm processes a reduced sample of the US Twitter and news dataset. This sample is then cleaned and tokenized into word pairs for easier analysis.
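The cleaning and tokenizing step can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the app's actual code; the function names `clean` and `ngram_counts`, the cleaning rule (lowercase, keep letters and apostrophes), and the three sample lines are all assumptions. With `n = 2` it produces the word pairs described above, together with how often each one occurs.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase the text and keep only letters, apostrophes, and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()


def ngram_counts(lines, n):
    """Count every run of n consecutive words across the given lines."""
    counts = Counter()
    for line in lines:
        words = clean(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Hypothetical stand-in for the reduced Twitter/news sample.
sample = [
    "Thank you for the follow!",
    "Thank you for reading.",
    "Thank you so much",
]

pairs = ngram_counts(sample, 2)
top_pairs = pairs.most_common(2)  # most frequent word pairs first
```

Calling `ngram_counts(sample, 3)` or higher would give the longer word sequences needed when more than one preceding word is used for matching.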

Next, the tokenized sample is sorted in descending order of frequency within a data frame. The algorithm predicts the next word by taking the last four words typed by the user and comparing them to the first four words of each entry in the data frame. If no match is found, the algorithm progressively reduces the comparison to three, two, and finally one word until a match is identified.

In this way, the algorithm predicts the next word based on the most common entry in the sample containing the last word of the user’s input. This provides a basic yet effective word prediction mechanism.
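The fallback from four words down to one can be sketched as follows. This is an illustrative Python version under stated assumptions, not the app's code: the helpers `build_table` and `predict_next`, the dictionary-of-counters layout (used here in place of a sorted data frame), and the sample lines are all hypothetical.

```python
from collections import Counter


def build_table(lines, max_n=5):
    """Map each context of 1 to max_n - 1 words to a Counter of next words."""
    table = {}
    for line in lines:
        words = line.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                next_word = words[i + n - 1]
                table.setdefault(context, Counter())[next_word] += 1
    return table


def predict_next(table, phrase, max_context=4):
    """Try the last 4 words, then 3, 2, 1; return the most common next word."""
    words = phrase.lower().split()
    for k in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-k:])
        if context in table:
            return table[context].most_common(1)[0][0]
    return None  # no context of any length was seen in the sample


# Hypothetical stand-in for the cleaned sample.
lines = [
    "thank you for the follow",
    "thank you for the support",
    "thank you for the follow back",
    "see you for lunch",
]
table = build_table(lines)

four_word_match = predict_next(table, "I want to thank you for the")
backed_off_match = predict_next(table, "nice to see you")
no_match = predict_next(table, "zebra")
```

In the first call the full four-word context is found; in the second, the four-word and three-word contexts are missing, so the lookup falls back to the last two words. Returning `None` when nothing matches is a choice made for this sketch only.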

Final Considerations

As noted earlier, the app has both advantages and limitations. Prediction accuracy depends heavily on the size of the sample used. In this exercise, a reduced sample of the US Twitter and news data was used.

Future improvements could enhance prediction accuracy by increasing the sample size or utilizing a different dataset. However, since the current sample is small, the algorithm remains efficient enough to operate seamlessly on any device.

This approach prioritizes efficiency over accuracy. Future work could explore alternative algorithms capable of boosting prediction accuracy without compromising efficiency.

Thank you for reading and evaluating this work!