Synopsis

This project was created for the Developing Data Products course as part of the Data Science Specialization offered through Coursera from Johns Hopkins University.
The source code files for this project can be found on GitHub:
https://github.com/ghaff24/capstone/tree/main/final

Course Project

The course project is a two-part peer-graded assignment: create a Shiny application and deploy it on RStudio’s servers, then use Slidify or RStudio Presenter to prepare a reproducible pitch presentation about the application.
The Shiny application can be found here: https://ghaff24.shinyapps.io/final24/

The first step was to sample the data set: only 10% of the observations were kept.
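As a rough illustration of the sampling step, here is a minimal sketch in Python (the project itself is written in R, so this is not the author's code). The function name `sample_lines`, the seed value, and the stand-in `corpus` are all assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.10, seed=24):
    """Return a reproducible random subset holding about `fraction` of `lines`."""
    rng = random.Random(seed)          # fixed seed (arbitrary) so the sample is repeatable
    k = round(len(lines) * fraction)   # number of lines to keep
    return rng.sample(lines, k)        # sampling without replacement

# Stand-in for the real text lines (hypothetical data).
corpus = [f"line {i}" for i in range(1000)]
subset = sample_lines(corpus)
print(len(subset))
```

Fixing the seed is a design choice that keeps the n-gram tables identical between runs; a different seed would give a different but equally valid 10% sample.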

Then, using mainly the stringr and quanteda packages, as well as some tidyverse and tidytext, we split each line into unigrams (individual words), bigrams (pairs of consecutive words), trigrams (three consecutive words) and quadgrams (four consecutive words).
As expected, some n-grams are more common than others. For example, the quadgram “thanks for the memories” is far more common on Twitter than, say, “thanks for the ostrich”.
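To make the n-gram idea concrete, the following sketch counts quadgrams over a few toy lines. It is written in plain Python for illustration and does not reproduce the quanteda-based pipeline; the helper names `ngrams` and `count_ngrams` and the sample sentences are assumptions, and tokenization here is a simple whitespace split.

```python
from collections import Counter

def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def count_ngrams(lines, n):
    """Count how often each n-gram occurs across all lines."""
    counts = Counter()
    for line in lines:
        # Naive tokenization: lower-case, then split on whitespace.
        counts.update(ngrams(line.lower().split(), n))
    return counts

# Toy data (hypothetical), echoing the example in the text.
toy_lines = [
    "thanks for the memories",
    "thanks for the memories my friend",
    "thanks for the ostrich",
]
quadgram_counts = count_ngrams(toy_lines, 4)
for gram, freq in quadgram_counts.most_common(2):
    print(" ".join(gram), freq)
```

Lines shorter than n words simply contribute no n-grams, since the index range is empty.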

The app works in a very simple way.
First, it takes the phrase and cleans it (removing symbols, converting upper-case letters to lower case, etc.).
Then, depending on the length of the phrase, it takes up to the last three words as input, tries to match them against the first three words of a quadgram, and outputs the fourth word as a suggestion.
If no match is found, it drops one word from the input and tries again with a lower-order n-gram.
Really simple, and when in use it takes less than 200 MB of memory, which could be cut to less than half with a smaller sample of the data.
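The clean, match and back-off procedure described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the app's R code: the lookup tables in `TABLES` are made-up toy data, the names `clean` and `predict` are hypothetical, and the sketch assumes that backing off means dropping the earliest word of the context.

```python
import re

# Hypothetical lookup tables: context length -> {context words: suggested next word}.
# In a real app these would be built from the n-gram counts.
TABLES = {
    3: {("thanks", "for", "the"): "memories"},   # from quadgrams
    2: {("for", "the"): "first"},                # from trigrams
    1: {("the",): "same"},                       # from bigrams
}

def clean(phrase):
    """Lower-case the phrase, replace symbols and digits with spaces, split into words."""
    phrase = phrase.lower()
    phrase = re.sub(r"[^a-z\s']", " ", phrase)   # keep letters, whitespace, apostrophes
    return phrase.split()

def predict(phrase, tables=TABLES):
    """Suggest the next word, backing off to shorter contexts when no match is found."""
    context = tuple(clean(phrase)[-3:])          # at most the last three words
    while context:
        match = tables.get(len(context), {}).get(context)
        if match is not None:
            return match
        context = context[1:]                    # assumption: drop the earliest word
    return None                                  # nothing matched at any level

print(predict("Thanks for THE!!"))
print(predict("waiting for the"))
```

Keeping one table per context length makes each lookup a single dictionary access, which is one plausible way to keep the response time and memory footprint small.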