ghaff24
2025-01-29
-The first step consisted of sampling our data. Specifically, I kept only 10% of the observations.
Then, using mainly the stringr and quanteda packages, as well as some tidyverse and tidytext, we separated each line into unigrams (individual words), bigrams (pairs of words that follow each other), trigrams and quadgrams.
As expected, some n-grams are more common than others. For example, a quadgram saying “thanks for the memories” is far more common on Twitter than, say, “thanks for the ostrich”.
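The sampling and n-gram counting described above can be sketched roughly as follows. This is a minimal Python illustration, not the app's actual code (which relies on R packages such as stringr and quanteda); the function names `sample_lines`, `tokenize` and `count_ngrams`, the seed, and the toy corpus are all assumptions made for the example.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.10, seed=42):
    """Keep each line with probability `fraction` (about 10% by default)."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def tokenize(line):
    """Lowercase a line and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())


def count_ngrams(lines, n):
    """Count every run of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Toy corpus, invented for illustration only.
corpus = [
    "Thanks for the memories",
    "thanks for the memories, everyone!",
    "Thanks for the ostrich",
]
quadgrams = count_ngrams(corpus, 4)
bigrams = count_ngrams(corpus, 2)
```

In this toy corpus the "memories" quadgram is counted more often than the "ostrich" one, which is the kind of frequency difference the prediction step relies on.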
The app works in a very simple way:
First, it takes the phrase and cleans it of symbols, upper-case letters, etc.
Then, depending on the length of the phrase, it takes up to the last 3 words as input, tries to match them with the first three words of a quadgram, and outputs the fourth word as a suggestion.
If no match is found, it drops the first word from the input and tries again with a lower-order n-gram.
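The lookup steps above amount to a simple back-off search. The sketch below shows one way it could look in Python; it is not the app's R code. The table `NGRAM_TABLES` and its entries are hypothetical, and it assumes each context has already been reduced to its single most frequent next word.

```python
import re

# Hypothetical lookup tables, keyed by context length:
# each maps a tuple of context words to the most common next word.
NGRAM_TABLES = {
    3: {("thanks", "for", "the"): "memories"},  # built from quadgrams
    2: {("for", "the"): "first"},               # built from trigrams
    1: {("the",): "same"},                      # built from bigrams
}


def clean(phrase):
    """Lowercase the phrase and strip symbols, returning a list of words."""
    return re.findall(r"[a-z']+", phrase.lower())


def predict_next(phrase, tables=NGRAM_TABLES, max_context=3):
    """Suggest a next word, backing off to shorter contexts when needed."""
    words = clean(phrase)
    context = tuple(words[-max_context:])  # up to the last 3 words
    while context:
        match = tables.get(len(context), {}).get(context)
        if match is not None:
            return match
        context = context[1:]  # drop the oldest word and try a lower-order table
    return None  # nothing matched at any level
```

With these toy tables, `predict_next("Thanks for THE!!")` matches at the quadgram level, while `predict_next("waiting for the")` finds nothing there and falls back to the trigram table.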
Really simple, and when in use it takes less than 200 MB of memory, which could be cut to less than half that with a smaller sample of the data.