DrAmericasBoo'sPath
September 22, 2021
The data sample was created from the HC Corpora data. As part of cleaning/pre-processing, the text was converted to all-lowercase, and all non-text characters such as punctuation marks, extra whitespace, numbers, URLs etc. were removed.
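The cleaning step might look roughly like the sketch below. It is only an illustration: the function name `clean_text`, the regular expressions, and the choice to keep apostrophes inside words are assumptions, not the project's actual code.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # crude URL matcher
NON_TEXT_RE = re.compile(r"[^a-z'\s]")           # anything but letters, apostrophes, whitespace
SPACE_RE = re.compile(r"\s+")                    # runs of whitespace


def clean_text(text: str) -> str:
    """Lowercase the text, then drop URLs, digits and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove URLs before punctuation is stripped
    text = NON_TEXT_RE.sub(" ", text)   # replace non-text characters with a space
    return SPACE_RE.sub(" ", text).strip()  # collapse extra whitespace


if __name__ == "__main__":
    print(clean_text("Visit https://example.com NOW!!  It's 2021."))
```

URLs are removed first because stripping punctuation beforehand would break them into ordinary-looking word fragments.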
This cleaned data sample was then tokenized into n-grams. An n-gram is a contiguous sequence of n items from a sequence of text or speech. The n-grams of interest here are the bi-, tri- and quad-grams.
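As a minimal sketch of this tokenization, assuming the cleaned text is split on whitespace into word tokens (the helper names `ngrams` and `count_ngrams` are hypothetical):

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, n):
    """Count how often each n-gram occurs in a cleaned text."""
    return Counter(ngrams(text.split(), n))


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    for n in (2, 3, 4):  # bi-, tri- and quad-grams
        print(n, count_ngrams(sample, n).most_common(2))
```

A text shorter than n words simply yields no n-grams.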
A prediction model is built from the n-grams: a Simple Good-Turing (SGT) probability model is computed from the n-gram frequencies.
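The idea behind Good-Turing estimation can be sketched as follows. Note that this is the basic Turing estimate, not the full Simple Good-Turing method: SGT additionally smooths the frequency-of-frequency counts with a log-linear fit, which is omitted here. The function name `good_turing` and the fallback to the raw count when no n-gram occurs r + 1 times are assumptions made for illustration.

```python
from collections import Counter


def good_turing(counts):
    """Basic (unsmoothed) Good-Turing adjustment.

    counts maps each n-gram to its observed frequency r.
    Returns (adjusted_counts, unseen_mass), where
    adjusted r* = (r + 1) * N_{r+1} / N_r and
    unseen_mass = N_1 / N is the probability reserved for unseen n-grams.
    """
    total = sum(counts.values())              # N: total n-gram occurrences
    freq_of_freq = Counter(counts.values())   # N_r: number of n-grams seen r times
    adjusted = {}
    for gram, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        # Assumption: keep the raw count when N_{r+1} is zero;
        # SGT avoids this gap by smoothing N_r instead.
        adjusted[gram] = (r + 1) * n_r1 / n_r if n_r1 else float(r)
    unseen_mass = freq_of_freq.get(1, 0) / total if total else 0.0
    return adjusted, unseen_mass


if __name__ == "__main__":
    toy = Counter({"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3})
    adjusted, unseen = good_turing(toy)
    print(adjusted, unseen)
```

The adjusted counts discount rarely seen n-grams so that some probability mass is left over for n-grams that never appeared in the sample.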
The prediction is reasonable, but may not be the best. Natural language processing is a big problem in computing, and in an individual project like this one can only do so much.