- Data is loaded, cleaned, and separated into training/testing sets.
- Training set is tokenized into 1-, 2-, and 3-grams.
- For each n-gram, its frequency is recorded along with the word most likely to follow it (a counting sketch appears after this list).
- Prediction uses a modified back-off method, backing off to a shorter n-gram when the ratio of available evidence for the shorter n-gram (relative to the longer one) is sufficiently large; the threshold ratio is tuned on the testing set (see the prediction sketch below).
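
A minimal sketch of the tokenization and counting steps, assuming a simple whitespace tokenizer and in-memory dictionaries; the function and variable names are illustrative, not the project's actual code:

```python
from collections import Counter, defaultdict

def build_ngram_tables(sentences, max_n=3):
    """Count 1-, 2-, and 3-gram frequencies and, for each (n-1)-gram
    prefix, record which next word was observed most often."""
    ngram_counts = {n: Counter() for n in range(1, max_n + 1)}
    next_word_counts = {n: defaultdict(Counter) for n in range(1, max_n)}

    for sentence in sentences:
        tokens = sentence.lower().split()   # whitespace tokenizer (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                ngram_counts[n][gram] += 1
                if n > 1:
                    prefix, nxt = gram[:-1], gram[-1]
                    next_word_counts[n - 1][prefix][nxt] += 1

    # Most likely next word for each observed prefix
    most_likely_next = {
        n: {prefix: counts.most_common(1)[0][0]
            for prefix, counts in table.items()}
        for n, table in next_word_counts.items()
    }
    return ngram_counts, most_likely_next
```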
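
A sketch of the back-off rule, assuming the tables from the previous sketch; `backoff_ratio` stands in for the threshold tuned on the testing set, and the exact ratio definition is an assumption about the method described above:

```python
def predict_next_word(tokens, ngram_counts, most_likely_next,
                      backoff_ratio=5.0, max_n=3):
    """Predict the next word from the last (max_n - 1) tokens, backing off
    to a shorter context when the shorter n-gram carries sufficiently more
    evidence than the longer one."""
    tokens = [t.lower() for t in tokens]
    for n in range(max_n - 1, 0, -1):          # longest prefix first: 2 words, then 1
        prefix = tuple(tokens[-n:])
        if len(prefix) < n or prefix not in most_likely_next[n]:
            continue
        long_evidence = ngram_counts[n][prefix]          # counts for the longer context
        if n > 1:
            short_evidence = ngram_counts[n - 1][prefix[1:]]  # counts for the shorter context
            # Back off when the shorter context has disproportionately more evidence
            if short_evidence / max(long_evidence, 1) >= backoff_ratio:
                continue
        return most_likely_next[n][prefix]
    return None                                 # no prediction for entirely unseen contexts
```

For example, `predict_next_word(["a", "case", "of"], ngram_counts, most_likely_next)` would first try the 2-word prefix `("case", "of")` and fall back to `("of",)` only if the back-off condition triggers or the longer prefix was never observed.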