Capstone Project
2024-11-09
My goal was to create a simple Shiny web app that lets the user enter a sequence of words and outputs the three most likely next words.
The foundation of my text prediction app is an n-gram model that relies on:
the frequencies of n-grams, that is, sequences of n words that appear together in the corpus. I used single words, word pairs, triplets, and four-word sequences that appeared at least 50 times.
the conditional probability of each candidate next word given the preceding words, estimated with modified Kneser-Ney smoothing, to generate the prediction.
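To make the two ingredients above concrete, here is a minimal base R sketch on a toy corpus. It counts bigrams and trigrams and computes an unsmoothed maximum-likelihood conditional probability. This is an illustration only, not the app's code: the corpus and the helper names make_ngrams, count_ngrams and cond_prob are hypothetical, and the estimate deliberately omits the modified Kneser-Ney smoothing used in the app.

```r
# Toy corpus (hypothetical); the real app was trained on a much larger one.
corpus <- c("i want to go home", "i want to eat", "i want to go out")

# Return all n-grams of order n found in one sentence.
make_ngrams <- function(sentence, n) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count how often each n-gram of order n occurs in the whole corpus.
count_ngrams <- function(corpus, n) {
  table(unlist(lapply(corpus, make_ngrams, n = n)))
}

bigrams  <- count_ngrams(corpus, 2)
trigrams <- count_ngrams(corpus, 3)

# Unsmoothed estimate for a two-word context:
# P(word | context) = count(context + word) / count(context)
cond_prob <- function(context, word) {
  full <- paste(context, word)
  if (!full %in% names(trigrams)) return(0)
  as.numeric(trigrams[full]) / as.numeric(bigrams[context])
}

cond_prob("want to", "go")   # 2 of the 3 "want to" occurrences continue with "go"
```

An unsmoothed estimate like this assigns zero probability to any continuation that never appeared in the corpus, which is the problem smoothing is meant to address.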
The original model had a little over 3 million rows and took up about 300 MB, so I filtered out all n-grams with frequency counts below 50. This trimmed the n-gram model down to a little over 1 million rows and about 100 MB and made it run faster without sacrificing accuracy (~20%).
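The pruning step can be sketched as a single filter on the count column. The table below is made up for illustration, and the column names ngram and count are assumptions, not the app's actual schema; only the threshold of 50 comes from the text.

```r
library(data.table)

# Hypothetical n-gram table: one row per n-gram with its corpus count.
ngrams <- data.table(
  ngram = c("of the", "in the", "thank you so", "a quick brown fox"),
  count = c(5200L, 4100L, 75L, 3L)
)

min_count <- 50L  # threshold described in the text

# Keep only n-grams seen at least min_count times.
pruned <- ngrams[count >= min_count]

nrow(ngrams)                               # rows before pruning
nrow(pruned)                               # rows after pruning
format(object.size(pruned), units = "MB")  # approximate in-memory size
```

Comparing nrow() and object.size() before and after the filter is one way to check how much a given threshold shrinks the model.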
I used data.table() to achieve faster performance.
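One way data.table can speed up prediction is a keyed lookup, which uses binary search instead of scanning every row. The sketch below assumes a hypothetical table layout (context, word, prob) with made-up probabilities, and a hypothetical predict_next function that falls back to shorter contexts when the longer one is not found. It is a simplified lookup, not the app's actual implementation or its smoothing.

```r
library(data.table)

# Hypothetical model table: preceding words, candidate next word, and a
# precomputed probability. All values are invented for illustration.
model <- data.table(
  context = c("want to", "want to", "want to", "want to", "to", "to", ""),
  word    = c("go", "be", "see", "eat", "the", "be", "the"),
  prob    = c(0.40, 0.25, 0.15, 0.05, 0.30, 0.20, 0.05)
)
setkey(model, context)  # sort by context so lookups use binary search

predict_next <- function(input, n = 3L) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  # Try the longest context first (up to two words here), then shorter ones.
  for (k in min(length(words), 2L):0L) {
    ctx  <- paste(tail(words, k), collapse = " ")
    hits <- model[.(ctx), nomatch = NULL]  # keyed lookup on context
    if (nrow(hits) > 0L) {
      # Return the n most probable candidate words for this context.
      return(head(hits[order(-prob), word], n))
    }
  }
  character(0)
}

predict_next("I want to")
predict_next("go to")
```

In this sketch the empty-string context acts as the unigram fallback, so the function still returns a suggestion when none of the typed words match a stored context.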