Next Word Prediction

Capstone Project

Ivo Pinheiro

2024-11-09

Background

  • My goal was to create a simple Shiny web app that takes a sequence of words as input and outputs the three most likely next words.

  • The foundation of my text prediction app is an n-gram model that relies on:

    1. the frequencies of n-grams, that is, sequences of n words that appear together in the corpus. In this case, I used single words, word pairs, triplets and sequences of four words that appeared at least 50 times.

    2. the conditional probability of each n-gram, smoothed with modified Kneser-Ney smoothing, to generate the prediction.
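
To make the two ingredients above concrete, here is a minimal sketch in base R. It assumes a made-up three-sentence corpus and uses a plain, unsmoothed ratio of counts rather than modified Kneser-Ney smoothing; `make_ngrams` and `cond_prob` are hypothetical helper names, not functions from the app.

```r
# Toy corpus, invented purely for illustration
corpus <- c("i like green tea", "i like black coffee", "i like green apples")

# Split each sentence into lowercase tokens
tokens <- strsplit(tolower(corpus), "\\s+")

# Build all n-grams of size n from one vector of words
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency tables for single words and word pairs
unigram_counts <- table(unlist(lapply(tokens, make_ngrams, n = 1)))
bigram_counts  <- table(unlist(lapply(tokens, make_ngrams, n = 2)))

# Unsmoothed estimate: P(next | previous) = count(previous next) / count(previous)
cond_prob <- function(previous, nxt) {
  bigram <- paste(previous, nxt)
  if (!(bigram %in% names(bigram_counts))) return(0)
  as.numeric(bigram_counts[[bigram]]) / as.numeric(unigram_counts[[previous]])
}

cond_prob("like", "green")  # "like green" occurs 2 times, "like" 3 times
```

In this unsmoothed version any unseen word pair gets probability 0, which is the gap that a smoothing method such as Kneser-Ney is meant to close.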

What about the user?

  • The app updates in real time, so the user doesn’t need to write a sentence and then click submit to generate predictions. As the user types a sequence of English words, the app should generate predictions automatically.
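
As a rough illustration of how this kind of automatic updating can be wired up in Shiny, the sketch below ties a text input directly to a reactive output, with no submit button. The input and output ids (`phrase`, `predictions`) and the `predict_next` function are assumptions made for this example; `predict_next` is only a placeholder, not the app's actual model.

```r
library(shiny)

# Hypothetical placeholder for the real model lookup:
# returns nothing for empty input, otherwise three fixed words
predict_next <- function(text) {
  if (!nzchar(trimws(text))) return(character(0))
  c("the", "to", "and")  # placeholder output, not real predictions
}

ui <- fluidPage(
  textInput("phrase", "Type a phrase in English:"),
  textOutput("predictions")
)

server <- function(input, output, session) {
  # renderText re-runs whenever input$phrase changes, i.e. as the user types
  output$predictions <- renderText({
    paste(predict_next(input$phrase), collapse = "   ")
  })
}

# shinyApp(ui, server)  # uncomment to launch the app locally
```

Because the output depends on `input$phrase`, Shiny recomputes it on every change to the text box, which is what removes the need for a submit button.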

Notes on performance

  • The original model had a little over 3 million rows and took up 300 MB, so I filtered out all n-grams with a frequency count below 50. This trimmed the n-gram model down to a little over 1 million rows and 100 MB and made it run faster without sacrificing accuracy (~20%).

  • I used the data.table package to achieve faster performance.
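
The frequency cutoff and the data.table lookup described above could look roughly like the sketch below. The table contents and the column names (`context`, `prediction`, `freq`) are invented for illustration, and `top_predictions` is a hypothetical helper, so this is one possible layout rather than the app's actual schema.

```r
library(data.table)

# Toy n-gram table; values and column names are hypothetical
ngrams <- data.table(
  context    = c("i like", "i like", "i like", "i like", "thank you"),
  prediction = c("the", "to", "it", "zebras", "for"),
  freq       = c(120L, 95L, 60L, 3L, 210L)
)

# Drop rare n-grams, mirroring the frequency cutoff of 50
ngrams <- ngrams[freq >= 50]

# Key on the context so lookups use binary search instead of a full scan
setkey(ngrams, context)

# Return the n most frequent continuations of a context
top_predictions <- function(ctx, n = 3) {
  hits <- ngrams[.(ctx), nomatch = NULL]
  head(hits[order(-freq), prediction], n)
}

top_predictions("i like")
```

Keying the table once up front means each lookup while the user types is a binary search on `context`, which is one plausible reason data.table helps with responsiveness on a table with about a million rows.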

The app