Predictive Text Model using N-grams and Backoff Strategy

November 10, 2024

How the Model Works

Evaluated on a subset of the SwiftKey dataset: 1,500,000 lines (~45 million words).
The model uses n-grams, from unigrams up to 5-grams, to predict the next word:
- It considers the last 1 to 4 words typed by the user.
- It looks up frequency tables to identify the most likely next words.
- It applies a backoff strategy: if no match is found with a longer n-gram, it falls back to shorter ones.
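As an illustration of the frequency tables described above, here is a minimal Python sketch of how such tables could be built. The function name `build_tables`, the whitespace tokenization, and the toy corpus are assumptions made for this example; the app's actual table format is not shown here.

```python
from collections import Counter, defaultdict

def build_tables(lines, max_n=5):
    """Map each context (a tuple of 0 to max_n - 1 words) to counts of the word that follows it."""
    tables = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()  # simplistic tokenization, for illustration only
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                # The first n - 1 words form the context; the last word is the one to predict.
                *context, next_word = tokens[i:i + n]
                tables[tuple(context)][next_word] += 1
    return tables

# Hypothetical toy corpus, not taken from the SwiftKey data.
corpus = ["thanks for the follow", "thanks for the help", "thanks for the follow"]
tables = build_tables(corpus)
print(tables[("for", "the")].most_common(2))  # [('follow', 2), ('help', 1)]
```

The empty context `()` holds the unigram counts, so one dictionary covers every n-gram order from 1 to 5.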
Optimized for real-time performance:
- Response times are less than 0.5 seconds, suitable for interactive use.
- Displays the top 3 predictions along with their probabilities.
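One plausible way to obtain the displayed probabilities is to normalize the counts of the matched context. The sketch below assumes relative frequencies within that context; the counts and the helper name `top_predictions` are hypothetical, and the app may compute its probabilities differently.

```python
from collections import Counter

def top_predictions(counts, k=3):
    """Return the k most frequent next words with their relative frequencies."""
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(k)]

# Hypothetical counts of words observed after one context.
counts = Counter({"day": 6, "time": 3, "way": 2, "thing": 1})
for word, probability in top_predictions(counts):
    print(f"{word}: {probability:.0%}")
```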

Tokenization: The user’s input is split into individual tokens (words).
N-gram Matching: The app searches the n-gram tables for the best match based on the most recent words.
Backoff Strategy: If a higher-order n-gram match isn’t found, the model falls back to lower-order n-grams.
Prediction Output: The app displays the top predictions, sorted by frequency, with their confidence levels.
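The four steps above can be sketched end to end as follows. This is a simplified illustration, not the app's code: the regex tokenizer, the toy `TABLES` dictionary, and the function names are assumptions, and the backoff shown is plain backoff with no discounting, which may differ from what the app does.

```python
import re
from collections import Counter

# Hypothetical toy tables: context tuple -> counts of the next word.
TABLES = {
    ("for", "the"): Counter({"follow": 2, "help": 1}),
    ("the",): Counter({"follow": 2, "help": 1, "day": 1}),
    (): Counter({"the": 4, "for": 3, "thanks": 3}),  # unigram fallback
}

def tokenize(text):
    """Lowercase the input and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def predict(text, tables, max_context=4, k=3):
    """Try the longest context first, then back off to shorter ones."""
    tokens = tokenize(text)
    for size in range(min(max_context, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - size:])  # last `size` words
        counts = tables.get(context)
        if counts:
            total = sum(counts.values())
            ranked = [(w, c / total) for w, c in counts.most_common(k)]
            return size, ranked
    return 0, []

size, ranked = predict("Thanks for the", TABLES)
print(size, ranked)
```

In this toy run the three-word context has no entry, so the lookup backs off to the two-word context `("for", "the")` and ranks "follow" ahead of "help". Input with no matching context at all falls through to the unigram table.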

Visit the app: Shiny App
Key features:
- Real-time, responsive predictions as you type.
- Visual progress bars indicate the confidence of each prediction.
- Example sentences allow users to explore the app’s capabilities.