The task involves working with large text datasets in multiple languages (English, German, Russian, and Finnish), which must be cleaned before they are passed to the NLP model.
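As a rough illustration of what such a cleaning step could look like, here is a minimal sketch. The function name `clean_text` and the specific rules (Unicode NFC normalization, lowercasing, dropping digits and punctuation, keeping apostrophes and hyphens inside words) are assumptions made for this example and are not prescribed by the task.

```python
import re
import unicodedata

# A word is a run of Unicode letters, optionally joined by an apostrophe
# or hyphen (e.g. "it's", "well-known"). [^\W\d_] means "letter" in any
# script, so Latin and Cyrillic text are handled by the same pattern.
_WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019-][^\W\d_]+)*")


def clean_text(text: str) -> list[str]:
    """Normalize raw text and split it into lowercase word tokens.

    Hypothetical helper: the cleaning rules here are illustrative assumptions.
    """
    # NFC merges base letters with combining marks, so "a" + U+0308
    # becomes the single character "ä" and is not split by the regex.
    text = unicodedata.normalize("NFC", text)
    # lower() rather than casefold(), so the German "ß" is kept as-is.
    text = text.lower()
    return _WORD_RE.findall(text)
```

For example, `clean_text("Привет, мир!")` yields the two tokens `привет` and `мир`. A real pipeline would probably need language-specific rules on top of this, such as handling of URLs, numbers, or sentence boundaries.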
The goal is to build a basic n-gram model that predicts the next word based on the preceding 1, 2, or 3 words. This involves storing n-gram counts efficiently, handling unseen n-grams, and smoothing probabilities so that every word combination has a non-zero probability.
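The ideas in this paragraph can be sketched as follows. This is one possible approach under stated assumptions, not the required design: the class name `NGramModel`, add-k (Laplace) smoothing, and a simple back-off to shorter contexts are all choices made for this example. The empty context (plain word frequencies) is included as a last-resort fallback, which goes slightly beyond the 1 to 3 word contexts described above.

```python
from collections import Counter, defaultdict


class NGramModel:
    """Hypothetical next-word model with add-k smoothing and back-off."""

    def __init__(self, max_context=3, k=1.0):
        self.max_context = max_context  # longest context, in words
        self.k = k                      # smoothing constant (assumption)
        self.counts = defaultdict(Counter)  # context tuple -> follower counts
        self.vocab = set()

    def train(self, tokens):
        self.vocab.update(tokens)
        for i, word in enumerate(tokens):
            # Record the word under every context of length 0..max_context.
            for n in range(self.max_context + 1):
                if i - n < 0:
                    break
                self.counts[tuple(tokens[i - n:i])][word] += 1

    def prob(self, context, word):
        """Add-k smoothed P(word | context); non-zero once vocab is non-empty."""
        context = tuple(context)[-self.max_context:]
        followers = self.counts.get(context, Counter())
        total = sum(followers.values())
        return (followers[word] + self.k) / (total + self.k * len(self.vocab))

    def predict(self, context, top=3):
        """Return up to `top` likely next words, backing off if unseen."""
        context = tuple(context)[-self.max_context:]
        # Drop the oldest word until the context has been seen in training.
        while context and context not in self.counts:
            context = context[1:]
        followers = self.counts.get(context, Counter())
        return [w for w, _ in followers.most_common(top)]
```

After training on the tokens of "the cat sat on the mat the cat ran", `predict(["on", "the"])` suggests "mat", while an unseen context such as `["purple", "the"]` backs off to the one-word context "the". Add-k smoothing is the simplest option; schemes such as Kneser-Ney usually give better estimates but are more involved.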
The model must be optimized for size and runtime, as it should be capable of running on devices with limited memory and processing power, like mobile phones. Balancing memory usage and prediction speed is crucial to providing a smooth user experience.
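One way to trade accuracy for memory is sketched below, purely as an assumption about how the size constraint might be approached. The hypothetical function `compact_table` replaces words with integer IDs, drops followers seen fewer than `min_count` times, and keeps only the `top_k` followers per context. Note that this discards the counts themselves, so the resulting table supports lookup of suggestions but not smoothed probabilities.

```python
from collections import Counter


def compact_table(counts, min_count=2, top_k=3):
    """Shrink a {context tuple: Counter of followers} mapping.

    Hypothetical helper: thresholds and layout are illustrative assumptions.
    Returns (table, id_to_word), where table maps a tuple of word IDs to a
    tuple of follower IDs ordered from most to least frequent.
    """
    word_to_id = {}
    id_to_word = []

    def wid(word):
        # Assign the next free integer ID the first time a word is seen.
        if word not in word_to_id:
            word_to_id[word] = len(id_to_word)
            id_to_word.append(word)
        return word_to_id[word]

    table = {}
    for context, followers in counts.items():
        # Keep only the most frequent followers that pass the threshold.
        kept = [w for w, c in followers.most_common(top_k) if c >= min_count]
        if kept:
            key = tuple(wid(w) for w in context)
            table[key] = tuple(wid(w) for w in kept)
    return table, id_to_word
```

Lookup then becomes a single dictionary access on a small tuple of integers, which keeps prediction fast. On a real device the table would more likely be serialized into a compact binary structure such as a trie or sorted array; the dictionary here only illustrates the pruning idea.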