2024-08-21

Background

  • The task involves working with large text datasets in multiple languages (English, German, Russian, and Finnish), which must be cleaned before being passed to the NLP model.

  • The goal is to build a basic n-gram model that predicts the next word based on the preceding 1, 2, or 3 words. This involves creating efficient storage methods, handling unseen n-grams, and smoothing probabilities to ensure all word combinations have a non-zero probability.

  • The model must be optimized for size and runtime, as it should be capable of running on devices with limited memory and processing power, like mobile phones. Balancing memory usage and prediction speed is crucial to providing a smooth user experience.

Approach

1- Before implementing the two methods, the data were split into 80% training and 20% test sets. The text was then normalized by converting it to lowercase, tokenizing it, removing numbers, "#" and "@" symbols, URLs, and other non-alphabetic characters, and collapsing extra spaces.

2- Using Markov chains, an n-gram model was built to predict the next word from the preceding 1 to 5 words. It used add-one smoothing to prevent zero probabilities and a backoff strategy that handles unseen n-grams by reverting to lower-order models. Although this approach handled unseen data well, its accuracy was low, at around 70%. To improve accuracy, a refined n-gram model was implemented without smoothing after data cleaning, which resulted in moderate accuracy. In addition, removing n-grams that appeared only once reduced data complexity and may have improved the model.
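The two steps above can be sketched in a few lines. This is a minimal illustration, not the app's actual code: the names clean_text, split_train_test, and BackoffNgramModel are hypothetical, Python is used only for illustration, and the backoff shown is the simplest variant (try the longest context first, then shorter ones, then plain word frequency).

```python
import random
import re
from collections import Counter, defaultdict


def clean_text(text):
    """Lowercase, drop URLs, then drop everything that is not a letter."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # remove URLs
    # Remove digits, '#', '@', punctuation and underscores. Unicode letters
    # are kept so German, Russian and Finnish words survive.
    text = re.sub(r"[\W\d_]+", " ", text)
    return text.split()  # split() also collapses extra spaces


def split_train_test(lines, train_frac=0.8, seed=42):
    """Shuffle and split into training and test parts (80/20 by default)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


class BackoffNgramModel:
    """Hypothetical n-gram model with add-one smoothing and simple backoff."""

    def __init__(self, max_order=6):
        self.max_order = max_order           # 6-grams = up to 5 context words
        self.counts = defaultdict(Counter)   # context tuple -> next-word counts
        self.unigrams = Counter()

    def train(self, sentences):
        for tokens in sentences:
            for i, word in enumerate(tokens):
                self.unigrams[word] += 1
                for k in range(1, self.max_order):
                    if i - k < 0:
                        break
                    self.counts[tuple(tokens[i - k:i])][word] += 1

    def prob(self, context, word):
        """Add-one estimate: (count + 1) / (context total + vocabulary size)."""
        nexts = self.counts.get(tuple(context), Counter())
        return (nexts[word] + 1) / (sum(nexts.values()) + len(self.unigrams))

    def predict(self, tokens, top_k=3):
        """Try the longest known context, backing off to shorter ones."""
        for k in range(min(self.max_order - 1, len(tokens)), 0, -1):
            context = tuple(tokens[-k:])
            if context in self.counts:
                best = self.counts[context].most_common(top_k)
                return [(w, self.prob(context, w)) for w, _ in best]
        total = sum(self.unigrams.values())
        return [(w, c / total) for w, c in self.unigrams.most_common(top_k)]


model = BackoffNgramModel(max_order=3)
model.train([clean_text("The cat sat on the mat"),
             clean_text("The cat ate the fish")])
print(model.predict(clean_text("a dog and the")))
```

In the usage lines, the context "and the" was never seen in training, so the model backs off to the single-word context "the". The simple regular expression also splits contractions such as "don't" into two tokens, which a real pipeline may want to handle separately.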

Application and Features

How to use the app:

1- The app has a text box where you enter your sentence.

2- After writing your sentence, press "Predict next word".

3- The output is a list of the 3 most frequent next words with their frequencies. A graph was also added to make them easier to compare.
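The output step can be sketched as follows. This is an illustrative assumption, not the app's code: top_candidates and text_bar_chart are hypothetical helpers, the counts are made-up example values, and a text bar chart stands in for the app's graph.

```python
from collections import Counter


def top_candidates(next_word_counts, k=3):
    """Return the k most frequent candidate words with their counts."""
    return Counter(next_word_counts).most_common(k)


def text_bar_chart(candidates, width=20):
    """Render (word, count) pairs as bars scaled to the largest count."""
    if not candidates:
        return ""
    peak = max(count for _, count in candidates)
    lines = []
    for word, count in candidates:
        bar = "#" * max(1, round(width * count / peak))
        lines.append(f"{word:<10} {bar} {count}")
    return "\n".join(lines)


# Made-up counts for the words that follow some input sentence.
example_counts = {"you": 12, "the": 7, "a": 3, "me": 1}
print(text_bar_chart(top_candidates(example_counts)))
```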

Challenges and Future Plans

1- The data was huge and I did not have enough RAM on my laptop, so I used an ensemble technique and random sampling to tackle this issue. I also learned that removing stop words in this application leads to incorrect predictions.

2- The free plan of the Shiny app allows only 1 GB of data, so only 10,000 rows were used from each n-gram table, from 2-grams to 6-grams.

3- The two methods (n-grams and Markov chains) are not the best approaches for predicting the next word. For more accurate results, I will implement neural network methods such as GPT in the future.
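The memory workarounds from items 1 and 2 can be sketched as below. This is a rough illustration under stated assumptions: sample_lines and shrink_table are hypothetical names, the 10% sampling fraction is arbitrary, and keeping the most frequent rows is an assumption, since the slides do not say how the 10,000 rows were chosen.

```python
import random
from collections import Counter


def sample_lines(lines, fraction=0.1, seed=1):
    """Keep a random subset of the lines so the corpus fits in memory."""
    if not lines:
        return []
    k = max(1, int(len(lines) * fraction))
    return random.Random(seed).sample(lines, k)


def shrink_table(ngram_counts, max_rows=10_000, min_count=2):
    """Drop n-grams seen only once, then keep the most frequent rows."""
    kept = Counter({g: c for g, c in ngram_counts.items() if c >= min_count})
    return kept.most_common(max_rows)


# Made-up bigram counts; ("b", "c") is dropped because it appears only once.
bigram_counts = Counter({("a", "b"): 5, ("b", "c"): 1, ("c", "d"): 3})
print(shrink_table(bigram_counts, max_rows=2))
```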

Thank You