The prediction engine uses N-gram models:
- Bigrams (2-word sequences)
- Trigrams (3-word sequences)
- Quadgrams (4-word sequences)
Built using cleaned corpora from Blogs, News, and Twitter.
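As a rough illustration, the counting step could look like the Python sketch below (the `build_ngram_tables` function and the toy token list are hypothetical, not the project's actual code):

```python
from collections import Counter

def build_ngram_tables(tokens):
    """Count every bigram, trigram, and quadgram in a token list."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in (2, 3, 4)
    }

# Toy input; the real tokens would come from the cleaned corpora above.
tokens = "i want to go to the store i want to eat".split()
tables = build_ngram_tables(tokens)
print(tables[3].most_common(2))  # the two most frequent trigrams
```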
Prediction logic uses the Stupid Backoff algorithm (see the sketch after this list):
- Matches the user's last 3 words against the quadgram table.
- Backs off to trigrams (last 2 words), then bigrams (last word), if no match is found.
- Returns the most frequent continuation at the first level that matches.
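A minimal sketch of that backoff loop, reusing the `tables` structure from the sketch above. (Full Stupid Backoff also applies a discount factor, typically 0.4, at each backoff level; because this version returns at the first matching level, the discount would not change the top pick.)

```python
def predict_next(tokens, tables):
    """Backoff lookup as described above: try quadgrams on the last 3
    words, then back off to trigrams and bigrams."""
    for n in (4, 3, 2):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough context words for this level
        # Candidate continuations: n-grams whose prefix equals the context.
        candidates = {
            gram[-1]: count
            for gram, count in tables[n].items()
            if gram[:-1] == context
        }
        if candidates:
            # Most frequent continuation at the first level that matches.
            return max(candidates, key=candidates.get)
    return None  # no match at any level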
Preprocessing included:
- Lowercasing
- Removing punctuation, numbers, stopwords
- Tokenization
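A hedged sketch of such a cleaning pipeline (the stopword list here is a small illustrative subset, not the one the project actually used):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # illustrative subset

def clean(text):
    """Lowercase, strip punctuation and numbers, tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes punctuation and digits
    return [t for t in text.split() if t not in STOPWORDS]

print(clean("I'd like 2 cups of coffee, please!"))
# ['i', 'd', 'like', 'cups', 'coffee', 'please']
```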
How the N-Gram Model Works
Data Collection:
- Brief Description: Gather a large corpus of text data from relevant sources. Preprocess the text by tokenizing and cleaning to ensure the data is in a suitable format for model training.
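For illustration, reading and downsampling the raw files might look like the sketch below (the file names are assumptions based on the Blogs, News, and Twitter sources named above, and `sample_corpus` is a hypothetical helper):

```python
import random

def sample_corpus(paths, fraction=0.1, seed=42):
    """Read raw text files and keep a random fraction of lines,
    since the full corpora are too large to use whole."""
    random.seed(seed)
    lines = []
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            lines.extend(line for line in f if random.random() < fraction)
    return lines

# Assumed file names matching the sources listed above.
corpus = sample_corpus(["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"])
```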
Model Training:
- Brief Description: Build the n-gram model by analyzing the sequences of words in the training data. The model learns the probability of a word occurring given the preceding one, two, or three words.
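Continuing the table-building sketch above, the maximum-likelihood estimate of that conditional probability can be computed as the count of the context plus the word, divided by the total count of n-grams sharing that context:

```python
def conditional_prob(word, context, tables):
    """MLE estimate: P(word | context) = count(context + word)
    divided by the total count of n-grams starting with context."""
    context = tuple(context)
    n = len(context) + 1
    numer = tables[n][context + (word,)]
    denom = sum(c for gram, c in tables[n].items() if gram[:-1] == context)
    return numer / denom if denom else 0.0

print(conditional_prob("go", ("want", "to"), tables))  # 0.5 on the toy data
```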
Prediction:
- Brief Description: Use the trained model to predict the next word in a sequence based on the previous one, two, or three words provided by the user.
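Putting the earlier sketches together, a prediction call on user input might look like this usage example:

```python
# Reusing clean(), predict_next(), and tables from the sketches above.
user_input = "I want"                  # raw text typed by the user
tokens = clean(user_input)             # same preprocessing as training
print(predict_next(tokens, tables))    # top suggestion, or None if no match
```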
Validation:
- Brief Description: Evaluate the trained model on held-out text that was not used for training, for example by checking how often the predicted word matches the actual next word.
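One simple way to run that check (a sketch reusing `predict_next` from above; the metric and function name are illustrative, not the project's actual evaluation code) is top-1 next-word accuracy over held-out tokens:

```python
def top1_accuracy(test_tokens, tables, context_len=3):
    """Share of held-out positions where the top prediction equals
    the actual next word."""
    hits = total = 0
    for i in range(context_len, len(test_tokens)):
        context = test_tokens[i - context_len:i]
        if predict_next(context, tables) == test_tokens[i]:
            hits += 1
        total += 1
    return hits / total if total else 0.0
```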