2025-01-29
In this project, I will build an application that predicts the next word based on the input text. I will briefly explain how the prediction works and how to build the model.
The corpus contains 3,336,695 lines and 70,351,643 words in total.
The sentences include emoji, symbols (such as $), and foreign languages (such as Japanese), so I will clean the text before building the model.
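As a rough illustration, here is a minimal cleaning sketch in Python. The file name `corpus.txt`, the function name `clean_line`, and the decision to keep only ASCII letters and apostrophes are assumptions made for this example, not the project's exact cleaning rules.

```python
import re

def clean_line(line: str) -> str:
    """Minimal cleaning sketch: keep ASCII letters, apostrophes, and spaces.

    This drops emoji, symbols such as $, and non-English text (e.g. Japanese),
    which is a blunt approximation of the cleaning described above.
    """
    line = line.lower()
    # Remove everything that is not a letter, an apostrophe, or whitespace.
    line = re.sub(r"[^a-z'\s]", " ", line)
    # Collapse repeated whitespace left behind by the removals.
    return re.sub(r"\s+", " ", line).strip()

# "corpus.txt" is a hypothetical file name; the real corpus may be split
# across several source files.
with open("corpus.txt", encoding="utf-8") as f:
    cleaned = [clean_line(line) for line in f if line.strip()]
```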
| Previous 3 words | Next word | Probability |
|---|---|---|
| I have a | dream | 34.5% |
| | pen | 10.2% |
| | book | 7.5% |
An N-gram model predicts the next word based on combinations of words. It statistically counts word combinations in a large number of documents and predicts the next word with the highest probability.
In this case, the next word after ‘I have a’ would be ‘dream’. This app uses N-grams to predict the next word.
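To make this concrete, here is a small Python sketch of how 4-gram counts can be turned into "next word given the previous 3 words" probabilities like those in the table above. The function names and the toy sentences are illustrative assumptions; the real probabilities come from the full corpus.

```python
from collections import Counter, defaultdict

def count_ngrams(sentences, n=4):
    """Count how often each next word follows each (n-1)-word prefix."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            prefix = tuple(words[i:i + n - 1])
            counts[prefix][words[i + n - 1]] += 1
    return counts

def next_word_probs(counts, prefix):
    """Return candidate next words with their estimated probabilities."""
    candidates = counts[tuple(prefix)]
    total = sum(candidates.values())
    return {w: c / total for w, c in candidates.items()} if total else {}

# Toy example; with the real corpus, 'dream' would come out on top.
counts = count_ngrams(["i have a dream", "i have a pen", "i have a book"])
print(next_word_probs(counts, ["i", "have", "a"]))
```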
The base model is a 4-gram, but low-frequency combinations may not yield reliable results, so combinations seen fewer than 10 times in the corpus are excluded.
If the input does not match any 4-gram in the model, the prediction falls back to a 3-gram (the preceding 2 words) with a frequency of at least 10. If that also fails, it falls back to a 2-gram (the preceding 1 word) with a frequency of at least 10, predicting the word with the highest frequency among the matching combinations, as sketched below.
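The following Python sketch shows one way this frequency-threshold backoff could look, reusing the hypothetical `count_ngrams` tables from the previous sketch. It illustrates the described logic under those assumptions, not the app's actual implementation.

```python
MIN_COUNT = 10  # combinations seen fewer than 10 times are excluded

def predict(counts_by_order, last_words):
    """Back off from the 4-gram to the 3-gram and then the 2-gram model.

    counts_by_order maps n (4, 3, 2) to the prefix -> Counter tables built
    as in count_ngrams above; last_words is the list of preceding words.
    """
    for n in (4, 3, 2):
        prefix = tuple(last_words[-(n - 1):])
        candidates = counts_by_order[n].get(prefix, {})
        # Keep only combinations that reach the frequency threshold.
        frequent = {w: c for w, c in candidates.items() if c >= MIN_COUNT}
        if frequent:
            # Predict the most frequent next word at this order.
            return max(frequent, key=frequent.get)
    return None  # no prediction if even the 2-gram has no frequent match

# Usage sketch (counts_by_order would be built from the cleaned corpus):
# counts_by_order = {n: count_ngrams(cleaned, n) for n in (4, 3, 2)}
# predict(counts_by_order, "i have a".split())
```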