NLP Milestone

Satoshi Ohnishi

2025-01-29

About the project

In this project, I will build an application that predicts the next word based on the input text. I will briefly explain how next-word prediction works and how the model is built.

The next-word prediction model is an N-gram model.
To build the model, I will use data from an English-language corpus (blogs, news, and Twitter).
The application will be built with Shiny in R.

About Corpus

The corpus contains 3,336,695 lines and 70,351,643 words in total.
The sentences include emoji, symbols (such as $), and foreign languages (such as Japanese).
So, I will clean the texts before building the model.
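
The cleaning step might look like the following minimal sketch. It is written in Python for illustration only (the app itself is built in R), and the function name `clean_line` and the exact rules (lowercasing, keeping only ASCII letters and apostrophes) are assumptions, not the rules the project necessarily uses.

```python
import re

def clean_line(line: str) -> str:
    """Lowercase a line and keep only ASCII letters, apostrophes, and spaces.

    Hypothetical helper: the cleaning rules here are assumptions for illustration.
    """
    # Drop every non-ASCII character (emoji, Japanese text, and so on).
    ascii_only = line.encode("ascii", errors="ignore").decode("ascii")
    # Replace anything that is not a letter, apostrophe, or whitespace with a space.
    letters_only = re.sub(r"[^a-z'\s]", " ", ascii_only.lower())
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", letters_only).strip()

print(clean_line("I paid $5 for sushi 🍣 (寿司) today!"))
# prints: i paid for sushi today
```

Removing digits and punctuation outright is a simple choice; a real pipeline might instead keep sentence boundaries so that N-grams do not span two sentences.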

What is an N-gram

Previous 3 words	Next word	Probability
I have a	dream	34.5%
	pen	10.2%
	book	7.5%

An N-gram model predicts the next word from combinations of words. It counts how often word combinations occur across a large number of documents and predicts the next word with the highest probability.
In this case, the next word after ‘I have a’ would be ‘dream’. This app uses N-grams to predict the next word.
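
The counting idea can be sketched as follows, in Python for illustration (the app itself is built in R). The function names `count_ngrams` and `next_word_probabilities` and the four-sentence toy corpus are assumptions made for this example; the probabilities it produces are not the ones in the table above.

```python
from collections import Counter, defaultdict

def count_ngrams(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            table[context][words[i + n - 1]] += 1
    return table

def next_word_probabilities(table, context):
    """Relative frequency of each word observed after the given context."""
    counts = table.get(tuple(context), Counter())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy corpus (hypothetical data, already cleaned and lowercased).
toy_corpus = [
    "i have a dream",
    "i have a dream",
    "i have a pen",
    "i have a book",
]
table = count_ngrams(toy_corpus, 4)
probs = next_word_probabilities(table, ["i", "have", "a"])
print(probs)
# prints: {'dream': 0.5, 'pen': 0.25, 'book': 0.25}
print(max(probs, key=probs.get))
# prints: dream
```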

How to predict next word

The base model is a 4-gram model, which predicts from the preceding 3 words. Low-frequency combinations tend to give unreliable predictions, so combinations seen fewer than 10 times in the corpus are excluded.
If the preceding 3 words do not match any remaining 4-gram, the prediction falls back to a 3-gram (the preceding 2 words), again requiring a frequency of at least 10. If that also finds no match, it falls back to a 2-gram (the preceding word) with the same threshold. At the first level that matches, the app predicts the word with the highest frequency among the matching combinations.
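
The fallback procedure can be sketched as below, in Python for illustration (the app itself is built in R). The names `build_tables` and `predict_next`, the toy sentences, and the choice to return `None` when no level matches are all assumptions; the text does not say what the app does in that last case.

```python
from collections import Counter, defaultdict

MIN_COUNT = 10  # frequency threshold stated in the text
MAX_N = 4       # the base model is a 4-gram

def build_tables(sentences, max_n=MAX_N, min_count=MIN_COUNT):
    """Return {n: {context: Counter(next_word)}}, keeping n-grams seen >= min_count times."""
    tables = {}
    for n in range(2, max_n + 1):
        raw = defaultdict(Counter)
        for sentence in sentences:
            words = sentence.split()
            for i in range(len(words) - n + 1):
                raw[tuple(words[i:i + n - 1])][words[i + n - 1]] += 1
        pruned = {}
        for context, counts in raw.items():
            kept = Counter({w: c for w, c in counts.items() if c >= min_count})
            if kept:
                pruned[context] = kept
        tables[n] = pruned
    return tables

def predict_next(tables, text, max_n=MAX_N):
    """Try the 4-gram first, then fall back to the 3-gram and the 2-gram."""
    words = text.split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough preceding words for this level
        context = tuple(words[-(n - 1):])
        counts = tables[n].get(context)
        if counts:
            return counts.most_common(1)[0][0]
    return None  # assumption: no prediction when every level misses

# Toy corpus (hypothetical data, already cleaned and lowercased).
sentences = ["i have a dream"] * 12 + ["you have a pen"] * 3 + ["a cat"] * 20
tables = build_tables(sentences)

print(predict_next(tables, "i have a"))     # 4-gram match: dream
print(predict_next(tables, "you have a"))   # 4-gram pruned, 3-gram "have a": dream
print(predict_next(tables, "they want a"))  # only the 2-gram "a" matches: cat
print(predict_next(tables, "hello world"))  # nothing matches: None
```

In this toy data, "you have a pen" occurs only 3 times, so it is excluded at every level and the prediction for "you have a" comes from the 3-gram context "have a" instead.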