NLP Milestone

Satoshi Ohnishi

2025-01-29

About the project

In this project, I will build an application that predicts the next word from the input text. I will briefly explain how next-word prediction works and how the model is built.

  1. The next-word prediction model is an N-gram model.
  2. The model is built from an English-language corpus (blogs, news, and Twitter).
  3. The application will be built with Shiny, a web application framework for R.

About the corpus

  • 3,336,695 lines and 70,351,643 words in total.

  • The sentences include emoji, symbols (such as $), and foreign-language text (such as Japanese).

  • So, I will clean the texts before building the model.
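The cleaning step can be sketched as follows. This is a minimal illustration in Python (the project itself is built in R), not the actual cleaning code. The function name `clean_text` is hypothetical, and lowercasing and dropping digits are assumptions that the text above does not state.

```python
import re

def clean_text(line: str) -> str:
    """Lowercase a line and keep only ASCII letters, apostrophes, and spaces."""
    # Assumption: the model is case-insensitive, so lowercase everything.
    line = line.lower()
    # Drop every non-ASCII character (emoji, Japanese text, and so on).
    line = line.encode("ascii", errors="ignore").decode("ascii")
    # Replace anything that is not a letter, apostrophe, or whitespace
    # (symbols such as $, digits, punctuation) with a space.
    line = re.sub(r"[^a-z'\s]", " ", line)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", line).strip()

cleaned = clean_text("I paid $5 for すし 🍣 today!")
# cleaned == "i paid for today"
```

Stripping all non-ASCII characters is a blunt choice: it also removes accented letters and curly apostrophes, so a real pipeline may want to normalize those first.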

What is an N-gram?

Previous 3 words    Next word    Probability
I have a            dream        34.5%
I have a            pen          10.2%
I have a            book         7.5%
  • An N-gram model predicts the next word from the words that precede it. It counts how often each word combination occurs in a large collection of documents and predicts the next word with the highest probability.

  • In this case, the next word after ‘I have a’ would be ‘dream’. This app uses an N-gram model to predict the next word.
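To make the counting concrete, here is a minimal sketch in Python (the project itself is built in R). The function names and the toy corpus are invented for illustration, so the probabilities differ from the table above. It assumes the text has already been cleaned and split into sentences.

```python
from collections import Counter, defaultdict

def count_ngrams(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            table[context][words[i + n - 1]] += 1
    return table

def next_word_probabilities(table, context):
    """Relative frequency of each word observed after `context`."""
    counts = table.get(tuple(context), Counter())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy corpus, invented for illustration only.
toy_corpus = [
    "i have a dream",
    "i have a dream",
    "i have a pen",
    "i have a book",
]
table = count_ngrams(toy_corpus, n=4)
probs = next_word_probabilities(table, ["i", "have", "a"])
# probs == {"dream": 0.5, "pen": 0.25, "book": 0.25}
best = max(probs, key=probs.get)
# best == "dream"
```

The probability here is simply the count of a 4-gram divided by the count of all 4-grams sharing the same first three words.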

How to predict the next word

  • The base model is a 4-gram model (the preceding 3 words). Low-frequency combinations tend to give unreliable predictions, so combinations seen fewer than 10 times in the corpus are excluded.

  • If the preceding 3 words are not found in the 4-gram model, the prediction falls back to the 3-gram model (the preceding 2 words), again requiring a frequency of at least 10. If that also fails, it falls back to the 2-gram model (the preceding 1 word) with the same threshold. The prediction is the word with the highest frequency among the matching combinations.
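The fallback procedure can be sketched as follows. This is a minimal illustration in Python (the project itself is built in R), not the app's actual code. The names `build_tables` and `predict_next` and the toy corpus are hypothetical, and returning `None` when no model matches is an assumption, since the text does not say what happens in that case.

```python
from collections import Counter, defaultdict

MIN_COUNT = 10  # threshold from the text: drop combinations seen fewer than 10 times

def build_tables(sentences, max_n=4, min_count=MIN_COUNT):
    """Return {n: {context: Counter}} for n = 2..max_n, pruned by min_count."""
    tables = {}
    for n in range(2, max_n + 1):
        table = defaultdict(Counter)
        for sentence in sentences:
            words = sentence.split()
            for i in range(len(words) - n + 1):
                table[tuple(words[i:i + n - 1])][words[i + n - 1]] += 1
        # Keep only combinations seen at least min_count times.
        pruned = {}
        for context, counts in table.items():
            kept = Counter({w: c for w, c in counts.items() if c >= min_count})
            if kept:
                pruned[context] = kept
        tables[n] = pruned
    return tables

def predict_next(tables, text, max_n=4):
    """Try the 4-gram first, then back off to the 3-gram and the 2-gram."""
    words = text.split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough input words for this model
        context = tuple(words[-(n - 1):])
        counts = tables[n].get(context)
        if counts:
            return counts.most_common(1)[0][0]
    return None  # assumption: no prediction when every model fails

# Toy corpus, invented for illustration only.
toy = ["i have a dream"] * 12 + ["you have a pen"] * 10 + ["we want a book"] * 3
tables = build_tables(toy)

first = predict_next(tables, "i have a")      # found in the 4-gram model
second = predict_next(tables, "they have a")  # falls back to the 3-gram model
third = predict_next(tables, "we want a")     # falls back to the 2-gram model
```

In this toy corpus, "we want a book" occurs only 3 times, so it is pruned at every level and the 2-gram model answers from the much more frequent words that follow "a".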