Next Word Prediction using N-Gram Language Model

How the Prediction Model Works

The prediction model was built using the English blogs, news and Twitter datasets provided for this project.

The following steps were used to build the model:

Combined the three text datasets into one corpus.
Converted all text to lowercase.
Removed punctuation, numbers and extra spaces.
Created unigram, bigram, trigram and quadgram tables.
Counted how often each word combination appeared.
Saved the final n-gram tables as RDS files to reduce loading time.
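
The build steps above can be sketched in base R roughly as follows. This is a minimal illustration, not the project's actual code: the helper names clean_text() and count_ngrams() are hypothetical, the three-line corpus is a stand-in for the combined datasets, and punctuation is assumed to be deleted outright rather than replaced by a space.

```r
# Hypothetical helpers; base R only. A sketch of the cleaning and counting steps.

clean_text <- function(x) {
  x <- tolower(x)                        # convert to lowercase
  x <- gsub("[[:punct:]]+", "", x)       # remove punctuation (assumed: delete, not replace)
  x <- gsub("[[:digit:]]+", "", x)       # remove numbers
  x <- gsub("[[:space:]]+", " ", x)      # collapse runs of whitespace
  trimws(x)                              # drop leading and trailing spaces
}

count_ngrams <- function(lines, n) {
  # Build every run of n consecutive words within each line.
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  # Count occurrences and order from most to least frequent.
  tab <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(tab), freq = as.integer(tab),
             stringsAsFactors = FALSE)
}

# Tiny stand-in corpus (not the real data).
corpus <- clean_text(c("I like R.", "I like 2 cats!", "I  like R"))

unigrams  <- count_ngrams(corpus, 1)
bigrams   <- count_ngrams(corpus, 2)
trigrams  <- count_ngrams(corpus, 3)
quadgrams <- count_ngrams(corpus, 4)   # empty here: no line has four words
```

Each table holds one row per distinct n-gram with its count, which is the shape the lookup step needs. On the full corpus a dedicated text-mining package would likely be faster than this loop-based version.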

When a user enters a phrase, the application searches these n-gram tables to find the most likely next word.

Backoff Prediction Method

Sometimes the exact phrase entered by the user does not appear in the training data. To handle this, the application uses a simple backoff approach: it falls back to progressively shorter contexts until it finds a match.

The prediction works in the following order:

First, it checks the quadgram table using the last three words.
If there is no match, it checks the trigram table using the last two words.
If there is still no match, it searches the bigram table using the last word.
If nothing is found, the model returns the most common unigram as the final prediction.

This approach ensures the application always returns a prediction, even when the exact word combination is not in the model.
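
The backoff order can be illustrated with a short base R sketch. It assumes each table is a data frame with an ngram column (the full n-gram as one space-separated string) and a freq column; that layout, the function name predict_next(), and the toy counts are all assumptions made for illustration. The input is only lowercased here, whereas a real app would probably apply the same cleaning used on the corpus.

```r
# Hypothetical lookup function; the table layout (ngram, freq) is an assumption.
predict_next <- function(phrase, tables) {
  words <- strsplit(trimws(tolower(phrase)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]

  # Try the quadgram, then trigram, then bigram table.
  for (n in 4:2) {
    k <- n - 1                                   # context length for this table
    if (length(words) < k) next                  # not enough words typed yet
    prefix <- paste(tail(words, k), collapse = " ")
    tab <- tables[[as.character(n)]]
    hits <- tab[startsWith(tab$ngram, paste0(prefix, " ")), , drop = FALSE]
    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$freq)]   # most frequent continuation
      return(sub(".* ", "", best))               # keep only the last word
    }
  }

  # Nothing matched: fall back to the most common unigram.
  uni <- tables[["1"]]
  uni$ngram[which.max(uni$freq)]
}

# Toy tables with made-up counts.
mk <- function(ngram, freq) {
  data.frame(ngram = ngram, freq = freq, stringsAsFactors = FALSE)
}
tables <- list(
  "1" = mk(c("the", "to"), c(50L, 30L)),
  "2" = mk(c("the same", "the end"), c(9L, 4L)),
  "3" = mk(c("for the first", "for the help"), c(7L, 2L)),
  "4" = mk(c("thanks for the help", "thanks for the follow"), c(3L, 5L))
)

p1 <- predict_next("Thanks for the", tables)   # matched in the quadgram table
p2 <- predict_next("waiting for the", tables)  # backs off to the trigram table
p3 <- predict_next("see the", tables)          # backs off to the bigram table
p4 <- predict_next("xyzzy", tables)            # falls through to the top unigram
```

Note that this picks the first table with any match and ignores the counts in shorter tables. It applies no discounting or score weighting, so it is simpler than Katz backoff or similar smoothed schemes.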

Performance and Shiny Application

The prediction model was designed to return results quickly while keeping memory use low.

Some steps taken to improve performance were:

A 2% random sample of the corpus was used to build the model.
Low-frequency n-grams were removed to reduce the model size.
object.size() was used to check the memory footprint of the model objects.
gc() was used during model building to free unused memory.
The final n-gram tables were saved as RDS files, so the Shiny app loads faster without reading the original text files.
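
A rough sketch of these steps in base R is shown below. The helper names sample_lines() and prune_ngrams(), the seed, the stand-in data, and especially the min_freq cutoff of 2 are assumptions for illustration; the source does not state which frequency threshold was used.

```r
set.seed(42)  # assumed seed, only so the sample is reproducible

# Stand-in for the combined corpus (not the real data).
raw_lines <- sprintf("line number %d", 1:1000)

# Keep a random fraction of the lines (2% as described above).
sample_lines <- function(lines, fraction = 0.02) {
  if (length(lines) == 0) return(lines)
  n <- max(1, round(length(lines) * fraction))
  lines[sample.int(length(lines), n)]
}
sampled <- sample_lines(raw_lines, 0.02)

# Drop rare n-grams; the cutoff is a hypothetical choice.
prune_ngrams <- function(tab, min_freq = 2L) {
  tab[tab$freq >= min_freq, , drop = FALSE]
}
trigrams <- data.frame(
  ngram = c("for the first", "for the help", "for the win", "for the cat"),
  freq  = c(7L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)
pruned <- prune_ngrams(trigrams, min_freq = 2L)

# Check the memory footprint before and after pruning.
size_before <- object.size(trigrams)
size_after  <- object.size(pruned)
print(format(size_after, units = "Kb"))

# Save the table once; the app then reads the RDS file instead of raw text.
path <- tempfile(fileext = ".rds")
saveRDS(pruned, path)
restored <- readRDS(path)

# Release the large intermediate object and run the garbage collector.
rm(raw_lines)
invisible(gc())
```

In a Shiny app, the readRDS() calls would typically sit outside the server function so that the tables are loaded once at startup rather than on every prediction.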

The Shiny application allows the user to enter a phrase, click the Predict button, and receive one predicted next word.

Project Summary

This project demonstrates a simple next word prediction application built using an n-gram language model. The prediction is based on word patterns learned from blogs, news articles and Twitter text.

A backoff method is used so the application can still suggest a word even when an exact phrase is not available in the model. This lets it handle inputs it has never seen before.

The Shiny application provides a simple interface where users can enter a phrase and receive one predicted next word. In the future, prediction quality could be improved by using a larger training dataset, better smoothing techniques and more advanced language models.