This project demonstrates a simple next-word prediction system using a Bigram Language Model. The application takes a user-entered phrase and predicts the most likely next word based on patterns learned from large text datasets including blogs, news articles, and Twitter data.
The goal of this project is to showcase basic natural language processing, model building, and deployment using R and Shiny.
The model was trained on three English text sources:
- Blogs provided conversational and informal language
- News articles provided structured and formal language
- Twitter data added short, real-world text patterns
A random sample of 2,000 lines was selected from the combined dataset to keep the model lightweight and efficient for web deployment.
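The sampling step might look like the following minimal sketch in base R. The three vectors are hypothetical stand-ins for the blog, news, and Twitter files (the real project would presumably load them with something like `readLines()`), and the seed value is an assumption added only for reproducibility.

```r
# Hypothetical stand-ins for the three corpora, one element per line of text.
blogs   <- paste("blog line", 1:3000)
news    <- paste("news line", 1:3000)
twitter <- paste("tweet", 1:3000)

# Combine the sources into one character vector.
combined <- c(blogs, news, twitter)

# Assumed seed, used only so the same sample is drawn on every run.
set.seed(42)

# Draw 2,000 lines without replacement (or fewer if the corpus is smaller).
sample_size    <- min(2000, length(combined))
training_lines <- sample(combined, size = sample_size)
```

Sampling without replacement keeps each training line distinct, and capping the size with `min()` avoids an error on a smaller corpus.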
The prediction model uses a Bigram Language Model:
- Text is cleaned and converted to lowercase
- Each sentence is split into word pairs (bigrams)
- The frequency of each bigram is calculated
- When a user enters a phrase, the model finds the most common word that follows the last word entered
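The cleaning and counting steps above can be sketched in base R as follows. This is an illustration under assumptions, not the project's actual code: `tokenize` and `make_bigrams` are hypothetical helper names, the cleaning rule (keep only letters, apostrophes, and spaces) is one possible choice, and the three toy sentences stand in for the sampled training lines.

```r
# Toy training text standing in for the sampled lines.
training_lines <- c("I love R!", "I love Shiny apps.", "We love R")

# Hypothetical helper: lowercase, replace anything that is not a letter,
# apostrophe, or space, then split into words.
tokenize <- function(line) {
  line  <- tolower(line)
  line  <- gsub("[^a-z' ]", " ", line)
  words <- strsplit(line, " +")[[1]]
  words[words != ""]
}

# Hypothetical helper: pair each word with the word that follows it.
make_bigrams <- function(words) {
  if (length(words) < 2) return(NULL)
  data.frame(first  = head(words, -1),
             second = tail(words, -1),
             stringsAsFactors = FALSE)
}

# Build one data frame holding every bigram in the training text.
bigrams <- do.call(
  rbind,
  lapply(training_lines, function(l) make_bigrams(tokenize(l)))
)

# Count how often each (first, second) pair occurs, most frequent first.
bigram_counts <- aggregate(n ~ first + second,
                           data = cbind(bigrams, n = 1L),
                           FUN  = sum)
bigram_counts <- bigram_counts[order(-bigram_counts$n), ]
```

Storing the counts as a data frame with `first`, `second`, and `n` columns keeps the later lookup to a simple row filter.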
If the last word entered never appears as the first word of a bigram in the training data, the model returns a fallback word instead.
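A lookup with a fallback could be sketched like this. The function name `predict_next`, the toy counts, and the fallback word `"the"` are all assumptions chosen for illustration; the source does not say which fallback word the app uses.

```r
# Toy bigram counts; in the app these would come from the training sample.
bigram_counts <- data.frame(
  first  = c("i", "love", "love", "shiny"),
  second = c("love", "r", "shiny", "apps"),
  n      = c(2L, 3L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical function: return the most frequent follower of the last word,
# or an assumed fallback word when the last word was never seen.
predict_next <- function(phrase, counts, fallback = "the") {
  # Apply the same kind of cleaning as in training: lowercase, letters only.
  cleaned <- trimws(gsub("[^a-z' ]", " ", tolower(phrase)))
  words   <- strsplit(cleaned, " +")[[1]]
  if (length(words) == 0) return(fallback)   # empty input

  last_word <- words[length(words)]
  matches   <- counts[counts$first == last_word, ]
  if (nrow(matches) == 0) return(fallback)   # unseen word

  matches$second[which.max(matches$n)]
}

predict_next("I love", bigram_counts)        # most frequent word after "love"
predict_next("Hello world", bigram_counts)   # "world" is unseen, so fallback
```

Note that `which.max()` returns the first maximum, so ties between equally frequent followers are broken by row order in this sketch.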
The web application provides:
- A text input box for entering a phrase
- A prediction button to generate the next word
- A real-time display of the predicted word
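An interface with these three elements might be wired up in Shiny roughly as below. This is a sketch, not the deployed app: the input and output IDs (`phrase`, `predict`, `prediction`), the labels, and the tiny `predict_next` placeholder are all hypothetical, and the real app would call its trained bigram model instead.

```r
library(shiny)

# Hypothetical placeholder for the bigram lookup, with an assumed fallback "the".
predict_next <- function(phrase) {
  lookup <- c(love = "r", shiny = "apps")
  words  <- strsplit(tolower(trimws(phrase)), " +")[[1]]
  if (length(words) == 0) return("the")
  guess <- lookup[words[length(words)]]
  if (is.na(guess)) "the" else unname(guess)
}

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Enter a phrase:"),   # text input box
  actionButton("predict", "Predict"),       # prediction button
  textOutput("prediction")                  # display of the predicted word
)

server <- function(input, output, session) {
  # Recompute the prediction only when the button is clicked.
  predicted <- eventReactive(input$predict, predict_next(input$phrase))
  output$prediction <- renderText(predicted())
}

# Build the app object; call runApp(app) in an interactive session to start it.
app <- shinyApp(ui, server)
```

Using `eventReactive()` ties the prediction to the button click; a plain `reactive()` on `input$phrase` would instead update the display as the user types.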
The app is lightweight: the small training sample keeps the bigram table compact, so the app loads quickly while still returning the most frequent continuation seen in that sample.
This project demonstrates a complete workflow:
- Data processing
- Model creation
- Web deployment
Future improvements may include:
- Using trigrams instead of bigrams
- Increasing training data size
- Adding probability scores for predictions
- Improving text cleaning and language handling
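As one possible shape for the probability-score improvement, the bigram counts can be turned into conditional probabilities, P(second | first) = count(first, second) / count(first). The sketch below uses toy counts and an assumed `prob` column name; it is not part of the current app.

```r
# Toy bigram counts for illustration only.
bigram_counts <- data.frame(
  first  = c("i", "love", "love", "shiny"),
  second = c("love", "r", "shiny", "apps"),
  n      = c(2L, 3L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Total count of all bigrams that share the same first word.
totals <- ave(bigram_counts$n, bigram_counts$first, FUN = sum)

# Relative frequency of each second word given the first word.
bigram_counts$prob <- bigram_counts$n / totals

# Candidates after "love", ranked by estimated probability.
after_love <- bigram_counts[bigram_counts$first == "love", ]
after_love[order(-after_love$prob), c("second", "prob")]
```

These are plain relative frequencies with no smoothing, so any word pair absent from the training sample still gets no score at all.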