Data Science Specialization from Johns Hopkins University
2025-07-25
This is the slide deck for the Capstone Project of the Data Science Specialization offered by Johns Hopkins University on Coursera. The goal of the Capstone Project is to create the Next Word Predict App: a Shiny app with a textbox that, like a smartphone keyboard, uses the provided text data to suggest three options for the next word.
To build this app, we used text data from three main sources: blogs, news articles, and Twitter posts. The data is available to download here; only the English-language files were used.
The Next Word Predict App is located here.
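To predict the next word in a sentence, we implemented a simple but effective n-gram backoff model, often referred to as the “stupid backoff” algorithm. The idea is to prioritize longer matching sequences when possible: the model first looks for the longest n-gram that matches the end of the input and falls back to shorter n-grams only when no longer match is found.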
This method is fast and memory-efficient, and by assigning decreasing weights to lower-order n-grams, the predictions remain reasonably accurate without requiring complex machine learning techniques.
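The backoff idea can be illustrated with a minimal Python sketch. It is not the app's actual code: the toy counts, the function names `build_index` and `predict_next`, the maximum order of 3, and the back-off factor of 0.4 (the value commonly cited for stupid backoff) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_index(ngram_counts):
    """Map each context (an n-gram minus its last word) to a Counter of next words."""
    index = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        index[ngram[:-1]][ngram[-1]] = count
    return index


def predict_next(words, index, max_order=3, alpha=0.4, k=3):
    """Return up to k candidate next words, scored in the style of stupid backoff.

    alpha is an assumed back-off factor; the slides do not state the value used.
    """
    scores = {}
    weight = 1.0
    longest = min(max_order - 1, len(words))
    # Try the longest available context first, then progressively shorter ones.
    for ctx_len in range(longest, -1, -1):
        context = tuple(words[len(words) - ctx_len:])
        followers = index.get(context)
        if followers:
            # The sum of follower counts approximates the count of the context.
            total = sum(followers.values())
            for word, count in followers.items():
                # Keep the score from the highest-order match for each word.
                scores.setdefault(word, weight * count / total)
        weight *= alpha  # every back-off step is discounted by alpha
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical toy counts, for illustration only.
demo_counts = {
    ("i", "love", "you"): 3, ("i", "love", "pizza"): 1,
    ("love", "you"): 4, ("love", "pizza"): 1, ("love", "music"): 2,
    ("you",): 10, ("the",): 20, ("pizza",): 2,
    ("music",): 3, ("love",): 7, ("i",): 8,
}
index = build_index(demo_counts)
print(predict_next(["i", "love"], index))  # ['you', 'pizza', 'music']
print(predict_next(["zzz"], index))        # ['the', 'you', 'i']
```

In the first call, "you" and "pizza" come from the trigram match, while "music" is found only after backing off to bigrams and therefore carries a discounted score. In the second call the context is unseen, so the sketch falls back to the most frequent single words.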
The final product is a Shiny web application that allows users to enter a short phrase in a textbox. As soon as the input is typed, the app predicts the most probable next word and displays the three top candidates. The app uses the preprocessed n-gram data and the backoff algorithm to generate real-time predictions. It provides a simple and intuitive interface and is hosted on shinyapps.io for broader access.
The backend ensures fast response even with large datasets. This makes the app suitable as a prototype for keyboard suggestion systems or similar NLP tools.
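One way n-gram data could be preprocessed for fast lookup is sketched below in Python. This is an assumption about a possible approach, not a description of the app's backend: the tokenizer, the maximum order of 3, the sample sentences, and the names `tokenize`, `count_ngrams`, and `top_followers` are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (a simplistic choice)."""
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(lines, max_order=3):
    """Count all n-grams of order 1..max_order, line by line."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def top_followers(counts, k=3):
    """Precompute the k most frequent next words for every context."""
    by_context = defaultdict(Counter)
    for ngram, count in counts.items():
        by_context[ngram[:-1]][ngram[-1]] = count
    return {
        context: [w for w, _ in sorted(f.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for context, f in by_context.items()
    }


# Hypothetical sample lines standing in for the blog, news, and Twitter text.
sample = ["I love you", "I love pizza", "I love you too"]
counts = count_ngrams(sample)
table = top_followers(counts)
print(table[("i", "love")])  # ['you', 'pizza']
print(table[()])             # ['i', 'love', 'you']
```

Because the ranking is done once, ahead of time, answering a query at typing speed reduces to a single dictionary lookup per context length, at the cost of storing the table.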
This project demonstrates that a basic n-gram backoff model can deliver reasonable next-word predictions in a real-time web interface. Despite its simplicity, the approach performs well with minimal computational resources.
In the future, the model could be enhanced by including support for spelling correction, context-aware suggestions, or multilingual prediction. The current version lays a solid foundation for more sophisticated natural language applications.