Next Word Prediction

2026-05-26

Introduction

This project is part of the Data Science Capstone, the tenth course of the Coursera Data Science Specialization. The course focuses on analyzing a large corpus of text documents to discover structure in the data and how words are put together, with the goal of building a predictive text model.
Contents
- Text data analysis: analysis of the corpus to understand the relationship of words and word pairs
- Predictive modeling: build basic n-gram models and develop algorithms to facilitate text prediction
- Shiny app development: produce a web-based Shiny app interface to predict next words

Getting and cleaning the data: profanity was removed first, then the text was tokenized into words
Exploratory data analysis: the frequencies of words and word pairs were calculated
Modeling: 2-gram to 4-gram models were built to facilitate word prediction
Prediction model:
- Katz’s back-off model was used to predict the next word
- The model iterates from 3-gram to 2-gram to find matches in the last n-1 words
- In the case of an unseen n-gram, the most frequent word, ‘the’, is returned
With more memory (my computer was limited to 15 GB), I could have used longer n-grams, which would likely have improved prediction performance.
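
The pipeline described above (clean, tokenize, count n-grams, back off to shorter contexts) can be sketched roughly as follows. This is an illustrative Python sketch, not the project's actual code. The names `PROFANITY`, `CORPUS`, `tokenize`, `build_ngram_tables` and `predict_next` are hypothetical, the profanity list and corpus are toy placeholders, and the lookup is a simplified back-off that omits the discounting and back-off weights of a full Katz model.

```python
import re
from collections import Counter, defaultdict

# Hypothetical placeholder list; the project's real profanity list is not given.
PROFANITY = {"darn", "heck"}

# Toy corpus used only for illustration.
CORPUS = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ate the fish",
    "a dog sat on the mat",
]

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop profanity."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in PROFANITY]

def build_ngram_tables(sentences, max_n=3):
    """For n = 2..max_n, map each (n-1)-word context to counts of the next word."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    unigrams = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        unigrams.update(tokens)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables, unigrams

def predict_next(query, tables, unigrams, max_n=3, default="the"):
    """Try the longest context first, then back off to shorter contexts."""
    tokens = tokenize(query)
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue  # query too short for this n-gram order
        context = tuple(tokens[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    # Unseen context: return the most frequent word overall.
    return unigrams.most_common(1)[0][0] if unigrams else default

tables, unigrams = build_ngram_tables(CORPUS, max_n=3)
print(predict_next("I saw the cat", tables, unigrams))
```

Here the last two words of the query, "the cat", are matched against the 3-gram table first. A query whose last two words were never seen falls back to the 2-gram table, and then to the most frequent word in the corpus.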

The data analysis write-up can be found on RPubs
The Shiny app for prediction can be found here
The app takes in the following inputs:
1. query word/phrase that the user inputs
2. number of next words to predict
The predicted next word(s) are shown in order from most frequently used to least frequently used
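
The input/output behaviour described above (a query plus a requested number of predictions, returned in frequency order) might look like the following. This is a hedged Python sketch of the lookup only, not the Shiny interface. The function `top_predictions` and the `FOLLOWERS` counts are hypothetical and made up for illustration.

```python
from collections import Counter

# Hypothetical toy counts of words observed after a given last word.
FOLLOWERS = {
    "thank": Counter({"you": 12, "god": 3, "goodness": 1}),
    "new": Counter({"york": 9, "year": 7, "car": 2}),
}

def top_predictions(query, k, followers=FOLLOWERS, default="the"):
    """Return up to k candidate next words, most frequent first."""
    words = query.lower().split()
    counts = followers.get(words[-1]) if words else None
    if not counts:
        # Empty query or unseen last word: fall back to the default word.
        return [default]
    return [word for word, _ in counts.most_common(k)]

print(top_predictions("Happy new", 2))
```

If the user asks for more predictions than there are known candidates, the sketch simply returns all the candidates it has.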