2026-05-26
Introduction
- This is the tenth course of the Coursera Data Science Specialization, the Data Science Capstone. The course focuses on analyzing a large corpus of text documents to discover the structure in the data and how words are put together, with the goal of building a predictive text model.
- Contents
  - Text data analysis: analysis of the corpus to understand the relationship of words and word pairs
  - Predictive modeling: build basic n-gram models and develop algorithms to facilitate text prediction
  - Shiny app development: produce a web-based Shiny app interface to predict next words
Modeling
- Getting and cleaning the data: profanity was removed first, then the text was tokenized into words
- Exploratory data analysis: the frequencies of words and word parts were calculated
- Modeling: 2-gram to 4-gram models were built to facilitate word prediction
- Prediction model:
  - Katz’s back-off model was used to predict the next word
  - The model iterates from 3-gram to 2-gram to find matches in the last n-1 words
  - For an unseen n-gram, the most frequent word, ‘the’, is returned
- My computer was limited to 15 GB of memory. With more memory I would have used longer n-grams, which would likely have improved performance.
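The steps in this section can be illustrated with a minimal Python sketch. It is not the code behind the app: the tiny corpus, the placeholder profanity list, and the names `tokenize`, `build_tables`, and `predict_next` are all hypothetical. It also simplifies the method, doing only a longest-match back-off over raw counts and omitting the discounting and back-off weights of a full Katz model.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-ins: a tiny corpus and a placeholder profanity list.
CORPUS = [
    "I want to go to the park",
    "I want to go home",
    "we want to eat darn pizza",
]
PROFANITY = {"darn"}
FALLBACK = "the"  # word returned when no n-gram context matches


def tokenize(line):
    """Lowercase, split into words, and drop profane tokens."""
    words = re.findall(r"[a-z']+", line.lower())
    return [w for w in words if w not in PROFANITY]


def build_tables(lines, max_n=4):
    """For n = 2..max_n, map each (n-1)-word context to counts of the next word."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for line in lines:
        words = tokenize(line)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[n][context][words[i + n - 1]] += 1
    return tables


def predict_next(tables, phrase):
    """Try the longest usable context first, then back off to shorter ones."""
    words = tokenize(phrase)
    for n in sorted(tables, reverse=True):
        if len(words) < n - 1:
            continue  # phrase too short for this n-gram order
        followers = tables[n].get(tuple(words[-(n - 1):]))
        if followers:
            return followers.most_common(1)[0][0]
    return FALLBACK  # unseen context


tables = build_tables(CORPUS)
print(predict_next(tables, "I want to"))  # matched by the 4-gram table
print(predict_next(tables, "zebra"))      # unseen, falls back to "the"
```

Storing one table per n-gram order keeps the lookup simple, but memory use grows quickly with the maximum order, which is the constraint noted in the last bullet.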
Results
- The data analysis write-up can be found on RPubs
- The Shiny app for prediction can be found here
- The app takes in the following inputs:
  - the query word or phrase
  - the number of next words to predict
- The predicted next word(s) are shown in order from most to least frequently used
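As a rough illustration of the two inputs and the ordered output, here is a small Python sketch. It is not the Shiny app's code: the bigram counts and the function name `top_predictions` are hypothetical, and only the last word of the query is used to keep the example short.

```python
from collections import Counter

# Hypothetical bigram counts: previous word -> counts of the words that follow it.
BIGRAMS = {
    "thank": Counter({"you": 12, "god": 3, "goodness": 1}),
    "you": Counter({"are": 7, "can": 5}),
}


def top_predictions(query, k):
    """Return up to k candidate next words, most frequent first."""
    words = query.lower().split()
    if not words or k <= 0:
        return []
    followers = BIGRAMS.get(words[-1])
    if not followers:
        return ["the"]  # unseen context: fall back to the most frequent word
    return [word for word, _ in followers.most_common(k)]


print(top_predictions("Thank", 2))
print(top_predictions("thank you", 5))
```

When fewer than `k` candidates are known, the function simply returns the ones it has, so the result can be shorter than the requested count.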
Reference