Data Science Capstone Final Project

2025-07-25

Objective

This project was developed as the final capstone for the Johns Hopkins University Data Science Specialization on Coursera. The objective was to design and implement a word prediction application using R and Shiny, applying natural language processing (NLP) techniques to build a functional and interactive tool.

We were provided with a large text corpus from HC Corpora, drawn from blogs, Twitter, and news articles. The corpus covers several languages, but the project required us to work only with the English-language datasets.

The final product is a Shiny web application that takes text input and returns up to four suggestions for the next word. It is meant to speed up typing and to show statistical language modeling working in practice.

Development

The application is powered by NLP models based on:

N-gram language modeling to analyze word sequences
Markov chains to estimate transition probabilities between words
Katz’s back-off algorithm to handle unseen n-grams by reverting to lower-order models
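
As a rough illustration of how these three ideas fit together, here is a minimal sketch in base R. It is not the project's actual code: the toy corpus and the function names ngrams and predict_next are assumptions made for this example. It also implements only a simplified back-off that falls back to shorter contexts. Full Katz back-off additionally discounts observed counts and reweights the lower-order estimates, which is omitted here.

```r
# Toy corpus (assumption); the real project used the HC Corpora English files.
corpus <- c("i love new york", "i love new shoes",
            "new york is big", "i love you")

# Hypothetical helper: all n-grams of size n, as space-separated strings.
ngrams <- function(lines, n) {
  unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

# Frequency tables for orders 1 to 3, most frequent first.
freq <- lapply(1:3, function(n) sort(table(ngrams(corpus, n)), decreasing = TRUE))

# Hypothetical helper: simplified back-off (no discounting, unlike full Katz).
predict_next <- function(text, freq, k = 4) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  for (n in length(freq):2) {                 # try the highest order first
    if (length(words) < n - 1) next           # not enough context for this order
    # Markov assumption: only the last n - 1 words matter.
    prefix <- paste(tail(words, n - 1), collapse = " ")
    tab <- freq[[n]]
    hits <- tab[startsWith(names(tab), paste0(prefix, " "))]
    if (length(hits) > 0) {
      # hits is already sorted by count; keep the final word of each n-gram.
      return(head(sub(".* ", "", names(hits)), k))
    }
  }
  head(names(freq[[1]]), k)                   # last resort: most frequent words
}
```

With this toy corpus, predict_next("i love", freq) ranks "new" first because "i love new" occurs twice. predict_next("brand new", freq) finds no matching trigram, so it backs off to bigrams beginning with "new".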

The model was built in the following steps:

Clean and prepare the data
Perform exploratory analysis
Build n-grams from the corpus
Compute frequencies from the n-grams
Build the prediction model
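
The cleaning and counting steps above might look roughly like the following base R sketch. The cleaning rules, the sample lines, and the function names clean_text and ngram_freq are all assumptions for illustration. The original project may have used different rules or a dedicated text-mining package.

```r
# Hypothetical cleaning rules: lower-case, drop URLs, keep letters and apostrophes.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("https?://[^ ]+", " ", x)   # remove URLs
  x <- gsub("[^a-z' ]", " ", x)         # remove digits, punctuation, symbols
  x <- gsub(" +", " ", x)               # collapse repeated spaces
  trimws(x)
}

# Hypothetical helper: data frame of n-grams with counts, most frequent first.
ngram_freq <- function(lines, n) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  df <- data.frame(ngram = names(counts), freq = as.integer(counts),
                   stringsAsFactors = FALSE)
  df[order(-df$freq, df$ngram), ]
}

# Made-up sample lines standing in for the blog, Twitter, and news files.
raw <- c("Check THIS out: http://example.com !!",
         "Check this out, please.",
         "this is it")
cleaned <- clean_text(raw)
bigrams <- ngram_freq(cleaned, 2)
```

Calling head(bigrams) then gives a quick look at the most frequent pairs, which is also a simple starting point for exploratory analysis.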

The Shiny App

The application predicts the next word in a sentence or phrase.

It provides up to four possible word suggestions, displayed as clickable buttons. When you select one of the suggestions, it is automatically added to your text, and the application continues predicting the next word based on the updated input.
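
The button behaviour described above could be wired up along these lines. This is a hedged sketch using the shiny package, not the deployed app's source: the input IDs, the append_word helper, and the placeholder suggest function (which returns fixed words instead of calling a real model) are assumptions for illustration.

```r
library(shiny)

# Hypothetical helper: add the chosen word to the end of the current text.
append_word <- function(text, word) {
  trimws(paste(trimws(text), word))
}

# Placeholder predictor (assumption); a real app would query the n-gram model.
suggest <- function(text) c("the", "and", "to", "of")

ui <- fluidPage(
  textInput("text", "Enter text"),
  uiOutput("buttons")                      # suggestion buttons are rendered here
)

server <- function(input, output, session) {
  # Recompute up to four suggestions whenever the text changes.
  words <- reactive(head(suggest(input$text), 4))

  # Draw one button per suggestion.
  output$buttons <- renderUI({
    lapply(seq_along(words()), function(i) {
      actionButton(paste0("word", i), words()[i])
    })
  })

  # Clicking a button appends its word; the text update triggers new suggestions.
  lapply(1:4, function(i) {
    observeEvent(input[[paste0("word", i)]], {
      updateTextInput(session, "text",
                      value = append_word(input$text, words()[i]))
    })
  })
}

app <- shinyApp(ui, server)
```

In an interactive R session, runApp(app) starts the sketch locally.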

Appendix

Natural Language Processing

N-grams

Markov Model

Katz’s back-off model

GitHub Repository

Shiny App