06 December 2025

Data Science Capstone Assignment

This project focuses on developing a Next Word Prediction App using Natural Language Processing (NLP) techniques in R.

The application is built using Shiny and the report is presented through R Markdown using a slide-based layout.

Application Overview

  • The app predicts the most likely next word as the user types.
  • It uses N-gram language modeling with the Stupid Backoff algorithm.
  • The models are built from text data from Blogs, News, and Twitter.
  • It is deployed online as an interactive Shiny application.
  • It aims to improve the typing experience, much like word suggestions on mobile keyboards.
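
To make the N-gram and Stupid Backoff idea concrete, here is a minimal sketch in base R. It is not the app's actual code: the toy counts, the function name predict_next, and the backoff factor alpha = 0.4 (the value commonly used with Stupid Backoff) are assumptions for illustration. Candidates are scored by their relative frequency after the longest matching context, and each step down to a shorter context multiplies the score by alpha.

```r
# Toy n-gram counts (hypothetical); the real app would derive these
# from the sampled Blogs, News, and Twitter text.
unigrams <- c(the = 5, cat = 2, sat = 2, on = 1, mat = 1)
bigrams  <- c("the cat" = 2, "cat sat" = 2, "sat on" = 1,
              "on the" = 1, "the mat" = 1)
trigrams <- c("the cat sat" = 2, "cat sat on" = 1,
              "sat on the" = 1, "on the mat" = 1)
ngram_counts <- list(unigrams, bigrams, trigrams)  # index = n-gram order

predict_next <- function(text, ngram_counts, k = 3, alpha = 0.4) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  # Highest order usable given how many words of context we have.
  start <- min(length(ngram_counts), length(words) + 1)
  scores <- numeric(0)
  penalty <- 1
  for (order in start:1) {
    counts <- ngram_counts[[order]]
    ctx_len <- order - 1
    if (ctx_len == 0) {
      # Unigram level: plain relative frequency.
      cand <- counts / sum(counts)
    } else {
      context <- paste(tail(words, ctx_len), collapse = " ")
      prefix <- paste0(context, " ")
      hits <- counts[startsWith(names(counts), prefix)]
      # count(context + word) / count(context)
      cand <- hits / unname(ngram_counts[[ctx_len]][context])
      names(cand) <- substring(names(hits), nchar(prefix) + 1)
    }
    # Keep only words not already scored at a higher order.
    new <- cand[!(names(cand) %in% names(scores))]
    scores <- c(scores, penalty * new)
    penalty <- penalty * alpha  # discount applied per backoff step
  }
  head(sort(scores, decreasing = TRUE), k)
}

# With the toy counts, "sat" ranks first after "the cat".
predict_next("the cat", ngram_counts)
```

The scores are relative frequencies rather than true probabilities, which is what keeps the method cheap enough for an interactive app.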

Slide with Complete Reporting Application

Data Used

The data used for this project comes from the HC Corpora dataset, which includes text from:

  • Blogs
  • News articles
  • Twitter posts

The original dataset is very large (over 500 MB). To keep processing time and memory use manageable:

  • A small random sample (2%) was taken from each source
  • Profanity and unwanted characters were removed
  • Data was tokenized to build N-gram language models

library(quanteda)
# corpus_data is assumed to be the quanteda corpus built from the sampled text
summary(corpus_data)
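
The preparation steps listed above (sampling, cleaning, tokenizing into N-grams) can be sketched as follows. This is an illustrative version in base R so that it runs without extra packages; the project itself loads quanteda, and the names raw_lines, bad_words, sample_lines, clean_text, tokenize, and count_ngrams are hypothetical, as are the example sentences and the one-word profanity list.

```r
set.seed(1234)  # make the random sample reproducible

# Hypothetical stand-in for lines read from the Blogs, News, and Twitter files.
raw_lines <- c(
  "The cat sat on the mat!",
  "Visit http://example.com for MORE news...",
  "The cat sat on the sofa #cats"
)

# Hypothetical placeholder; a real app would load a full profanity list.
bad_words <- c("darn")

# Draw a random fraction of the lines (2% by default, at least one line).
sample_lines <- function(lines, fraction = 0.02) {
  n <- max(1, round(length(lines) * fraction))
  sample(lines, n)
}

# Lowercase, drop URLs, and keep only letters, apostrophes, and spaces.
clean_text <- function(lines) {
  x <- tolower(lines)
  x <- gsub("http[^ ]*", " ", x)
  x <- gsub("[^a-z' ]", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Split one cleaned line into words, dropping any word on the stop list.
tokenize <- function(line, stop_list = character(0)) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  words[nzchar(words) & !(words %in% stop_list)]
}

# Build the n-grams (as space-separated strings) for one word vector.
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams across all lines, most frequent first.
count_ngrams <- function(lines, n, stop_list = character(0)) {
  grams <- unlist(lapply(lines, function(l) ngrams(tokenize(l, stop_list), n)))
  sort(table(grams), decreasing = TRUE)
}

cleaned <- clean_text(raw_lines)
bigram_counts <- count_ngrams(cleaned, 2, bad_words)
head(bigram_counts)
```

On the real corpus, sample_lines would be applied to each source before cleaning; the toy input here is too small for a 2% sample to be meaningful, so the demo counts bigrams over all three lines.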