Executive Summary

This milestone report presents an exploratory data analysis of the Capstone English-language dataset (en_US). The main goal is to analyze the underlying structure of three distinct text sources: blogs, news articles, and Twitter feeds.

By inspecting basic summary metrics and word distributions, I establish a clean baseline for building a predictive text algorithm (Next-Word Prediction Engine) and deploying it through an interactive Shiny application.

Raw Dataset Summary Statistics

Before any text manipulation, the raw text files were assessed to determine their storage size, line count, and word count.

Table 1: Structural File Property Analytics Overview
File_Source	File_Size_MB	Total_Lines	Total_Words
en_US.blogs.txt	200.42	899288	37546806
en_US.news.txt	196.28	1010206	34761151
en_US.twitter.txt	159.36	2360148	30096690
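
The three metrics in Table 1 could be gathered with a single pass over each file. The following is a minimal sketch in Python, used here purely for illustration since the report does not show its own code. The function name `file_summary` is hypothetical, and words are assumed to be whitespace-delimited, so the counts may differ slightly from those in Table 1 if a different tokenizer was used.

```python
import os

def file_summary(path):
    """Return size in MB, line count, and whitespace-delimited word count."""
    size_mb = os.path.getsize(path) / (1024 ** 2)
    n_lines = 0
    n_words = 0
    # Stream the file line by line so a ~200 MB file never sits fully in memory.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            n_lines += 1
            n_words += len(line.split())
    return {"size_mb": round(size_mb, 2), "lines": n_lines, "words": n_words}
```

Calling `file_summary("en_US.blogs.txt")` on a local copy of the dataset would return one row of the table as a dictionary.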

Key Analytical Takeaways:

Storage Footprint: The three files total over 550 MB of raw, unstructured text, so the analysis works on a down-sampled subset to keep text mining stable.
Length Constraints: Blog entries are the longest on average (about 42 words per line), followed by news articles (about 34). Tweets are capped by Twitter's character limit, so the Twitter file has the most lines but the fewest words per line (about 13).
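
Both takeaways can be checked with a few lines of code. The sketch below derives average words per line from the Table 1 figures and shows one possible down-sampling approach. The helper `sample_lines`, the 5% fraction, and the fixed seed are assumptions for illustration, not the settings used in this report.

```python
import random

# (total lines, total words) copied from Table 1.
TABLE_1 = {
    "en_US.blogs.txt": (899288, 37546806),
    "en_US.news.txt": (1010206, 34761151),
    "en_US.twitter.txt": (2360148, 30096690),
}

# Average words per line for each source.
words_per_line = {
    name: words / lines for name, (lines, words) in TABLE_1.items()
}

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    A fixed seed makes the sample reproducible across runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

if __name__ == "__main__":
    for name, value in words_per_line.items():
        print(f"{name}: {value:.2f} words per line")
    demo = [f"line {i}" for i in range(1000)]
    print(len(sample_lines(demo, 0.05)), "of 1000 lines kept")
```

Sampling line by line keeps memory use flat, because the full file never has to be loaded before it is reduced.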

Text Mining and Word Frequency Analysis

Top 15 Most Common Words Observed

Observation:

The most frequent tokens are overwhelmingly stop words (such as “the”, “and”, and “to”). Standard text-mining pipelines usually filter these out, but I retain them for the predictive typing engine because users type these words and their combinations constantly.
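
A minimal sketch of a word-frequency count that deliberately keeps stop words is shown below. The tokenization rule (lowercase letters with optional internal apostrophes) and the name `top_words` are assumptions for illustration, not the cleaning rules used in this report.

```python
import re
from collections import Counter

# Lowercase words, allowing internal apostrophes such as "don't".
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")

def top_words(lines, n=15):
    """Return the n most frequent words; stop words are NOT filtered out."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_PATTERN.findall(line.lower()))
    return counts.most_common(n)
```

Because no stop-word list is applied, words such as “the” stay at the top of the ranking, which is what a next-word predictor needs.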

Investigating Word Combinations (Bigrams)
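
As a minimal sketch of how adjacent word pairs could be counted, assuming simple lowercase, whitespace-based tokenization (the function name `count_bigrams` is hypothetical):

```python
from collections import Counter

def count_bigrams(lines):
    """Count adjacent word pairs within each line.

    Pairs never span two lines, so the last word of one blog post or
    tweet is not paired with the first word of the next.
    """
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts
```

Calling `count_bigrams(lines).most_common(15)` on the sampled text would list the most frequent pairs in descending order.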

Strategic Engineering Plan for the Prediction Algorithm & Shiny App

Moving forward into production deployment, the engineering architecture is structured across two phases:

Phase 1: Predictive Engine Design:

N-Gram Back-Off Modeling: Build frequency-sorted lookup tables for quadgrams (4 words), trigrams (3 words), and bigrams (2 words).
Execution Path: When a user enters text, the algorithm checks the final 3 words against the quadgram table. If no match exists, it “backs off” to the last 2 words in the trigram table, and then to the last word in the bigram table.
Optimization: N-grams with low occurrence counts will be pruned to reduce the model file size, with a target application response latency below 100 milliseconds.
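
The three Phase 1 steps can be sketched together as follows. This is plain back-off with no discounting or score weighting, written under stated assumptions: the names `build_tables` and `predict`, the `min_count` pruning threshold, and whitespace tokenization are all illustrative, not the final design.

```python
from collections import Counter, defaultdict

def build_tables(lines, max_order=4, min_count=1):
    """Map each (n-1)-word context to its next words, most frequent first."""
    counts = {n: Counter() for n in range(2, max_order + 1)}
    for line in lines:
        tokens = line.lower().split()
        for n in counts:
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    tables = {}
    for n, counter in counts.items():
        table = defaultdict(list)
        # most_common() yields n-grams in descending count order, so each
        # context's candidate list ends up sorted by frequency.
        for ngram, count in counter.most_common():
            if count >= min_count:  # prune rare n-grams
                table[ngram[:-1]].append(ngram[-1])
        tables[n] = dict(table)
    return tables

def predict(tables, text, k=3):
    """Return up to k next-word candidates, backing off to shorter contexts."""
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):  # quadgrams first, bigrams last
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough typed words for this order
        candidates = tables[n].get(context)
        if candidates:
            return candidates[:k]
    return []
```

Storing only the sorted candidate words per context, rather than the raw counts, keeps each lookup to a single dictionary access, which helps with the latency target.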

Phase 2: User Interface (Shiny App Product):

Input Interface: A simple text box where a non-technical user, such as a manager, can type naturally.
Reactive Output: The app backend reacts to each keystroke and shows the top three predicted next words as selectable buttons.

Data Science Capstone: Milestone Report

Angel

26 May 2026