Executive Summary

This milestone report presents an exploratory data analysis of the Capstone English-language dataset (en_US). The main goal is to understand the structure of three distinct text sources: blogs, news articles, and Twitter feeds.

By inspecting basic summary metrics and word distributions, I establish a baseline for building a predictive text algorithm (a next-word prediction engine) and deploying it through an interactive Shiny application.

1. Raw Dataset Summary Statistics

Before any text processing, the raw text files were assessed to determine their file sizes, line counts, and word counts.

Table 1: File Properties of the Raw Datasets

File               Size (MB)    Lines      Words
en_US.blogs.txt       200.42   899288   37546806
en_US.news.txt        196.28  1010206   34761151
en_US.twitter.txt     159.36  2360148   30096690
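
The report does not show how these figures were computed. As a rough illustration only, the Python sketch below gathers the same three properties for one file. The function name summarize_text_file is hypothetical, and counting words by splitting on whitespace is an assumption, so its totals may differ from the ones in Table 1 if a different word-counting rule was used.

```python
import os


def summarize_text_file(path):
    """Return file size in MB, line count, and whitespace-delimited word count.

    Assumption: a "word" is any run of non-whitespace characters.
    """
    size_mb = os.path.getsize(path) / (1024 ** 2)
    total_lines = 0
    total_words = 0
    # Stream the file line by line so large files never sit fully in memory.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            total_lines += 1
            total_words += len(line.split())
    return {
        "size_mb": round(size_mb, 2),
        "lines": total_lines,
        "words": total_words,
    }
```

Calling summarize_text_file once per source file and collecting the results yields one row per file, in the same shape as the table above.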

Key Analytical Takeaways:

Text Mining and Word Frequency Analysis

Top 15 Most Common Words Observed

Observation:

Word frequencies are heavily dominated by common stop words (such as “the”, “and”, and “to”). While standard text-mining pipelines often filter these out, I retain them for the predictive typing engine, because users type these words frequently and the engine must be able to predict them.
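
To make the point about stop words concrete, here is a minimal Python sketch of a word-frequency count that deliberately applies no stop-word filter. The tokenization rule (lowercase letters with an optional internal apostrophe) and the names tokenize and top_words are assumptions for illustration, not the method used in this report.

```python
import re
from collections import Counter

# Assumed token rule: lowercase letters, optionally with one internal apostrophe.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(line):
    """Lowercase a line and split it into word tokens."""
    return TOKEN_PATTERN.findall(line.lower())


def top_words(lines, n=15):
    """Count every token, stop words included, and return the n most common."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts.most_common(n)
```

Because nothing is filtered, words such as “the” and “to” stay in the counts and remain available as prediction candidates.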

Investigating Word Combinations (Bigrams)
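
As a hedged illustration of what bigram analysis involves, the Python sketch below counts adjacent word pairs within each line and groups them by their first word. All names here (bigrams, count_bigrams, next_word_table) are hypothetical, and the tokenization rule is an assumption rather than the one used in this report.

```python
import re
from collections import Counter, defaultdict

# Assumed token rule: lowercase letters, optionally with one internal apostrophe.
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")


def bigrams(line):
    """Return the adjacent word pairs found in a single line."""
    tokens = WORD.findall(line.lower())
    return list(zip(tokens, tokens[1:]))


def count_bigrams(lines):
    """Count bigrams across all lines; pairs never span a line boundary."""
    counts = Counter()
    for line in lines:
        counts.update(bigrams(line))
    return counts


def next_word_table(bigram_counts):
    """Group bigram counts by first word, giving candidate next words."""
    table = defaultdict(Counter)
    for (first, second), count in bigram_counts.items():
        table[first][second] = count
    return table
```

Looking up a word in the table returned by next_word_table and taking its most common follower gives a very simple next-word guess, which is the basic idea a prediction engine builds on.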

Strategic Engineering Plan for the Prediction Algorithm & Shiny App

Moving forward, the engineering work is organized into two phases:

Phase 1: Predictive Engine Design:

Phase 2: User Interface (Shiny App Product):