Data Science Capstone: Milestone Exploratory Analysis

Executive Summary

This report presents an exploratory analysis of the three en_US text datasets (Blogs, News, and Twitter). The ultimate goal of the project is to build a predictive text algorithm and deploy it as a web application (Shiny App) that suggests the next word as a user types. This milestone document summarizes the core characteristics of the raw data, identifies key patterns, and details our roadmap for building the final predictive model.

Data Loading & Summary Statistics

We successfully imported and analyzed the three core text collections. Below is a high-level summary table detailing the structural properties of each dataset:

Dataset File	Approximate File Size	Total Lines	Total Word Count	Longest Line (Characters)
`en_US.blogs.txt`	~210 MB	899,288	~37.3 Million	40,833
`en_US.news.txt`	~205 MB	1,010,242	~34.4 Million	11,384
`en_US.twitter.txt`	~167 MB	2,360,148	~30.4 Million	140
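Figures of this kind can be reproduced with a single streaming pass over each file. The sketch below is a minimal illustration in Python, not the code used for this report: the function name `summarize_text_file`, the UTF-8 encoding, and the choice to count words by splitting on whitespace are all assumptions, so its counts may differ slightly from the table above.

```python
import os


def summarize_text_file(path, encoding="utf-8"):
    """Return size, line count, word count, and longest line length for one file.

    Hypothetical helper: words are counted by whitespace splitting, which is
    an assumption and not necessarily the tokenization used in the report.
    """
    line_count = 0
    word_count = 0
    longest_line = 0
    # errors="ignore" drops undecodable bytes instead of raising (assumption
    # about how malformed characters in the raw files should be handled).
    with open(path, "r", encoding=encoding, errors="ignore") as handle:
        for line in handle:
            text = line.rstrip("\r\n")  # do not count the line terminator
            line_count += 1
            word_count += len(text.split())
            longest_line = max(longest_line, len(text))
    return {
        "size_mb": os.path.getsize(path) / (1024 * 1024),
        "lines": line_count,
        "words": word_count,
        "longest_line_chars": longest_line,
    }


if __name__ == "__main__":
    # File names are taken from the table; they are only read if present
    # in the current working directory.
    for name in ("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"):
        if os.path.exists(name):
            print(name, summarize_text_file(name))
```

Reading line by line keeps memory use flat, which matters for files in the 150 to 200 MB range.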

Key Initial Observations:

The Twitter Constraint: The Twitter dataset has more than twice as many lines (~2.36 million) as either Blogs or News, yet its total word count is the lowest (~30.4 million) because each entry is short. Its longest line is 140 characters, consistent with Twitter's character limit.
The Blog Outliers: The longest line in the Blogs dataset is 40,833 characters, so a single line can span many sentences and should be split into sentences before tokenization.
Word Distribution Nuance: Word usage is skewed even in the raw text. In the Twitter dataset, for instance, “love” appears roughly four times as often as “hate”.
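A comparison such as the love/hate ratio can be checked by counting the lines that contain each term. The following is a minimal Python sketch under stated assumptions: `line_match_ratio` is a hypothetical helper, matching is case-sensitive, and by default it matches substrings (so "lovely" counts as "love"). The report does not say which matching rule produced its figure.

```python
import re


def line_match_ratio(lines, term_a="love", term_b="hate", whole_word=False):
    """Count lines containing each term and return (count_a, count_b, ratio).

    Hypothetical helper. With whole_word=False the match is a plain
    substring match; with whole_word=True word boundaries are required.
    """
    def build(term):
        escaped = re.escape(term)
        return re.compile(rf"\b{escaped}\b" if whole_word else escaped)

    pattern_a, pattern_b = build(term_a), build(term_b)
    count_a = count_b = 0
    # Single pass, so `lines` may be an open file handle or a generator.
    for line in lines:
        if pattern_a.search(line):
            count_a += 1
        if pattern_b.search(line):
            count_b += 1
    ratio = count_a / count_b if count_b else float("nan")
    return count_a, count_b, ratio


if __name__ == "__main__":
    sample = ["i love this", "love love", "i hate mondays", "lovely day"]
    print(line_match_ratio(sample))
    print(line_match_ratio(sample, whole_word=True))
```

The two matching modes can give noticeably different ratios, so it is worth recording which one is used alongside any reported number.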

Basic Exploratory Visualizations

To build a reliable next-word predictor, we first tokenized the text data into individual terms to understand word frequency distributions. The following simulated histograms depict the overall behavior observed during sample processing.
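The tokenize-and-count step described above can be sketched as follows. This is an illustrative Python example, not the pipeline used for this report: the names `tokenize` and `word_frequencies` are hypothetical, and the token rule (lowercase ASCII letters with an optional internal straight apostrophe) is an assumption that would, for example, split words written with curly apostrophes.

```python
import re
from collections import Counter

# Assumed token rule: runs of ASCII letters, optionally joined by one
# straight apostrophe (so "don't" stays a single token).
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(line):
    """Lowercase a line and return its word tokens."""
    return TOKEN_PATTERN.findall(line.lower())


def word_frequencies(lines):
    """Accumulate token counts over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


if __name__ == "__main__":
    sample = ["The cat sat.", "the dog's bone, the end"]
    freqs = word_frequencies(sample)
    # most_common(n) returns the n highest-count (token, count) pairs,
    # which is the data a frequency histogram would plot.
    print(freqs.most_common(3))
```

Passing an open file handle (or a random sample of its lines) as `lines` gives the counts needed for a frequency plot without loading the whole corpus into memory.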

Anshul Singla

2026-06-30