Data Science Capstone: Milestone Report

Executive Summary

The goal of this project is to build a predictive text engine, similar to the technology behind mobile phone keyboards, that anticipates the next word a user wants to type. This milestone report documents the first phase of development: downloading, cleaning, and exploring the foundational text data (the “corpus”).

By analyzing large collections of text from Twitter, blogs, and news feeds, we have characterized the size and structure of the corpus. This report outlines our core findings, presents visualizations of word frequencies, and lays out our engineering strategy for the final predictive application.

1. Data Ingestion & Core Summary Statistics

We ingested three text files containing blogs, news articles, and tweets. To gauge the scale of the data, we computed a baseline summary of each file: its size on disk, its line count, and its word count.

Table 1: Structural Summary of Raw Training Datasets
Source File	File Size (MB)	Total Lines	Total Words
Blogs (en_US.blogs.txt)	200.4	899,288	37,334,131
News (en_US.news.txt)	196.3	1,010,242	34,372,589
Twitter (en_US.twitter.txt)	159.4	2,360,148	30,373,543
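
The kind of baseline summary reported in Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the table: the function name summarize_text_file is hypothetical, and counting words by splitting on whitespace is an assumption, since the report does not state which tokenizer was used. A different tokenizer would give somewhat different word totals.

```python
import os


def summarize_text_file(path, encoding="utf-8"):
    """Return size in MB, line count, and word count for one text file.

    Assumption: a "word" is any whitespace-delimited token.
    """
    # Size on disk, converted from bytes to megabytes (1 MB = 1024 * 1024 bytes).
    size_mb = os.path.getsize(path) / (1024 ** 2)

    lines = 0
    words = 0
    # Stream the file line by line so large corpora never sit fully in memory.
    # Undecodable bytes are skipped rather than raising an error.
    with open(path, "r", encoding=encoding, errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())

    return {"size_mb": round(size_mb, 1), "lines": lines, "words": words}
```

Calling summarize_text_file on each of the three files named in Table 1 (for example, on a local copy of en_US.blogs.txt) would return one dictionary per source, which can then be arranged into a table of the same shape.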

Suddula Jeevan Sagar

2026-05-30