Executive Summary

The goal of this project is to build a predictive text engine, similar to the technology behind mobile phone keyboards, that anticipates the next word a user wants to type. This milestone report documents the first phase of development: downloading, cleaning, and exploring the foundational text data (the “corpus”).

By analyzing large collections of text from Twitter, blogs, and news feeds, we have characterized the basic structure of the corpus, including its scale and word-frequency patterns. This report outlines our core findings, visualizes word frequencies, and maps out our engineering strategy for the final predictive application.


1. Data Ingestion & Core Summary Statistics

We successfully ingested three distinct text files: blogs, news articles, and tweets. To understand the scale of our data, we performed a baseline evaluation to calculate total file sizes, line counts, and word counts.

Table 1: Structural Summary of Raw Training Datasets
Source File                    File Size (MB)   Total Lines   Total Words
Blogs (en_US.blogs.txt)        200.4            899,288       37,334,131
News (en_US.news.txt)          196.3            1,010,242     34,372,589
Twitter (en_US.twitter.txt)    159.4            2,360,148     30,373,543
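
The baseline evaluation behind Table 1 can be sketched roughly as follows. This is a minimal illustration in Python, not the code used to produce the figures above: the function name summarize_text_file is hypothetical, words are assumed to be whitespace-delimited tokens, and a megabyte is assumed to be 1024 * 1024 bytes. A different tokenizer or size convention would yield somewhat different numbers.

```python
import os


def summarize_text_file(path, encoding="utf-8"):
    """Return file size (MB), line count, and word count for one text file.

    Assumptions (not taken from the report): words are whitespace-delimited
    tokens, and 1 MB = 1024 * 1024 bytes.
    """
    size_mb = os.path.getsize(path) / (1024 * 1024)
    line_count = 0
    word_count = 0
    # Stream the file line by line so large corpora never sit fully in memory.
    with open(path, "r", encoding=encoding, errors="ignore") as handle:
        for line in handle:
            line_count += 1
            word_count += len(line.split())
    return {
        "size_mb": round(size_mb, 1),
        "lines": line_count,
        "words": word_count,
    }


if __name__ == "__main__":
    # File names as listed in Table 1; adjust the paths to the local copies.
    for name in ("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"):
        if os.path.exists(name):
            print(name, summarize_text_file(name))
```

Reading line by line keeps memory use flat regardless of file size, which matters for inputs in the range of 150 to 200 MB each.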