The goal of this project is to build a predictive text engine, similar to the technology behind mobile phone keyboards, that anticipates the next word a user wants to type. This milestone report documents the first phase of development: downloading, cleaning, and exploring the foundational text data (the “corpus”).
By analyzing large collections of English text from Twitter, blogs, and news feeds, we identify the word-frequency patterns that the prediction model will rely on. This report outlines our core findings, visualizes word frequencies, and lays out our engineering strategy for the final predictive application.
We ingested three text files: blogs, news articles, and tweets. To understand the scale of the data, we computed the size, line count, and word count of each file.
| Source File | File Size (MB) | Total Lines | Total Words |
|---|---|---|---|
| Blogs (en_US.blogs.txt) | 200.4 | 899,288 | 37,334,131 |
| News (en_US.news.txt) | 196.3 | 1,010,242 | 34,372,589 |
| Twitter (en_US.twitter.txt) | 159.4 | 2,360,148 | 30,373,543 |
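
The baseline statistics above could be computed along the following lines. This is a minimal sketch in Python, not the code used to produce the table: the helper name `summarize_text_file`, the use of 1 MB = 1024² bytes, and the definition of a word as a whitespace-delimited token are all assumptions, so the resulting counts may differ slightly from the reported figures.

```python
import os
from typing import NamedTuple


class FileStats(NamedTuple):
    size_mb: float
    lines: int
    words: int


def summarize_text_file(path: str, encoding: str = "utf-8") -> FileStats:
    """Return size in MB, line count, and whitespace-delimited word count.

    Hypothetical helper: assumes 1 MB = 1024**2 bytes and that a "word"
    is any run of non-whitespace characters.
    """
    size_mb = os.path.getsize(path) / (1024 ** 2)
    lines = 0
    words = 0
    # Stream line by line so the whole file never has to fit in memory.
    # Undecodable bytes are replaced rather than raising an error.
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
    return FileStats(round(size_mb, 1), lines, words)


if __name__ == "__main__":
    # File names are taken from the table; files that are not present
    # in the working directory are skipped.
    for name in ("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"):
        if os.path.exists(name):
            stats = summarize_text_file(name)
            print(f"{name}: {stats.size_mb} MB, "
                  f"{stats.lines:,} lines, {stats.words:,} words")
```

Reading the files as a stream keeps memory use flat even for inputs of several hundred megabytes, which matters more here than raw speed.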