This report provides an exploratory analysis of the
en_US natural language processing datasets (Blogs, News,
and Twitter). The ultimate goal of this project is to build a smart,
predictive text algorithm and deploy it as a user-friendly web
application (Shiny App) that suggests the next word as a user types.
This milestone document highlights the core characteristics of the raw
data, identifies key patterns, and details our roadmap for building the
final predictive model.
We imported the three text collections and computed basic structural statistics for each. The table below summarizes file size, line count, word count, and longest line per dataset:
| Dataset File | Approximate File Size | Total Lines | Total Word Count | Longest Line (Characters) |
|---|---|---|---|---|
| en_US.blogs.txt | ~210 MB | 899,288 | ~37.3 Million | 40,833 |
| en_US.news.txt | ~205 MB | 1,010,242 | ~34.4 Million | 11,384 |
| en_US.twitter.txt | ~167 MB | 2,360,148 | ~30.4 Million | 140 |
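As a rough illustration of how statistics like those in the table can be gathered, here is a minimal Python sketch. It is not the code used for this report: the function name `summarize_text_file` is hypothetical, and counting words by splitting on whitespace is an assumption, so its totals may differ slightly from the figures above.

```python
import os


def summarize_text_file(path, encoding="utf-8"):
    """Return file size (MB), line count, word count, and longest line length.

    Words are counted by splitting on whitespace, which is an assumption;
    other tokenizers will give somewhat different totals.
    """
    line_count = 0
    word_count = 0
    longest_line = 0
    # Stream the file line by line so large corpora never sit fully in memory.
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            text = line.rstrip("\r\n")  # exclude the line terminator from the length
            line_count += 1
            word_count += len(text.split())
            longest_line = max(longest_line, len(text))
    return {
        "size_mb": os.path.getsize(path) / 1_000_000,
        "lines": line_count,
        "words": word_count,
        "longest_line_chars": longest_line,
    }
```

Calling `summarize_text_file("en_US.twitter.txt")` on a local copy of the file would return one dictionary per dataset, which can then be assembled into a table like the one above.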
To build a reliable next-word predictor, we first tokenized the text data into individual terms to understand word frequency distributions. The following simulated histograms depict the overall behavior observed during sample processing.
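The tokenization and frequency-counting step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the report's actual pipeline: the regular expression (lowercase letters with optional internal apostrophes) and the helper names `tokenize` and `word_frequencies` are hypothetical choices made for the example.

```python
import re
from collections import Counter

# Assumed token definition: runs of letters, allowing internal apostrophes
# so that contractions such as "don't" stay in one piece.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(line):
    """Lowercase a line and split it into word tokens."""
    return TOKEN_PATTERN.findall(line.lower())


def word_frequencies(lines):
    """Count how often each token occurs across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


# Tiny made-up sample, used only to show the shape of the result.
sample = ["The cat sat on the mat.", "Don't the cats sit?"]
frequencies = word_frequencies(sample)
print(frequencies.most_common(3))
```

The resulting `Counter` maps each term to its count, and `most_common(n)` gives the ranked list that a word-frequency histogram would plot.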