This milestone report documents the exploratory data analysis of the HC Corpora text datasets. The goal is to build a predictive text application, similar to smartphone keyboard suggestions, that predicts the next word a user will type.
The three source corpora used are:
| Source | Description |
|---|---|
| Blogs | Personal blog entries |
| News | News articles |
| Twitter | Short-form social-media posts |
Line counts: Blogs 899288, News 1010206, Twitter 2360148.
| File | Lines | Words | Size (MB) |
|---|---|---|---|
| Blogs | 899288 | 37334131 | 210.2 |
| News | 1010206 | 34371031 | 205.8 |
| Twitter | 2360148 | 30373583 | 167.1 |
Twitter has the most lines (~2.36 million), but each entry is very short due to the character limit, making it more fragmented than the other datasets.
Blogs and News contain fewer lines, but each line is much longer and more information-dense.
The combined dataset is about 583 MB, large enough to support a robust text prediction model.
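As a rough illustration of how summary figures like those in the table can be produced, the sketch below counts lines, whitespace-delimited words, and file size for one text file. It is written in Python purely for illustration and is not the report's own pipeline; the function name `corpus_stats`, the whitespace-based word definition, and the choice of 1 MB = 1024 × 1024 bytes are all assumptions, so its word counts may differ from the table's if a different tokeniser was used.

```python
import os
import tempfile


def corpus_stats(path):
    """Return (lines, words, size_mb) for a UTF-8 text file.

    Words are counted by splitting on whitespace (an assumed definition).
    """
    lines = 0
    words = 0
    # Stream the file so that large corpora need not fit in memory.
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            lines += 1
            words += len(line.split())
    size_mb = os.path.getsize(path) / (1024 * 1024)  # assumed MB convention
    return lines, words, size_mb


# Tiny demonstration on a throwaway file (made-up content, not corpus data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("I am going to the store\nsee you soon\n")
    demo_path = tmp.name

demo_lines, demo_words, demo_mb = corpus_stats(demo_path)
os.remove(demo_path)
```

Reading line by line keeps memory use flat, which matters for files of a few hundred megabytes.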
| Source | Mean words per line | Median words per line |
|---|---|---|
| Blogs | 43.0 | 29 |
| News | 34.2 | 31 |
| Twitter | 12.9 | 12 |
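The mean and median in this table can be computed from per-line word counts. The following is a minimal Python sketch under the assumption that a word is any whitespace-delimited token and that blank lines are skipped; `words_per_line_summary` is a hypothetical helper name, and the sample lines are made up.

```python
from statistics import mean, median


def words_per_line_summary(lines):
    """Return (mean rounded to 1 decimal, median) of words per non-blank line."""
    counts = [len(line.split()) for line in lines if line.strip()]
    return round(mean(counts), 1), median(counts)


# Made-up sample lines, for illustration only.
sample = ["I am going to the store", "see you soon", "thanks"]
summary = words_per_line_summary(sample)
```

A mean well above the median, as in the Blogs row, indicates a long right tail: a small number of very long entries pull the average up.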
Figure 1 — Words-per-line distribution (top) and top-20 most frequent words by corpus (bottom)
Key findings:
The next-word prediction system will follow these steps:
Step 1 — Data Cleaning
Step 2 — Tokenisation
Step 3 — N-gram Construction
Step 4 — Prediction Logic
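The four steps above can be sketched end to end as follows. This is a minimal illustration in Python, not the planned implementation: the cleaning rules (lowercasing, keeping only letters and apostrophes), the whitespace tokeniser, the maximum n-gram order of 3, and a plain count-based back-off with no smoothing or discounting are all assumptions, as are the function names and the toy training lines.

```python
import re
from collections import Counter, defaultdict


def clean(text):
    """Step 1: lowercase; keep letters, apostrophes, spaces (assumed rules)."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenise(text):
    """Step 2: split the cleaned text on whitespace."""
    return clean(text).split()


def build_ngrams(lines, max_n=3):
    """Step 3: for each order n, map a context of n-1 words to next-word counts."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenise(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables


def predict(tables, text, k=3):
    """Step 4: try the longest context first, then back off to shorter ones."""
    tokens = tokenise(text)
    for n in range(max(tables), 0, -1):
        if len(tokens) < n - 1:
            continue  # not enough typed words for this order
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        candidates = tables[n].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []


# Toy training lines, made up for illustration.
toy_corpus = [
    "I am going to the store",
    "I am going to the park",
    "We went to the store",
    "going to the store now",
]
tables = build_ngrams(toy_corpus)
seen_context = predict(tables, "I am going to the")   # uses the trigram table
backed_off = predict(tables, "over the")              # falls back to bigrams
```

Storing counts keyed by context makes each lookup a dictionary access, which suits an interactive app; a fuller model would typically add smoothing and prune rare n-grams to save memory.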
Next Word Predictor
Type your text:
┌─────────────────────────────────────────┐
│ I am going to the ... │
└─────────────────────────────────────────┘
Suggested next words:
[ store ] [ park ] [ gym ] [ beach ]
─────────────────────────────────────────
Top Predictions (with confidence):
1. store — 42 %
2. park — 18 %
3. gym — 12 %
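One simple way the confidence figures in the mock-up could be derived is as each candidate's share of all next-word counts observed for the current context. The Python sketch below assumes that definition; the helper name `with_confidence` and the counts are hypothetical, chosen only so the output matches the mock-up's percentages.

```python
from collections import Counter


def with_confidence(candidates, k=3):
    """Return the top-k (word, percent) pairs, percent = share of all counts."""
    total = sum(candidates.values())
    if total == 0:
        return []
    return [
        (word, round(100 * count / total))
        for word, count in candidates.most_common(k)
    ]


# Hypothetical counts for the context "to the" (not measured from the corpora).
example_counts = Counter(
    {"store": 42, "park": 18, "gym": 12, "beach": 10, "mall": 9, "office": 9}
)
top_three = with_confidence(example_counts)
```

Because the share is taken over all candidates for the context, the displayed percentages need not sum to 100 when only the top few are shown.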
Key Shiny app features:
This report demonstrates successful loading and exploratory analysis of the HC Corpora datasets. The data reveals distinct writing styles across sources — particularly Twitter’s brevity versus Blogs’ longer format. A back-off n-gram model and interactive Shiny app are planned to deliver efficient, real-time next-word prediction.