Introduction

This milestone report documents the exploratory data analysis of the HC Corpora text datasets. The end goal is to build a predictive text application, similar to smartphone keyboard suggestions, that predicts the next word a user will type.

The three source corpora used are:

Source	Description
Blogs	Personal blog entries
News	News articles
Twitter	Short-form social-media posts

Loading the Data

## Blogs lines: 899288

## News lines: 1010206

## Twitter lines: 2360148

File Summary Statistics

File Summary Table
File	Lines	Words	Size (MB)
Blogs	899288	37334131	210.2
News	1010206	34371031	205.8
Twitter	2360148	30373583	167.1
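The counts in a table like this can be gathered in a single streaming pass per file. The following is a minimal Python sketch, not the code used for this report: the function name `summarise_file` is hypothetical, words are assumed to be whitespace-delimited, and MB is assumed to mean 1024 × 1024 bytes. A small temporary file stands in for a real corpus file.

```python
import os
import tempfile

def summarise_file(path):
    """Return (lines, words, size_mb) for a UTF-8 text file, read line by line."""
    lines = 0
    words = 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())  # assumption: whitespace-delimited words
    size_mb = os.path.getsize(path) / (1024 ** 2)  # assumption: 1 MB = 2**20 bytes
    return lines, words, size_mb

# Small temporary file standing in for a corpus file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the quick brown fox\njumps over\n")
    demo_path = tmp.name

demo_lines, demo_words, demo_mb = summarise_file(demo_path)
os.remove(demo_path)
print(f"lines={demo_lines} words={demo_words} size_mb={demo_mb:.6f}")
```

Reading line by line keeps memory use flat, which matters for files of around 200 MB each.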

Key Observations

Twitter has the most lines (~2.36 million), but each entry is very short due to the character limit, making it more fragmented than the other datasets.
Blogs and News contain fewer lines, but each line is much longer and more information-dense.
The combined dataset is about 583 MB, large enough to support a robust text prediction model.

Exploratory Data Analysis

Words Per Line Distribution

Average & Median Words Per Line
Source	Mean	Median
Blogs	43.0	29
News	34.2	31
Twitter	12.9	12
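The gap between mean and median is what signals skew: a few very long lines pull the mean up while the median stays put, as in the Blogs row. A minimal Python sketch of the calculation, using made-up lines and a hypothetical helper name, could look like this:

```python
from statistics import mean, median

def words_per_line_stats(lines):
    """Return (mean, median) of whitespace-delimited word counts; needs at least one line."""
    counts = [len(line.split()) for line in lines]
    return mean(counts), median(counts)

# Made-up lines with 2, 3, 3, 4 and 28 words: one long line skews the mean.
demo_lines = [" ".join(["w"] * n) for n in (2, 3, 3, 4, 28)]
demo_mean, demo_median = words_per_line_stats(demo_lines)
print(demo_mean, demo_median)  # mean 8, median 3
```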

Figure 1 — Words-per-line distribution (top) and top-20 most frequent words by corpus (bottom)

Key findings:

Twitter lines are shortest: tightly clustered below 20 words, reflecting the short-form nature of social media posts.
Blogs are the most verbose — right-skewed distribution with many lines exceeding 60 words.
Stop-words dominate everywhere: “the”, “and”, “to” top all three corpora. They would have to be removed for topic-style analysis, but they are kept for n-gram modelling because next-word prediction depends on them.
“I” is unusually prominent in Twitter — reflects the first-person, conversational register of social media.
“said” is a News marker — journalism’s reliance on attributed quotations pushes it into the top 20.
News has the richest vocabulary — most unique words in the sample, owing to varied subject matter and formal register.
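Both the top-20 lists and the vocabulary comparison come down to one frequency table per corpus. The sketch below is a Python illustration under simple assumptions (lower-casing, words made of letters and apostrophes); the names `WORD_RE` and `word_frequencies` are hypothetical and the two input lines are made up.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")  # assumption: a word is a run of letters/apostrophes

def word_frequencies(lines):
    """Count lower-cased words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts

demo = ["The cat and the dog", "He said the cat ran"]
freq = word_frequencies(demo)
top_two = freq.most_common(2)   # most frequent words first
vocabulary_size = len(freq)     # number of distinct words
print(top_two, vocabulary_size)
```

Because vocabulary size grows with sample size, comparing it across corpora is only fair when the samples contain a similar number of words.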

Future Plan

Text Prediction Algorithm

The next word prediction system will follow these steps:

Step 1 — Data Cleaning

Remove non-ASCII characters, URLs, email addresses, and numbers
Convert to lowercase; strip punctuation (retain sentence boundaries)
Remove profanity using a blocklist
Sample a representative subset if memory is a constraint
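A rough Python sketch of these cleaning rules follows. The regular expressions are simplified assumptions rather than a vetted specification, the helper name `clean_line` is hypothetical, and the one-word blocklist is a harmless placeholder for a real profanity list. URLs and e-mail addresses are removed first, since stripping digits or punctuation earlier would break their patterns.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")
DIGIT_RE = re.compile(r"\d+")
# Keep letters, apostrophes, whitespace and sentence-ending marks; drop the rest.
PUNCT_RE = re.compile(r"[^a-z'\s.!?]")
BLOCKLIST = {"darn"}  # placeholder entry, not a real profanity list

def clean_line(text, blocklist=BLOCKLIST):
    """Apply the cleaning rules in order and return a normalised line."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    text = text.lower()
    text = PUNCT_RE.sub(" ", text)
    # Compare each word without trailing marks so "darn!" is also caught.
    words = [w for w in text.split() if w.strip(".!?'") not in blocklist]
    return " ".join(words)

print(clean_line("Visit https://example.com NOW, email me@site.org: 3 darn coffees!"))
```

Note that replacing non-ASCII characters with a space splits accented words in two; transliterating them first would be an alternative if that turns out to matter.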

Step 2 — Tokenisation

Split cleaned text into individual tokens (words)
Stop-words are kept for prediction (context matters for next-word suggestions)
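One way to tokenise while respecting the sentence boundaries retained in Step 1 is to split on sentence-ending marks first and then extract words from each piece. This Python sketch assumes already-cleaned input; `tokenise` and both patterns are illustrative names, not part of the report's code.

```python
import re

SENTENCE_END_RE = re.compile(r"[.!?]+")
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")  # words, allowing inner apostrophes

def tokenise(text):
    """Return one list of tokens per sentence; stop-words are deliberately kept."""
    sentences = []
    for chunk in SENTENCE_END_RE.split(text.lower()):
        tokens = TOKEN_RE.findall(chunk)
        if tokens:
            sentences.append(tokens)
    return sentences

print(tokenise("i don't know. are you going to the store?"))
```

Keeping sentences separate prevents n-grams from spanning a full stop, so the last word of one sentence is never treated as context for the first word of the next.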

Step 3 — N-gram Construction

Build unigram, bigram, and trigram frequency tables stored as data frames
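Score candidates with Stupid Back-off to handle unseen n-grams efficiently: when an n-gram has not been observed, fall back to the next shorter n-gram and multiply its relative frequency by a fixed factor (0.4 in the original formulation)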
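To make Step 3 concrete, here is a small Python sketch that counts n-grams and scores a candidate word with Stupid Back-off. It uses `collections.Counter` objects rather than data frames purely for brevity; the function names and the three toy sentences are assumptions for illustration, and 0.4 is the back-off factor commonly used with this method.

```python
from collections import Counter

def count_ngrams(sentences, max_n=3):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables

def stupid_backoff(word, context, tables, alpha=0.4):
    """Score `word` after `context`, shortening the context until a match is found."""
    max_context = max(tables) - 1
    context = tuple(context)[-max_context:] if max_context > 0 else ()
    penalty = 1.0
    while context:
        n = len(context) + 1
        joint = tables[n][context + (word,)]
        if joint > 0:
            # Relative frequency of the n-gram given its (n-1)-word context.
            return penalty * joint / tables[n - 1][context]
        context = context[1:]   # drop the oldest word
        penalty *= alpha        # fixed discount per back-off step
    total = sum(tables[1].values())
    return penalty * tables[1][(word,)] / total if total else 0.0

toy = [
    ["i", "am", "going", "to", "the", "store"],
    ["i", "am", "going", "to", "the", "park"],
    ["we", "went", "to", "the", "store"],
]
tables = count_ngrams(toy)
seen = stupid_backoff("store", ("to", "the"), tables)     # trigram match: 2/3
backed = stupid_backoff("store", ("at", "the"), tables)   # bigram match: 0.4 * 2/3
print(round(seen, 3), round(backed, 3))
```

The resulting values are scores rather than probabilities: they are not renormalised after backing off, which is what keeps the method cheap.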

Step 4 — Prediction Logic

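Given the user’s last two typed words, look up trigrams that begin with them
If no trigram matches (or only one word has been typed), fall back to bigrams starting with the last word, then to unigrams
Return the top 3–5 candidate next words with their scores (Stupid Back-off scores are relative, not normalised probabilities)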
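The lookup-and-fall-back logic can be sketched in Python as below. This is one possible arrangement, not the planned implementation: it indexes continuations by context so that each lookup is a single dictionary access, and the names `build_continuations` and `predict`, the toy sentences, and the 0.4 discount are all assumptions. As a simplification, each score is normalised by the number of times the context was followed by any word.

```python
from collections import Counter, defaultdict

def build_continuations(sentences, max_n=3):
    """Map each context tuple (0 to max_n - 1 words) to a Counter of next words."""
    table = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k < 0:
                    break
                table[tuple(tokens[i - k:i])][word] += 1
    return table

def predict(text, table, top_k=3, max_n=3, alpha=0.4):
    """Return up to top_k (word, score) pairs, trying the longest context first."""
    words = text.lower().split()
    context = tuple(words[-(max_n - 1):]) if max_n > 1 else ()
    scores = {}
    penalty = 1.0
    while True:
        followers = table.get(context)
        if followers:
            total = sum(followers.values())
            for word, count in followers.items():
                # Keep the score from the longest context in which the word was seen.
                scores.setdefault(word, penalty * count / total)
        if len(scores) >= top_k or not context:
            break
        context = context[1:]   # fall back to a shorter context
        penalty *= alpha
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

toy = [
    ["i", "am", "going", "to", "the", "store"],
    ["i", "am", "going", "to", "the", "park"],
    ["we", "went", "to", "the", "store"],
]
table = build_continuations(toy)
suggestions = predict("I am going to the", table)
print(suggestions)
```

With this toy data the first two suggestions come from the trigram context and the remaining slot is filled from shorter contexts, which mirrors the trigram, bigram, unigram order described in the steps.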

Shiny App Design

    Next Word Predictor

    Type your text:
   ┌─────────────────────────────────────────┐
   │  I am going to the ...                  │
   └─────────────────────────────────────────┘

   Suggested next words:
   [ store ]  [ park ]  [ gym ]  [ beach ]

   ─────────────────────────────────────────
   Top Predictions (with confidence):
    1. store  —  42 %
    2. park   —  18 %
    3. gym    —  12 %

Key Shiny app features:

Text input box — prediction updates reactively as the user types
Word suggestion buttons — click to append the predicted word
Confidence bar chart — visualises prediction probabilities
Source selector — optional filter: Blogs / News / Twitter / All
Fast response — pre-computed n-gram tables loaded into memory at startup
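Loading tables into memory at startup is easier if they are trimmed beforehand. One option, offered here as an assumption rather than part of the stated plan, is to keep only the few best continuations per context, since the app never shows more than a handful of suggestions. The Python sketch below uses a hypothetical `prune_table` helper and made-up counts.

```python
from collections import Counter

def prune_table(table, keep=5, min_count=1):
    """Keep the `keep` most frequent next words per context, as (word, share) pairs."""
    pruned = {}
    for context, followers in table.items():
        total = sum(followers.values())
        best = [
            (word, count / total)
            for word, count in followers.most_common(keep)
            if count >= min_count
        ]
        if best:
            pruned[context] = best
    return pruned

# Made-up counts for a single context, for illustration only.
full = {("to", "the"): Counter({"store": 6, "park": 3, "gym": 1})}
small = prune_table(full, keep=2)
print(small)
```

The shares are computed before trimming, so they still reflect the full distribution of continuations for each context.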

Conclusion

This report demonstrates successful loading and exploratory analysis of the HC Corpora datasets. The data reveals distinct writing styles across sources — particularly Twitter’s brevity versus Blogs’ longer format. A back-off n-gram model and interactive Shiny app are planned to deliver efficient, real-time next-word prediction.

DS Capstone

Sandeep

2026-06-04