Introduction

This milestone report documents the exploratory data analysis of the three source corpora. The end goal of the project is to build a predictive text application, similar to smartphone keyboard suggestions, that predicts the next word a user will type.

The three source corpora used are:

Source    Description
Blogs     Personal blog entries
News      News articles
Twitter   Short-form social-media posts

Loading the Data

## Blogs lines: 899288
## News lines: 1010206
## Twitter lines: 2360148

File Summary Statistics

File Summary Table
File      Lines      Words       Size (MB)
Blogs     899288     37334131    210.2
News      1010206    34371031    205.8
Twitter   2360148    30373583    167.1
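
A minimal sketch of how figures like those in the table could be computed, written in Python for illustration only. It is not the code used for this report. The function name `summarise_file` is hypothetical, words are assumed to be whitespace-separated tokens, and one MB is assumed to be 1024 × 1024 bytes; a different word definition or size unit would give slightly different numbers.

```python
import os


def summarise_file(path):
    """Return (line count, word count, size in MB) for one text file.

    Assumptions: the file is UTF-8, a "word" is any whitespace-separated
    token, and 1 MB = 1024 * 1024 bytes.
    """
    lines = 0
    words = 0
    # Stream the file line by line so large corpora never sit in memory whole.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
    size_mb = os.path.getsize(path) / (1024 ** 2)
    return lines, words, size_mb
```

Calling `summarise_file` once per corpus file and collecting the tuples would give one table row per source.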

Key Observations

  • Twitter has the most lines (~2.36 million), but each entry is very short due to the character limit, making it more fragmented than the other datasets.

  • Blogs and News contain fewer lines, but each line is much longer and more information-dense.

  • The combined dataset is about 583 MB (roughly 4.3 million lines and 102 million words), large enough to support a robust text prediction model.

Exploratory Data Analysis

Words Per Line Distribution

Average & Median Words Per Line
Source    Mean    Median
Blogs     43.0    29
News      34.2    31
Twitter   12.9    12

Figure 1 — Words-per-line distribution (top) and top-20 most frequent words by corpus (bottom)

Key findings:

  • Twitter lines are the shortest: tightly clustered below 20 words, reflecting the short-form nature of social media posts.
  • Blogs are the most verbose: a right-skewed distribution with many lines exceeding 60 words.
  • Stop-words dominate everywhere: “the”, “and”, and “to” top all three corpora, so any comparison of content words needs them filtered out. They are retained for n-gram modelling, where they carry the context that next-word prediction relies on.
  • “I” is unusually prominent in Twitter, reflecting the first-person, conversational register of social media.
  • “said” is a News marker: journalism’s reliance on attributed quotations pushes it into the top 20.
  • News has the richest vocabulary: it has the most unique words in the sample, owing to varied subject matter and formal register.
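
The per-line statistics and word rankings discussed above could be computed along the following lines. This is an illustrative Python sketch under stated assumptions, not the report's own code: the function names are hypothetical, and words are assumed to be lowercase whitespace-separated tokens with no further cleaning.

```python
from collections import Counter
from statistics import mean, median


def words_per_line_stats(lines):
    """Return (mean, median) of the number of whitespace-separated words per line."""
    counts = [len(line.split()) for line in lines]
    return mean(counts), median(counts)


def top_words(lines, n=20):
    """Return the n most frequent lowercase tokens as (word, count) pairs."""
    freq = Counter(word for line in lines for word in line.lower().split())
    return freq.most_common(n)
```

A mean well above the median, as in the Blogs row, is what a right-skewed distribution with a long tail of very long lines would produce.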

Future Plan

Text Prediction Algorithm

The next-word prediction system will follow these steps:

Step 1 — Data Cleaning

  • Remove non-ASCII characters, URLs, email addresses, and numbers
  • Convert to lowercase; strip punctuation (retain sentence boundaries)
  • Remove profanity using a blocklist
  • Sample a representative subset if memory is a constraint
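
The cleaning rules above can be sketched as follows. This is a hedged Python illustration, not the planned implementation: the regular expressions are deliberately simple, the name `clean_text` is hypothetical, and the profanity blocklist is assumed to be supplied by the caller as a set of lowercase words. Sentence boundaries are kept by splitting on `.`, `!` and `?`, which will wrongly split abbreviations such as "Mr."; URLs and numbers are removed first so that their dots do not create false boundaries.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
SENTENCE_END_RE = re.compile(r"[.!?]+")
NON_WORD_RE = re.compile(r"[^a-z' ]+")


def clean_text(text, blocklist=frozenset()):
    """Return a list of cleaned sentences (lowercase words joined by spaces)."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    # Drop every non-ASCII character, then normalise case.
    text = text.encode("ascii", errors="ignore").decode("ascii").lower()
    sentences = []
    for chunk in SENTENCE_END_RE.split(text):
        # Strip remaining punctuation, keeping apostrophes inside words.
        chunk = NON_WORD_RE.sub(" ", chunk)
        words = [word.strip("'") for word in chunk.split()]
        words = [word for word in words if word and word not in blocklist]
        if words:
            sentences.append(" ".join(words))
    return sentences
```

For example, `clean_text("That darn cat", blocklist={"darn"})` drops the blocked word and returns a single sentence. Returning one string per sentence keeps n-grams from spanning sentence boundaries later on.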

Step 2 — Tokenisation

  • Split cleaned text into individual tokens (words)
  • Stop-words are kept for prediction (context matters for next-word suggestions)

Step 3 — N-gram Construction

  • Build unigram, bigram, and trigram frequency tables stored as data frames
  • Apply Stupid Back-off to score unseen n-grams efficiently (it produces relative scores rather than normalised probabilities)
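
As a rough illustration of Steps 2 and 3, the sketch below tokenises cleaned sentences on whitespace and counts n-grams of a given order. It is written in Python and uses a `Counter` keyed by word tuples in place of the data frames mentioned above; the function name `ngram_counts` is hypothetical, and the input is assumed to be one cleaned sentence per string.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count n-grams of order n; keys are tuples of n consecutive words."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # Step 2: whitespace tokenisation
        # Slide a window of n tokens across the sentence; n-grams never
        # cross sentence boundaries because each sentence is handled alone.
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts
```

Calling it with `n` set to 1, 2 and 3 gives the unigram, bigram and trigram tables; each could then be converted to a frequency-sorted table for storage.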

Step 4 — Prediction Logic

  • Given the user’s last two typed words, look up trigrams that begin with them
  • If none match, fall back to bigrams that begin with the last word, then to the most frequent unigrams
  • Return the top 3–5 candidate next words with their scores
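
The lookup described above can be sketched with a small Stupid Back-off scorer. This is an illustrative Python sketch, not the final implementation: the names `build_model` and `predict_next` are hypothetical, the back-off factor of 0.4 is the value commonly used with Stupid Back-off and is an assumption here, and the model is a plain dictionary from context tuples to next-word counts. A word's score is its count after the context divided by the count of the context itself; each step back to a shorter context multiplies the score by the back-off factor.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # assumed back-off factor; tune on held-out data


def build_model(sentences):
    """Map each context (0, 1 or 2 preceding words) to a Counter of next words."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for size in range(3):  # unigram, bigram and trigram contexts
                if size > i:
                    break
                followers[tuple(tokens[i - size:i])][word] += 1
    return dict(followers)


def predict_next(model, text, k=3, alpha=ALPHA):
    """Return up to k (word, score) pairs, best first."""
    context = tuple(text.lower().split()[-2:])
    scores = {}
    weight = 1.0
    while True:
        if context:
            # How often the context itself was seen as an n-gram.
            seen = model.get(context[:-1], {}).get(context[-1], 0)
        else:
            seen = sum(model.get((), {}).values())  # total token count
        if seen:
            for word, count in model.get(context, {}).items():
                # Keep the score from the longest context that saw the word.
                scores.setdefault(word, weight * count / seen)
        if not context:
            break
        context = context[1:]  # back off to a shorter context
        weight *= alpha
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

Because the scores are not normalised, they rank candidates but should not be read as true probabilities. Indexing by context keeps each lookup to a few dictionary accesses, which suits tables that are pre-computed and held in memory.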

Shiny App Design

    Next Word Predictor

    Type your text:
   ┌─────────────────────────────────────────┐
   │  I am going to the ...                  │
   └─────────────────────────────────────────┘

   Suggested next words:
   [ store ]  [ park ]  [ gym ]  [ beach ]

   ─────────────────────────────────────────
   Top Predictions (with confidence):
    1. store  —  42 %
    2. park   —  18 %
    3. gym    —  12 %

Key Shiny app features:

  • Text input box — prediction updates reactively as the user types
  • Word suggestion buttons — click to append the predicted word
  • Confidence bar chart — visualises prediction probabilities
  • Source selector — optional filter: Blogs / News / Twitter / All
  • Fast response — pre-computed n-gram tables loaded into memory at startup

Conclusion

This report demonstrates successful loading and exploratory analysis of the HC Corpora datasets. The data reveals distinct writing styles across sources — particularly Twitter’s brevity versus Blogs’ longer format. A back-off n-gram model and interactive Shiny app are planned to deliver efficient, real-time next-word prediction.