---
title: "Data Science Capstone - Milestone Report"
author: "Syed Mohammed"
date: "2026-06-29"
output: html_document
---

Introduction

The purpose of this milestone report is to explore the HC Corpora dataset provided for the Johns Hopkins Data Science Capstone project. The dataset contains text collected from blogs, news articles and Twitter posts. This report summarizes the basic characteristics of the data and presents a simple visualization that will help guide the development of a predictive text model.

Basic Summary Statistics

Source	Lines	Characters
Blogs	899288	206824505
News	1010206	203214543
Twitter	2360148	162096241
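
Counts like those in the table can be gathered with a short script. The following is a minimal sketch in Python, not the code used to produce the table; the function name `summarize_file` is hypothetical, and the choice to exclude line terminators from the character count is an assumption that may differ slightly from how the figures above were obtained.

```python
from pathlib import Path


def summarize_file(path, encoding="utf-8"):
    """Return (line_count, character_count) for a text file.

    Assumption: characters are counted per line with the trailing
    newline removed, so the total excludes line terminators.
    """
    lines = 0
    chars = 0
    # Stream the file line by line so a large corpus file
    # does not have to fit in memory all at once.
    with Path(path).open("r", encoding=encoding, errors="ignore") as handle:
        for line in handle:
            lines += 1
            chars += len(line.rstrip("\r\n"))
    return lines, chars
```

Calling `summarize_file` once per source file (blogs, news, Twitter) would give one table row each; the file paths depend on where the corpus was unpacked.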

Visualization

Findings

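The Twitter dataset contains the largest number of lines (2,360,148), but its lines are short: dividing characters by lines gives roughly 69 characters per line, against about 230 for Blogs and 201 for News. Blogs and News therefore contain fewer but considerably longer pieces of text. Together the three sources cover different writing styles, from short tweets to longer blog and news text, which makes the combined corpus a reasonable basis for a predictive text model.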
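
The average line length per source follows directly from the summary table. The short Python sketch below does that arithmetic; the counts are copied from the table, while the names `counts` and `average_line_length` are only illustrative.

```python
# Line and character counts copied from the summary table.
counts = {
    "Blogs": (899288, 206824505),
    "News": (1010206, 203214543),
    "Twitter": (2360148, 162096241),
}


def average_line_length(stats):
    """Map each source to its mean number of characters per line."""
    return {source: chars / lines for source, (lines, chars) in stats.items()}


averages = average_line_length(counts)
for source, avg in averages.items():
    print(f"{source}: {avg:.1f} characters per line")
# Blogs: 230.0 characters per line
# News: 201.2 characters per line
# Twitter: 68.7 characters per line
```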

Future Work

The next stage of the project will involve cleaning the text, removing unnecessary characters and punctuation, creating n-grams, and building a predictive model that can suggest the next word based on previous words.
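
The planned pipeline can be illustrated on a small scale. The sketch below is a simplified Python example under stated assumptions, not the final model: it lowercases the text, drops punctuation and digits with a regular expression, counts bigrams (n-grams with n = 2), and suggests the most frequent followers of a single previous word. The function names `tokenize`, `build_bigram_model`, and `predict_next` are hypothetical, and a real model would likely use longer n-grams and some form of backoff or smoothing.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_bigram_model(lines):
    """Count, for each word, which words follow it in the training lines."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        # zip pairs each token with its successor, giving the bigrams.
        for previous, following in zip(tokens, tokens[1:]):
            model[previous][following] += 1
    return model


def predict_next(model, word, k=3):
    """Return up to k followers of `word`, most frequent first."""
    followers = model.get(word.lower())
    if not followers:
        return []
    return [candidate for candidate, _ in followers.most_common(k)]
```

For example, after training on the three lines `"I love data science."`, `"I love R!"`, and `"I like data, too"`, calling `predict_next(model, "I")` ranks `love` ahead of `like` because it follows `i` twice rather than once.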