---
title: "Data Science Capstone - Milestone Report"
author: "Syed Mohammed"
date: "2026-06-29"
output: html_document
---
The purpose of this milestone report is to explore the HC Corpora dataset provided for the Johns Hopkins Data Science Capstone project. The dataset contains text collected from blogs, news articles and Twitter posts. This report summarizes the basic characteristics of the data and presents a simple visualization that will help guide the development of a predictive text model.
| Source | Lines | Characters |
|---|---|---|
| Blogs | 899288 | 206824505 |
| News | 1010206 | 203214543 |
| Twitter | 2360148 | 162096241 |
The Twitter dataset contains the largest number of lines, while the Blogs and News datasets contain fewer but longer pieces of text: about 230 and 201 characters per line on average, compared with about 69 for Twitter. The three datasets cover different writing styles, which makes them suitable for building a predictive text model.
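The average line length for each source follows directly from the table by dividing characters by lines. The short Python sketch below shows that arithmetic; the `counts` dictionary and the `average_line_length` helper are illustrative names, not part of the original analysis.

```python
# Line and character counts copied from the summary table in this report.
counts = {
    "Blogs": (899288, 206824505),
    "News": (1010206, 203214543),
    "Twitter": (2360148, 162096241),
}


def average_line_length(lines, characters):
    """Return the mean number of characters per line."""
    return characters / lines


# Hypothetical helper applied to every source in the table.
averages = {
    source: average_line_length(lines, characters)
    for source, (lines, characters) in counts.items()
}

for source, avg in averages.items():
    print(f"{source}: {avg:.1f} characters per line")
```

Twitter has the shortest lines by a wide margin, so a model trained on all three sources sees many short, informal fragments alongside longer prose.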
The next stage of the project will involve cleaning the text, removing unnecessary characters and punctuation, creating n-grams, and building a predictive model that can suggest the next word based on previous words.
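The steps listed above can be sketched in a few lines. The Python example below is only an illustration under stated assumptions: the report does not specify the cleaning rules or the model, so the function names (`clean`, `ngrams`, `build_model`, `predict_next`), the choice to keep apostrophes, and the simple "most frequent follower with back-off to a shorter context" rule are all hypothetical. It does no smoothing and is not tuned for the full corpus.

```python
import re
from collections import Counter, defaultdict


def clean(text):
    """Lowercase the text, replace anything that is not a letter,
    apostrophe or space with a space, and split into words."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_model(lines, max_n=3):
    """Map each context of 1..max_n-1 words to a Counter of following words."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = clean(line)
        for n in range(2, max_n + 1):
            for gram in ngrams(tokens, n):
                model[gram[:-1]][gram[-1]] += 1
    return model


def predict_next(model, text, max_n=3):
    """Suggest the most frequent follower, trying the longest context first
    and backing off to shorter contexts; return None if nothing matches."""
    tokens = clean(text)
    for n in range(max_n - 1, 0, -1):
        context = tuple(tokens[-n:])
        if len(context) == n and context in model:
            return model[context].most_common(1)[0][0]
    return None


# Toy input standing in for lines sampled from the corpus (made up here).
sample_lines = [
    "I love data science",
    "I love data analysis",
    "I love data science a lot",
]
model = build_model(sample_lines)
print(predict_next(model, "We love data"))
```

Trying the two-word context before the one-word context means a longer match is preferred when the corpus contains it, while unseen phrases can still get a suggestion from the last word alone.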