Data Science Capstone - Milestone Report

1. Introduction

The goal of this project is to perform an exploratory data analysis on the provided text datasets (Blogs, News, and Twitter) and outline a plan for creating a text prediction algorithm. This report is written to be clear and concise for a manager without a data science background.

2. Summary Statistics

The dataset was downloaded and loaded into R. The table below gives the line count and word count for each of the three files.

##      File_Name Line_Count Word_Count
## 1   Blogs Data     899288   37546806
## 2    News Data    1010206   34761151
## 3 Twitter Data    2360148   30096690
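
The counts above could be reproduced along the following lines. This is a minimal base R sketch, not necessarily the code used for this report: the helper name `summarise_file` is hypothetical, and a "word" is assumed to be any whitespace-separated token, so totals may differ slightly from those produced by other tokenizers.

```r
# Hypothetical helper: summarise one text file by line count and word count.
summarise_file <- function(path, label) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Assumption: a word is any non-empty token separated by whitespace.
  tokens <- strsplit(trimws(lines), "\\s+")
  word_count <- sum(vapply(tokens, function(x) sum(nzchar(x)), integer(1)))
  data.frame(File_Name = label,
             Line_Count = length(lines),
             Word_Count = word_count)
}

# Small demonstration on a temporary file (not the real dataset).
tmp <- tempfile(fileext = ".txt")
writeLines(c("hello world", "", "one two three"), tmp)
demo <- summarise_file(tmp, "Demo Data")
print(demo)
```

For the real data, the same function would be called once per file and the three results combined with `rbind()`.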

3. Interesting Findings

To understand the most common words, a 1% sample of the data was taken, cleaned (removing punctuation, numbers, and common English stopwords), and analyzed. Below is the histogram showing the top 15 most frequently used words in our text sample.

[Figure: histogram of the 15 most frequently used words in the 1% sample]
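
The sampling and cleaning steps described above can be sketched in base R as follows. The function names, the seed, and the short stopword vector are all assumptions made for illustration; the report does not say which package or stopword list was used, and a fuller list (for example from a text-mining package) would normally replace the one shown here.

```r
set.seed(2026)  # hypothetical seed, only so the sample is repeatable

# Draw a random fraction of the lines (1% by default).
sample_lines <- function(lines, fraction = 0.01) {
  n <- max(1, floor(length(lines) * fraction))
  lines[sample.int(length(lines), n)]
}

# Lower-case the text, drop punctuation and digits, split into words,
# and remove stopwords.
clean_words <- function(lines, stopwords) {
  text <- tolower(lines)
  text <- gsub("[[:punct:]]", "", text)
  text <- gsub("[[:digit:]]", "", text)
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[nzchar(words)]
  words[!words %in% stopwords]
}

# Return the n most frequent words with their counts.
top_words <- function(words, n = 15) {
  freq <- sort(table(words), decreasing = TRUE)
  freq[seq_len(min(n, length(freq)))]
}

# Illustrative stopword list (an assumption, deliberately tiny).
demo_stopwords <- c("the", "and", "a")
demo_words <- clean_words(c("The cat, the CAT and 2 dogs!", "A cat."),
                          demo_stopwords)
demo_top <- top_words(demo_words, n = 15)
print(demo_top)
```

Passing the result to `barplot()` would give a chart comparable to the one in this section.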
4. Future Plan for Prediction Algorithm and Shiny App

Based on this analysis, the next steps for creating the prediction algorithm and Shiny app are:

Algorithm: I will build an N-gram model (unigrams, bigrams, and trigrams) to understand which words frequently appear together. This will help predict the next possible word based on the previous words typed.
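
As a rough illustration of the N-gram idea, the base R sketch below counts unigrams, bigrams, and trigrams and then predicts the next word from the longest matching context. The function names are hypothetical, and the fallback from trigrams to bigrams to unigrams is an assumption about how the three tables might be combined; the plan above does not specify this.

```r
# Build all n-grams of size n from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency tables for unigrams, bigrams, and trigrams (list positions 1 to 3).
build_model <- function(tokens) {
  lapply(1:3, function(n) table(make_ngrams(tokens, n)))
}

# Predict the next word: try the last two words against the trigram table,
# then the last word against the bigram table, then the most common unigram.
predict_next <- function(model, phrase) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  for (k in seq(min(2, length(words)), 0)) {
    counts <- model[[k + 1]]
    if (length(counts) == 0) next
    if (k == 0) return(names(counts)[which.max(counts)])
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))  # keep only the final word of the n-gram
    }
  }
  character(0)
}

# Toy corpus for demonstration only.
toy_tokens <- c("i", "love", "data", "science",
                "i", "love", "data", "analysis",
                "i", "love", "r")
toy_model <- build_model(toy_tokens)
print(predict_next(toy_model, "i love"))
```

On a full corpus, plain frequency tables like these grow large, so pruning rare n-grams is a common practical step before deployment.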

Shiny App: I plan to create a simple, user-friendly interface. The user will type a phrase into a text box, and the app will use the algorithm to instantly suggest the most likely next word.
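
A minimal sketch of such an interface is shown below, assuming the `shiny` package is installed. The function `predict_next_word` is a hypothetical placeholder that always returns the same word; in the finished app it would be replaced by a lookup against the trained N-gram model.

```r
library(shiny)

# Hypothetical placeholder for the trained model's lookup function.
predict_next_word <- function(phrase) {
  if (!nzchar(trimws(phrase))) return("")
  "the"  # stand-in result; a real model would consult the n-gram tables
}

# One text box for the phrase and one output for the suggestion.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  strong("Suggested next word:"),
  textOutput("suggestion")
)

# The suggestion is recomputed whenever the text box changes.
server <- function(input, output, session) {
  output$suggestion <- renderText({
    predict_next_word(input$phrase)
  })
}

app <- shinyApp(ui = ui, server = server)
# To launch the app in an interactive session: runApp(app)
```

Because `renderText()` is reactive, the suggestion updates as the user types without needing a separate submit button.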

Vaibhav

June 2026
