Data Science Capstone: Milestone Report

Executive Summary

This report serves as a milestone for the Data Science Capstone Project. The goal is to build a predictive text model using a large corpus of text documents. This document outlines the exploratory data analysis (EDA) of the training data sets, summarizes the basic statistics, and describes the plan for building the final predictive algorithm and Shiny application.

1. Data Loading and Summary Statistics

We are using the HC Corpora dataset, which consists of three files: Blogs, News, and Twitter. Below is a summary of the file sizes, line counts, and word counts.

Table 1: Data Summary
File	Size_MB	Lines	Words
Blogs	200.42	899288	37546806
News	196.28	1010242	34762658
Twitter	159.36	2360148	30096690

Together the three files total about 556 MB, roughly 4.27 million lines and 102 million words. To make the analysis feasible on standard hardware, we work with a random sample of the data.
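As an illustration of how the figures in Table 1 could be computed, here is a minimal sketch. It is written in Python purely for illustration and is not the report's own code. The function name `summarize_file`, the whitespace-based word count, and the 1024 × 1024 bytes-per-MB convention are all assumptions, so the results may differ slightly from the numbers reported above.

```python
import os


def summarize_file(path):
    """Return size in MB, line count, and word count for one text file.

    Assumptions (not taken from the report): a "word" is any
    whitespace-delimited token, and 1 MB = 1024 * 1024 bytes.
    """
    size_mb = os.path.getsize(path) / 1024 ** 2
    lines = 0
    words = 0
    # Stream the file line by line so large files never sit fully in memory.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
    return {"size_mb": round(size_mb, 2), "lines": lines, "words": words}
```

Calling `summarize_file` once per corpus file and collecting the dictionaries would give one table row per file.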

Exploratory Analysis (Word Frequencies)

We sampled 0.5% of the data to analyze the most frequent words.

As expected, stopwords such as ‘the’, ‘and’, and ‘to’ are the most frequent words in the sample.
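The sampling and counting step described above can be sketched as follows. This is a hedged illustration in Python, not the code used to produce the report. The helper names `sample_lines` and `word_frequencies`, the fixed seed, and the simple lowercase-letters-and-apostrophes tokenizer are all assumptions.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.005, seed=42):
    """Keep each line independently with probability `fraction` (0.5% by default)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def word_frequencies(lines):
    """Count lowercase word tokens across the given lines."""
    counts = Counter()
    for line in lines:
        # Assumed tokenizer: runs of letters and apostrophes, case-folded.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts
```

For example, `word_frequencies(sample_lines(all_lines)).most_common(20)` would list the twenty most frequent tokens in the sample, where `all_lines` is a hypothetical list holding the lines of the three files.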

Future Plans: Prediction Algorithm & App

The Algorithm

My approach for the prediction model will be:

N-gram Model: I will use Trigrams (3 words) and Bigrams (2 words) to predict the next word.

Backoff Strategy: If a Trigram is not found, the model will ‘back off’ to a Bigram.

Performance: I will remove rare words (singletons, which occur only once) to keep the model small and fast.
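The three steps above can be sketched as a small backoff predictor. This is a minimal Python illustration under stated assumptions, not the final algorithm. The names `build_model` and `predict_next` are hypothetical, tokenization is plain whitespace splitting, and the backoff is the simplest possible kind, with no discounting or smoothing. Pruning is also applied to n-grams seen fewer than `min_count` times rather than to the vocabulary, which is one possible reading of the singleton rule.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def prune(table, min_count):
    """Drop followers seen fewer than min_count times, and contexts left empty."""
    pruned = {}
    for context, followers in table.items():
        kept = {word: n for word, n in followers.items() if n >= min_count}
        if kept:
            pruned[context] = kept
    return pruned


def build_model(sentences, min_count=2):
    """Count bigram and trigram continuations, then prune singletons."""
    bigrams = defaultdict(Counter)   # (w1,)    -> counts of the next word
    trigrams = defaultdict(Counter)  # (w1, w2) -> counts of the next word
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - 1):
            bigrams[(tokens[i],)][tokens[i + 1]] += 1
        for i in range(len(tokens) - 2):
            trigrams[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return prune(bigrams, min_count), prune(trigrams, min_count)


def predict_next(model, text):
    """Try the trigram context first; back off to the bigram context if unseen."""
    bigrams, trigrams = model
    tokens = tokenize(text)
    if len(tokens) >= 2:
        followers = trigrams.get((tokens[-2], tokens[-1]))
        if followers:
            return max(followers, key=followers.get)
    if tokens:
        followers = bigrams.get((tokens[-1],))
        if followers:
            return max(followers, key=followers.get)
    return None  # no known context; a real app might fall back to frequent words
```

With a toy corpus in which "love new york" appears several times, `predict_next(model, "brand new")` has no matching trigram context and falls back to the most frequent word after "new", which is the backoff behaviour described above.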

The Shiny App

The final app will be hosted on shinyapps.io. It will feature:

A text input box for user queries.

Real-time prediction of the next word.

A clean and simple user interface.

Muhammad Zubair

2026-02-07