Executive Summary

This report serves as a milestone for the Data Science Capstone Project. The goal is to build a predictive text model using a large corpus of text documents. This document outlines the exploratory data analysis (EDA) of the training data sets, summarizes the basic statistics, and describes the plan for building the final predictive algorithm and Shiny application.

1. Data Loading and Summary Statistics

We are using the HC Corpora dataset, which consists of three files: Blogs, News, and Twitter. Below is a summary of the file sizes, line counts, and word counts.

Table 1: Data Summary
File      Size (MB)   Lines     Words
Blogs     200.42      899288    37546806
News      196.28      1010242   34762658
Twitter   159.36      2360148   30096690
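
The figures in Table 1 can be gathered in a single streaming pass per file. The following is a minimal sketch in Python, given for illustration only and not the code that produced the table. It assumes that words are whitespace-delimited tokens and that 1 MB means 1024 × 1024 bytes; the report does not state which conventions were used, and `summarize_file` is a hypothetical helper name.

```python
import os
import tempfile

def summarize_file(path):
    """Return (size_mb, line_count, word_count) for a UTF-8 text file."""
    size_mb = os.path.getsize(path) / (1024 ** 2)  # assumption: 1 MB = 1024 * 1024 bytes
    lines = 0
    words = 0
    # Stream the file line by line so a large corpus is never held in memory at once.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())  # assumption: words are whitespace-delimited
    return round(size_mb, 2), lines, words

# Tiny stand-in file; the real inputs would be the Blogs, News, and Twitter files.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the quick brown fox\njumps over the lazy dog\n")
    demo_path = tmp.name

demo_summary = summarize_file(demo_path)
os.remove(demo_path)
print(demo_summary)
```

A different tokenizer (for example, one that splits on punctuation) would give somewhat different word counts, so the numbers are only comparable when the same convention is applied to all three files.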

The table above shows the scale of the dataset: together the three files hold about 556 MB of text, 4.27 million lines, and 102 million words. To make our analysis feasible on standard hardware, we will sample the data.
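
One simple way to draw such a sample is to keep each line independently with a fixed probability. The sketch below is an illustration in Python, not the report's own code. The function name `sample_lines` and the seed value are assumptions, and the 0.005 rate corresponds to the 0.5% sample used in the exploratory analysis.

```python
import random

def sample_lines(lines, rate, seed=42):
    """Keep each line independently with probability `rate` (Bernoulli sampling)."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) makes the sample reproducible
    return [line for line in lines if rng.random() < rate]

# Stand-in corpus; in practice the lines would be streamed from a corpus file.
corpus = [f"line {i}" for i in range(100_000)]
sample = sample_lines(corpus, rate=0.005)
print(len(sample))  # roughly 0.5% of 100,000 lines; the exact count depends on the seed
```

Because each line is an independent draw, the sample size varies slightly around the target, but the approach works on a stream and never needs the full file in memory.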

2. Exploratory Analysis (Word Frequencies)

We sampled 0.5% of the data to analyze the most frequent words.

As expected, stopwords such as ‘the’, ‘and’, and ‘to’ are the most frequent words in the sample.
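
A word-frequency count of this kind can be sketched as follows. This is an illustrative Python example, not the analysis code behind the report. The tokenizer (lowercased runs of letters and apostrophes) and the two example sentences are assumptions made for the demonstration.

```python
import re
from collections import Counter

def top_words(lines, n=10):
    """Count lowercase word tokens and return the n most frequent."""
    counts = Counter()
    for line in lines:
        # Assumed tokenizer: runs of letters and apostrophes, after lowercasing.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts.most_common(n)

# Made-up sentences standing in for the sampled corpus lines.
sample_text = [
    "The cat sat on the mat and the dog sat too",
    "I want to go to the park and to the store",
]
top = top_words(sample_text, n=2)
print(top)  # [('the', 5), ('to', 3)]
```

Even in this toy input the stopwords rise to the top, which is why many analyses also report frequencies with stopwords filtered out.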

3. Future Plans: Prediction Algorithm & App

The Algorithm

My approach for the prediction model will be:

N-gram Model: I will use Trigrams (3 words) and Bigrams (2 words) to predict the next word.

Backoff Strategy: If no Trigram matches the user's most recent words, the model will ‘back off’ to a Bigram, which uses only the last word.

Performance: I will remove rare words (singletons, which appear only once) to keep the model small and prediction fast.
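
The three points above can be sketched together in a few lines. This is a minimal illustration in Python under stated assumptions, not the final implementation. Tokens are split on whitespace, the names `build_model` and `predict` are hypothetical, and the pruning step is applied to n-grams seen only once, which is one possible reading of the singleton removal described above.

```python
from collections import Counter, defaultdict

def build_model(sentences, min_count=2):
    """Count bigrams and trigrams, dropping those seen fewer than min_count times."""
    bigrams, trigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenization
        bigrams.update(zip(tokens, tokens[1:]))
        trigrams.update(zip(tokens, tokens[1:], tokens[2:]))

    def index(ngrams):
        # Map each context (all words but the last) to a Counter of next words.
        table = defaultdict(Counter)
        for gram, count in ngrams.items():
            if count >= min_count:  # prune singleton n-grams
                table[gram[:-1]][gram[-1]] = count
        return table

    return index(bigrams), index(trigrams)

def predict(text, bigram_table, trigram_table):
    """Return the most likely next word, backing off from trigram to bigram."""
    tokens = text.lower().split()
    # Try the two-word context first, then fall back to the one-word context.
    for table, size in ((trigram_table, 2), (bigram_table, 1)):
        context = tuple(tokens[-size:])
        if len(context) == size and context in table:
            return table[context].most_common(1)[0][0]
    return None  # no match at either level

# Made-up training sentences for the demonstration.
sentences = [
    "thanks for the follow",
    "thanks for the support",
    "thanks for the follow",
    "looking for the best",
]
bigram_table, trigram_table = build_model(sentences)
print(predict("thanks for the", bigram_table, trigram_table))  # follow (trigram match)
print(predict("in the", bigram_table, trigram_table))          # follow (backed off to bigram)
print(predict("hello", bigram_table, trigram_table))           # None (no match)
```

Pruning shrinks both lookup tables at the cost of coverage: "for the support" occurs only once in the toy data, so it is dropped and can never be predicted.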

The Shiny App

The final app will be hosted on shinyapps.io. It will feature:

A text input box for user queries.

Real-time word prediction.

A clean and simple user interface.