Executive Summary
This report is a milestone for the Data Science Capstone Project, whose goal is to build a predictive text model from a large corpus of text documents. It presents the exploratory data analysis (EDA) of the training data sets, summarizes their basic statistics, and describes the plan for building the final prediction algorithm and Shiny application.
1. Data Loading and Summary Statistics
We are using the HC Corpora dataset, which consists of three files: Blogs, News, and Twitter. Below is a summary of the file sizes, line counts, and word counts.
| File | Size (MB) | Lines | Words |
|---|---|---|---|
| Blogs | 200.42 | 899288 | 37546806 |
| News | 196.28 | 1010242 | 34762658 |
| Twitter | 159.36 | 2360148 | 30096690 |
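The counts in a table like this can be reproduced with a short script. The following is a minimal Python sketch, not the code used for this report; the function name `summarize_text_file` is hypothetical, and it assumes a "word" is any whitespace-delimited token, so the totals may differ slightly from those produced by another tokenizer.

```python
import os


def summarize_text_file(path, encoding="utf-8"):
    """Return size in MB, line count, and whitespace-delimited word count.

    Assumption: words are split on whitespace only; punctuation stays
    attached to the neighbouring token.
    """
    size_mb = os.path.getsize(path) / (1024 ** 2)
    lines = 0
    words = 0
    # Stream the file line by line so large corpora never sit fully in memory.
    with open(path, "r", encoding=encoding, errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
    return {"size_mb": round(size_mb, 2), "lines": lines, "words": words}
```

Calling the function once per file and collecting the returned dictionaries gives one table row per file.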
Together the three files total about 556 MB, 4.27 million lines, and 102 million words. To make the analysis feasible on standard hardware, we will work with a random sample of the data.
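One simple way to draw such a sample is to keep each line independently with a fixed probability. The sketch below illustrates this in Python under stated assumptions: the function name `sample_lines`, the seed, and any sampling fraction passed to it are hypothetical, since the report does not state the fraction actually used.

```python
import random


def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    A fixed seed makes the sample reproducible between runs.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]
```

Because every line is kept or dropped independently, the sample size is only approximately `fraction` times the input size, which is usually acceptable for exploratory analysis.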
As expected, stopwords such as ‘the’, ‘and’, and ‘to’ are the most frequent words in the sample.
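A frequency ranking of this kind can be obtained by tokenizing the sampled lines and counting the tokens. The following Python sketch is illustrative only: the names `TOKEN_PATTERN` and `top_words` are hypothetical, and the tokenizer (lowercase letters and straight apostrophes) is an assumption rather than the cleaning procedure used in this report.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of lowercase letters and straight apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z']+")


def top_words(lines, n=10):
    """Return the n most frequent tokens as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_PATTERN.findall(line.lower()))
    return counts.most_common(n)
```

Stopwords dominate such a ranking because they occur in almost every sentence, regardless of topic.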
N-gram Model: I will use trigrams (three-word sequences) and bigrams (two-word sequences) to predict the next word.
Backoff Strategy: If no trigram matches the last two words typed, the model will ‘back off’ to a bigram based on the last word alone.
Performance: I will remove rare words (singletons) to keep the model small and prediction fast.
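The three points above can be sketched together in a few lines of Python. This is a hedged illustration, not the final algorithm: the names `build_tables` and `predict_next` are hypothetical, "singletons" is interpreted here as n-grams observed only once (an assumption), and returning `None` when neither table matches is also an assumption, since the plan does not say what happens after the bigram step.

```python
from collections import Counter, defaultdict


def build_tables(sentences, min_count=2):
    """Build bigram and trigram lookup tables from tokenized sentences.

    Assumption: n-grams seen fewer than `min_count` times are dropped,
    so the default of 2 removes singletons.
    """
    bigram_counts = Counter()
    trigram_counts = Counter()
    for tokens in sentences:
        bigram_counts.update(zip(tokens, tokens[1:]))
        trigram_counts.update(zip(tokens, tokens[1:], tokens[2:]))

    bigram_table = defaultdict(Counter)   # previous word -> next-word counts
    trigram_table = defaultdict(Counter)  # previous two words -> next-word counts
    for (w1, w2), count in bigram_counts.items():
        if count >= min_count:
            bigram_table[w1][w2] = count
    for (w1, w2, w3), count in trigram_counts.items():
        if count >= min_count:
            trigram_table[(w1, w2)][w3] = count
    return dict(bigram_table), dict(trigram_table)


def predict_next(tokens, bigram_table, trigram_table):
    """Predict the next word, trying the trigram table before the bigram table."""
    if len(tokens) >= 2:
        candidates = trigram_table.get((tokens[-2], tokens[-1]))
        if candidates:
            return candidates.most_common(1)[0][0]
    if tokens:
        # Back off: no trigram matched, so use only the last word.
        candidates = bigram_table.get(tokens[-1])
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # Assumed fallback when neither table has a match.
```

Storing the counts keyed by context keeps each prediction to a pair of dictionary lookups, which suits an interactive application.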
The Shiny App
The final app will be hosted on shinyapps.io. It will feature:
A text input box for user queries.
Word predictions displayed in real time.
A clean and simple user interface.