Data Science Capstone: Milestone Report

1. Executive Summary

The goal of this project is to build a predictive text engine that suggests the next word as a user types. This report presents the initial exploratory analysis of the training data, identifies the main features of the three language datasets, and outlines the plan for the prediction algorithm and the final Shiny application.

2. Data Loading and Summary Statistics

We loaded three datasets: Blogs, News, and Twitter. Because the raw files are large (roughly 160 to 200 MB each; see Table 1), we took a representative sample of each file for this analysis to keep processing time manageable.
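The sampling step described above can be sketched as follows. This is a minimal illustration in Python, not the code used for this report: the function name `sample_lines`, the 5% fraction, and the fixed seed are all assumptions chosen for the example, and the stand-in corpus is generated rather than read from the real files.

```python
import random


def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


# Stand-in corpus (hypothetical); the real input would be the lines of a dataset file.
corpus = [f"line {i}" for i in range(10_000)]

# The 5% fraction is an assumption for illustration only.
sample = sample_lines(corpus, fraction=0.05)
print(len(sample), "of", len(corpus), "lines kept")
```

Sampling line by line like this means the full file never has to be tokenized, and reusing the same seed gives the same sample on every run.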

Table 1: Overview of Raw Datasets
Data Source	Total Lines	Word Count	File Size (MB)
Blogs	899,288	37.3M	200
News	1,010,242	34.4M	196
Twitter	2,360,148	30.3M	159

3. Exploratory Findings: Common Word Pairs

A key part of predicting text is understanding bigrams (pairs of words that appear next to each other). Below are the most frequent word pairs found in our sampled data.
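As an illustration of the method only, bigram frequencies can be counted as in the sketch below. The three toy sentences are made up for the example and are not the project data, and the helper names `bigrams` and `count_bigrams` are hypothetical.

```python
from collections import Counter


def bigrams(tokens):
    """Return every pair of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


def count_bigrams(sentences):
    """Count bigrams within each sentence (pairs never span two sentences)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(bigrams(tokens))
    return counts


# Toy input, invented for this example.
toy = ["thanks for the follow", "thanks for the help", "for the win"]
counts = count_bigrams(toy)

# Print the pairs from most to least frequent.
for pair, n in counts.most_common():
    print(" ".join(pair), n)
```

Counting per sentence rather than over one long token stream avoids creating pairs out of the last word of one line and the first word of the next.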

4. Interesting Discoveries

Vocabulary Coverage: We found that a small percentage of the unique words accounts for the majority of all word occurrences in the text.

Data Cleanliness: The data contains significant “noise” (punctuation, emojis, profanity) that must be filtered to ensure accurate predictions.
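Both observations above can be made concrete with a short sketch: first filter the noise, then measure how many of the most frequent unique words are needed to cover a given share of all word occurrences. This is an assumed approach in Python, not the report's own code. The names `clean_tokens` and `words_needed_for_coverage` are hypothetical, and `BLOCKLIST` holds mild placeholder words standing in for a real profanity list.

```python
import re
from collections import Counter

# Placeholder entries (assumption); a real profanity list would be loaded from a file.
BLOCKLIST = {"darn", "heck"}


def clean_tokens(text, blocklist=BLOCKLIST):
    """Lowercase, keep only runs of a-z and apostrophes, drop blocklisted words.

    Punctuation, digits, and emojis are discarded. Note that this simple
    pattern also discards accented letters and curly apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t.strip("'") and t not in blocklist]


def words_needed_for_coverage(tokens, target):
    """Number of most-frequent unique words covering `target` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(counts)


# Toy text, invented for this example.
tokens = clean_tokens("The cat sat. The cat ran!! The dog, darn it, sat 😀")
needed = words_needed_for_coverage(tokens, 0.5)
print(needed, "of", len(set(tokens)), "unique words cover 50% of the tokens")
```

Running the coverage function at several targets (for example 0.5 and 0.9) on the real sample would show how much vocabulary the model can drop with little loss.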

5. Goals for the Prediction Algorithm & Shiny App

Our strategy for creating the predictive tool is as follows:

The Algorithm: We will use an “N-gram” model with a “Back-off” strategy. The model first looks for the longest sequence of preceding words it has seen; if a three-word sequence isn’t found, it falls back to a two-word sequence.

The Shiny App: The app will feature a simple text box. As the user types, the top 3 most likely next words will be displayed instantly.

Performance: The model will be optimized to run quickly on mobile devices with low memory usage.
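The back-off lookup and the top-3 suggestion list described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the final implementation: the class name `BackoffModel` is hypothetical, the training sentences are invented, and the last fallback to overall word frequency (when no context matches at all) is an addition that the plan above does not specify. The sketch also uses raw counts from the first matching level only, with no smoothing or score discounting.

```python
from collections import Counter, defaultdict


class BackoffModel:
    """N-gram next-word predictor with simple back-off (illustrative sketch)."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        # tables[k] maps a context of k preceding words to counts of the next word.
        self.tables = {k: defaultdict(Counter) for k in range(max_order)}

    def train(self, sentences):
        for sentence in sentences:
            tokens = sentence.lower().split()
            for i, word in enumerate(tokens):
                # Record the word under every context length available at position i.
                for k in range(min(i, self.max_order - 1) + 1):
                    context = tuple(tokens[i - k:i])
                    self.tables[k][context][word] += 1

    def predict(self, text, n=3):
        """Return up to n candidate next words, trying the longest context first."""
        tokens = text.lower().split()
        for k in range(min(len(tokens), self.max_order - 1), -1, -1):
            context = tuple(tokens[len(tokens) - k:])
            followers = self.tables[k].get(context)
            if followers:
                return [word for word, _ in followers.most_common(n)]
        return []


# Toy training data, invented for this example.
model = BackoffModel(max_order=3)
model.train([
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
    "i love the weather",
])

print(model.predict("thanks for the"))  # two-word context "for the" is known
print(model.predict("i hate the"))      # "hate the" is unseen, so back off to "the"
```

In the app, `predict` would be called on the current contents of the text box each time it changes. Storing the tables as plain dictionaries keeps lookups fast; to limit memory, rarely seen n-grams could be dropped before the tables are saved, though the right cutoff would need to be measured.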

This report was created as a milestone for the Data Science Capstone project.