Data Science Capstone: Milestone Report

Executive Summary

This report provides an exploratory analysis of the digital text data provided for our predictive text project. The goal of this project is to build a smart keyboard algorithm that predicts the next word a user wants to type, similar to technologies found on modern smartphones.

We loaded and analyzed three large datasets containing text from blogs, news articles, and Twitter. This report describes the basic structure of the data, highlights key patterns in word usage, and outlines our strategy for building the final predictive application.

1. Data Loading and Basic Summary Statistics

We imported the three English text files (en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt). Because these files are large, we recorded their core metrics (file size, line count, and total word count) to understand the scale of the data we are working with.

Table 1: Summary Statistics of Text Datasets
File_Source	File_Size_MB	Line_Count	Word_Count
Blogs	200.42	899,288	37,546,806
News	196.28	1,010,206	34,761,151
Twitter	159.36	2,360,148	30,096,690

Key Takeaways from Table 1:

Twitter gives us the highest volume of short, conversational text (over 2.3 million lines).
Blogs contain fewer lines but feature longer, more descriptive compositions, resulting in the highest overall word count (over 37 million words).
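
The metrics in Table 1 can be reproduced with a few lines of code. The following is a minimal sketch in Python, used here purely for illustration; it is not the code that produced the table. The function names `summarize_file` and `words_per_line` are hypothetical, and a "word" is assumed to be any whitespace-delimited token, which may differ slightly from the counting rule behind Table 1.

```python
import os


def summarize_file(path, encoding="utf-8"):
    """Return file size in MB, line count, and whitespace-delimited word count.

    Assumption: a "word" is any run of non-whitespace characters.
    """
    size_mb = os.path.getsize(path) / (1024 ** 2)
    line_count = 0
    word_count = 0
    # Stream the file line by line so it never has to fit in memory at once.
    with open(path, encoding=encoding, errors="ignore") as handle:
        for line in handle:
            line_count += 1
            word_count += len(line.split())
    return {"size_mb": round(size_mb, 2), "lines": line_count, "words": word_count}


def words_per_line(words, lines):
    """Average number of words per line, a rough measure of entry length."""
    return words / lines
```

Applying `words_per_line` to the figures in Table 1 gives roughly 42 words per line for Blogs, 34 for News, and 13 for Twitter, which is consistent with the takeaways above.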

2. Exploratory Findings: Word Frequencies

To build a predictor, we need to know which words and word combinations appear most frequently. Because processing millions of lines is memory-intensive, we took a random 1% sample of the data to uncover the major linguistic patterns.
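
One simple way to draw such a sample is to keep each line independently with probability 0.01. The sketch below, in Python for illustration only, assumes this per-line approach; the report does not state the exact sampling method, and the function name `sample_lines` and the seed value are hypothetical.

```python
import random


def sample_lines(lines, fraction=0.01, seed=1234):
    """Keep each line independently with probability `fraction`.

    `lines` can be any iterable of strings, including an open file handle,
    so the full dataset never needs to be held in memory.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [line for line in lines if rng.random() < fraction]
```

Because each line is kept or dropped independently, the sample size is only approximately 1% of the input, but the difference is negligible for files with hundreds of thousands of lines.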

We cleaned the text by removing punctuation and numbers and converting everything to lowercase. We then analyzed unigrams (single words) and bigrams (two-word combinations).
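
The cleaning and counting steps can be sketched as follows. This is an illustrative Python version, not the code used for the analysis. The helper names (`tokenize`, `ngrams`, `count_ngrams`) are hypothetical, and the cleaning rule is an assumption: every character other than the letters a to z and whitespace is deleted, which also drops accented letters.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase the line, delete punctuation and digits, split on whitespace.

    Assumption: anything that is not a-z or whitespace is removed outright,
    so "don't" becomes "dont" and "42" disappears.
    """
    cleaned = re.sub(r"[^a-z\s]", "", line.lower())
    return cleaned.split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Count n-grams line by line, so no n-gram spans two separate lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenize(line), n))
    return counts
```

With these helpers, `count_ngrams(lines, 1).most_common(15)` and `count_ngrams(lines, 2).most_common(15)` would produce the kind of top-15 lists summarized below.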

Top 15 Most Common Single Words

As expected, common function words (stop words) such as “the”, “to”, and “and” dominate English text.

Top 15 Most Common Two-Word Phrases (Bigrams)

Analyzing phrases gives us a clearer picture of how words link together naturally (e.g., “of the”, “in the”, “to the”).

3. Plan for the Prediction Algorithm and Shiny App

Based on our exploratory findings, we have a clear path forward for creating the final data product:

The Prediction Algorithm (N-gram Model)

N-gram Database: We will build frequency tables of single words, two-word phrases (Bigrams), and three-word phrases (Trigrams).
Back-off Strategy: If a user types two words, the algorithm will look at our Trigram data first to predict the 3rd word. If it cannot find a match, it will “back off” to the Bigram data (looking only at the last word typed) to make the best guess.
Data Optimization: To ensure the app runs fast on mobile devices or web browsers, we will filter out words or combinations that appear only once. We expect this to reduce file sizes by up to 70% without a noticeable loss of accuracy.
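
The three steps above can be sketched together as follows. This is a minimal Python illustration under stated assumptions, not the planned implementation. All function names are hypothetical, the input is assumed to be lists of already cleaned tokens, and the final fallback to the most common single words when no bigram matches is an assumption that goes one step beyond the back-off described above.

```python
from collections import Counter, defaultdict


def build_tables(token_lists, max_n=3):
    """For each n, map a context (the n-1 preceding words) to next-word counts."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])  # empty tuple for unigrams
                next_word = tokens[i + n - 1]
                tables[n][context][next_word] += 1
    return tables


def prune_singletons(tables, min_count=2):
    """Drop every word or n-gram seen fewer than `min_count` times."""
    pruned = {}
    for n, table in tables.items():
        pruned[n] = {}
        for context, followers in table.items():
            frequent = Counter(
                {word: count for word, count in followers.items() if count >= min_count}
            )
            if frequent:
                pruned[n][context] = frequent
    return pruned


def predict_next(tables, typed_words, k=3):
    """Try the longest context first, then back off to shorter ones."""
    for n in range(max(tables), 0, -1):
        # Unigrams (n == 1) use the empty context; otherwise take the last n-1 words.
        context = tuple(typed_words[-(n - 1):]) if n > 1 else ()
        followers = tables[n].get(context)
        if followers:
            return [word for word, _ in followers.most_common(k)]
    return []
```

For example, after building and pruning the tables, `predict_next(tables, ["i", "love"])` first looks up the trigram context `("i", "love")` and only consults the bigram context `("love",)` if that lookup finds nothing.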

The Shiny Application

We will build a simple, clean interactive web interface using R Shiny:

* Input text box: A space where the user can type any sentence.
* Real-time prediction buttons: The app will instantly display the top 3 predicted next words below the text box, exactly like a smartphone keyboard.