This report outlines the exploratory data analysis (EDA) of the SwiftKey natural language dataset. As a developer, I aim to understand the statistical distribution of words and phrases in order to build a memory-efficient predictive text engine. This document demonstrates successful data ingestion, cleaning, and initial n-gram modeling.
The project uses three large text corpora: blog posts, news articles, and Twitter posts. Below is a summary of the raw data dimensions. Note that the files contain over 100 million words in total, so a sampling strategy is required to keep the analysis tractable and the resulting model suitable for mobile-targeted applications.
| Source | Lines | Words | File Size (MB) |
|---|---|---|---|
| Blogs | 899,288 | 37,334,131 | 200 |
| News | 1,010,242 | 34,372,533 | 196 |
| Twitter | 2,360,148 | 30,373,583 | 159 |
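The summary figures above and a working sample can be reproduced with a single pass over each file. The sketch below is a minimal illustration: the file paths and the 1% sample rate are assumptions about my local setup, not fixed values from this report.

```python
import random

# Assumed local paths to the three SwiftKey corpora (adjust to your environment).
FILES = {
    "Blogs":   "final/en_US/en_US.blogs.txt",
    "News":    "final/en_US/en_US.news.txt",
    "Twitter": "final/en_US/en_US.twitter.txt",
}

SAMPLE_RATE = 0.01  # keep roughly 1% of lines; an assumed rate, not the report's exact figure


def summarize_and_sample(path, rate=SAMPLE_RATE, seed=42):
    """Count lines and words in one corpus file and return a random sample of lines."""
    rng = random.Random(seed)
    n_lines = n_words = 0
    sample = []
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            n_lines += 1
            n_words += len(line.split())
            if rng.random() < rate:
                sample.append(line.strip())
    return n_lines, n_words, sample


if __name__ == "__main__":
    for name, path in FILES.items():
        lines, words, sample = summarize_and_sample(path)
        print(f"{name}: {lines:,} lines, {words:,} words, {len(sample):,} lines sampled")
```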
To prepare the data for modeling, I implemented a robust cleaning pipeline. The data contains significant “noise” (URLs, emojis, profanity) that must be filtered to create a “Safe for Work” (SFW) predictive model.
Cleaning Steps Performed:
- Removed URLs and other web addresses.
- Stripped emojis and other non-ASCII characters.
- Filtered profanity so the resulting model is safe for work.
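The sketch below shows one way these steps could be implemented. The regular expressions, the lowercasing step, and the placeholder profanity list are assumptions; a real build would load a published banned-word list.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")  # catches emojis and other non-ASCII symbols

# Placeholder profanity list; in practice this would be loaded from a published word list.
PROFANITY = {"badword1", "badword2"}


def clean_line(line):
    """Apply the SFW cleaning steps to one raw line and return a list of tokens."""
    text = URL_RE.sub(" ", line)            # remove URLs
    text = NON_ASCII_RE.sub(" ", text)      # strip emojis / non-ASCII characters
    text = text.lower()                     # lowercase (assumed: the model is case-insensitive)
    tokens = re.findall(r"[a-z']+", text)   # keep alphabetic tokens and apostrophes
    return [t for t in tokens if t not in PROFANITY]  # drop profanity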
The core of the prediction engine is the word frequency distribution. My analysis confirms Zipf’s Law: a small number of distinct words accounts for the majority of all word occurrences in the corpus.
The chart below displays the most common words found in the combined sample. As expected, “stop words” like “the”, “to”, and “and” dominate the frequency counts.
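The counts behind that chart can be reproduced with a simple frequency pass over the cleaned sample. The sketch below assumes `sample_lines` holds the sampled raw lines and reuses the `clean_line` helper from the cleaning sketch above.

```python
from collections import Counter


def word_frequencies(token_lines):
    """Aggregate unigram frequencies across an iterable of token lists."""
    counts = Counter()
    for tokens in token_lines:
        counts.update(tokens)
    return counts


# Example (sample_lines is an assumed variable holding the sampled raw lines):
# counts = word_frequencies(clean_line(line) for line in sample_lines)
# print(counts.most_common(10))  # stop words such as "the", "to", and "and" should lead
```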
One of the most interesting findings is that roughly 5,000 unique words are enough to cover nearly 90% of all word instances in the corpus. For a developer, this is critical because it allows us to prune our dictionary significantly, ensuring the final app is lightweight and fast.
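A minimal sketch of that coverage calculation is shown below; it reuses the `counts` Counter from the frequency sketch above, and the 90% target is the figure quoted in this report.

```python
def coverage_vocabulary(counts, target=0.90):
    """Return how many of the most frequent words are needed to cover `target`
    of all word instances, given a Counter of unigram frequencies."""
    total = sum(counts.values())
    running = 0
    for rank, (_word, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running / total >= target:
            return rank
    return len(counts)


# Example: print(coverage_vocabulary(counts, 0.90))
```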
Moving forward, I will develop the prediction logic: building n-gram frequency tables from the cleaned sample, pruning the vocabulary to the high-coverage words identified above, and keeping the final model small enough for a lightweight, fast mobile app.
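As a first illustration of where that logic is headed, the sketch below shows a simple trigram-table lookup for next-word prediction. The function names (`build_trigram_model`, `predict_next`) are assumed for this example, and it is not the final engine; a production engine would typically also back off to shorter n-grams for phrases it has never seen.

```python
from collections import Counter, defaultdict


def build_trigram_model(token_lines):
    """Build a lookup from each bigram (w1, w2) to a Counter of words that follow it."""
    model = defaultdict(Counter)
    for tokens in token_lines:
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            model[(w1, w2)][w3] += 1
    return model


def predict_next(model, w1, w2, k=3):
    """Return the k most frequent words observed after the bigram (w1, w2)."""
    followers = model.get((w1, w2))
    if not followers:
        return []  # a fuller engine would back off to bigram/unigram counts here
    return [word for word, _ in followers.most_common(k)]


# Example (reusing clean_line and sample_lines from the earlier sketches):
# model = build_trigram_model(clean_line(line) for line in sample_lines)
# print(predict_next(model, "i", "am"))
```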