This report outlines the exploratory data analysis (EDA) of the SwiftKey natural language dataset. As a developer, I aim to understand the statistical distribution of words and phrases in order to build a memory-efficient predictive text engine. This document demonstrates successful data ingestion, cleaning, and initial n-gram modeling.
The project uses three large text corpora: blog posts, news articles, and Twitter posts. Below is a summary of the raw data dimensions. Note that the files contain over 100 million words in total, so a sampling strategy is required to keep the analysis tractable and the resulting model suitable for mobile-targeted applications.
| Source | Lines | Words | File Size (MB) |
|---|---|---|---|
| Blogs | 899,288 | 37,334,131 | 200 |
| News | 1,010,242 | 34,372,533 | 196 |
| Twitter | 2,360,148 | 30,373,583 | 159 |
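The summary figures above and a working sample can be reproduced with a single pass over each file. The sketch below is a minimal illustration: the file paths and the 1% sample rate are assumptions about my local setup, not fixed values from this report.

```python
import random

# Assumed local paths to the three SwiftKey corpora (adjust to your environment).
FILES = {
    "Blogs":   "final/en_US/en_US.blogs.txt",
    "News":    "final/en_US/en_US.news.txt",
    "Twitter": "final/en_US/en_US.twitter.txt",
}

SAMPLE_RATE = 0.01  # keep roughly 1% of lines; an assumed rate, not the report's exact figure


def summarize_and_sample(path, rate=SAMPLE_RATE, seed=42):
    """Count lines and words in one corpus file and return a random sample of lines."""
    rng = random.Random(seed)
    n_lines = n_words = 0
    sample = []
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            n_lines += 1
            n_words += len(line.split())
            if rng.random() < rate:
                sample.append(line.strip())
    return n_lines, n_words, sample


if __name__ == "__main__":
    for name, path in FILES.items():
        lines, words, sample = summarize_and_sample(path)
        print(f"{name}: {lines:,} lines, {words:,} words, {len(sample):,} lines sampled")
```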
To prepare the data for modeling, I implemented a robust cleaning pipeline. The data contains significant “noise” (URLs, emojis, profanity) that must be filtered to create a “Safe for Work” (SFW) predictive model.
Cleaning Steps Performed:
- Removed URLs and other web addresses.
- Stripped emojis and other non-ASCII characters.
- Filtered profanity so the resulting model is safe for work.
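The sketch below shows one way these steps could be implemented. The regular expressions, the lowercasing step, and the placeholder profanity list are assumptions; a real build would load a published banned-word list.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")  # catches emojis and other non-ASCII symbols

# Placeholder profanity list; in practice this would be loaded from a published word list.
PROFANITY = {"badword1", "badword2"}


def clean_line(line):
    """Apply the SFW cleaning steps to one raw line and return a list of tokens."""
    text = URL_RE.sub(" ", line)            # remove URLs
    text = NON_ASCII_RE.sub(" ", text)      # strip emojis / non-ASCII characters
    text = text.lower()                     # lowercase (assumed: the model is case-insensitive)
    tokens = re.findall(r"[a-z']+", text)   # keep alphabetic tokens and apostrophes
    return [t for t in tokens if t not in PROFANITY]  # drop profanity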
The core of the prediction engine is the word frequency distribution. My analysis confirms Zipf’s Law: a small number of distinct words accounts for the majority of all word occurrences in the corpus.
The chart below displays the most common words found in the combined sample. As expected, “stop words” like “the”, “to”, and “and” dominate the frequency counts.
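The counts behind that chart can be reproduced with a simple frequency pass over the cleaned sample. The sketch below assumes `sample_lines` holds the sampled raw lines and reuses the `clean_line` helper from the cleaning sketch above.

```python
from collections import Counter


def word_frequencies(token_lines):
    """Aggregate unigram frequencies across an iterable of token lists."""
    counts = Counter()
    for tokens in token_lines:
        counts.update(tokens)
    return counts


# Example (sample_lines is an assumed variable holding the sampled raw lines):
# counts = word_frequencies(clean_line(line) for line in sample_lines)
# print(counts.most_common(10))  # stop words such as "the", "to", and "and" should lead
```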
One of the most interesting findings is that roughly 5,000 unique words are enough to cover nearly 90% of all word instances in the corpus. For a developer, this is critical because it allows us to prune our dictionary significantly, ensuring the final app is lightweight and fast.
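A minimal sketch of that coverage calculation is shown below; it reuses the `counts` Counter from the frequency sketch above, and the 90% target is the figure quoted in this report.

```python
def coverage_vocabulary(counts, target=0.90):
    """Return how many of the most frequent words are needed to cover `target`
    of all word instances, given a Counter of unigram frequencies."""
    total = sum(counts.values())
    running = 0
    for rank, (_word, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running / total >= target:
            return rank
    return len(counts)


# Example: print(coverage_vocabulary(counts, 0.90))
```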
Moving forward, I will develop the prediction logic: building n-gram frequency tables from the cleaned sample, pruning the vocabulary to the high-coverage words identified above, and keeping the final model small enough for a lightweight, fast mobile app.
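As a first illustration of where that logic is headed, the sketch below shows a simple trigram-table lookup for next-word prediction. The function names (`build_trigram_model`, `predict_next`) are assumed for this example, and it is not the final engine; a production engine would typically also back off to shorter n-grams for phrases it has never seen.

```python
from collections import Counter, defaultdict


def build_trigram_model(token_lines):
    """Build a lookup from each bigram (w1, w2) to a Counter of words that follow it."""
    model = defaultdict(Counter)
    for tokens in token_lines:
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            model[(w1, w2)][w3] += 1
    return model


def predict_next(model, w1, w2, k=3):
    """Return the k most frequent words observed after the bigram (w1, w2)."""
    followers = model.get((w1, w2))
    if not followers:
        return []  # a fuller engine would back off to bigram/unigram counts here
    return [word for word, _ in followers.most_common(k)]


# Example (reusing clean_line and sample_lines from the earlier sketches):
# model = build_trigram_model(clean_line(line) for line in sample_lines)
# print(predict_next(model, "i", "am"))
```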