Data Strategy Team
2026-06-09
The Challenge: Previous clustering attempts failed because physical dimensions (weight, volume) and logistics failures drowned out true product quality. A 1-star review due to a 30-day shipping delay was incorrectly penalizing the product itself.
The Solution: We engineered a pipeline that isolates product dissatisfaction from logistics failures by evaluating only reviews for orders that were delivered on time or early.
The Result: A clean, actionable segmentation separating our premium products from our high-defect liabilities.
To find the actual customer experience, we engineered two core metrics:
Note: All physical product dimensions and shipping dates were dropped from the model once the logistics delay had been calculated, so that they could not distort the distance matrix used for clustering.
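The filtering and aggregation described above can be sketched as follows. This is a minimal illustration, not the team's pipeline: the column names (`product_id`, `review_score`, `estimated_delivery`, `actual_delivery`, `weight_g`), the sample rows, and the defect definition (a 1-star or 2-star review on an on-time order) are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical order-level data; every column name here is an assumption.
orders = pd.DataFrame({
    "product_id": ["A", "A", "A", "B", "B", "B"],
    "review_score": [5, 4, 1, 2, 1, 5],
    "estimated_delivery": pd.to_datetime(
        ["2026-01-10", "2026-01-12", "2026-01-15",
         "2026-01-10", "2026-01-11", "2026-01-20"]),
    "actual_delivery": pd.to_datetime(
        ["2026-01-09", "2026-01-12", "2026-02-14",
         "2026-01-08", "2026-01-11", "2026-01-19"]),
    "weight_g": [500, 500, 500, 1200, 1200, 1200],
})


def build_quality_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-product quality metrics from on-time orders only."""
    # Logistics delay in days; a positive value means the order arrived late.
    delay_days = (df["actual_delivery"] - df["estimated_delivery"]).dt.days
    # Keep only orders delivered on time or early.
    on_time = df.loc[delay_days <= 0].copy()
    # Assumed defect definition: a 1- or 2-star review on an on-time order.
    on_time["is_defect"] = on_time["review_score"] <= 2
    # Dimensions and dates are not carried into the output, so they cannot
    # influence any distance computed from it.
    return on_time.groupby("product_id").agg(
        avg_score=("review_score", "mean"),
        defect_rate=("is_defect", "mean"),
        n_reviews=("review_score", "size"),
    )


metrics = build_quality_metrics(orders)
```

In this toy data, the 1-star review for product A arrived 30 days late, so it is excluded before the average is taken.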
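We first run Hierarchical Clustering (Ward’s Method) using our isolated quality metrics. Because bounded metrics (like 1 to 5 stars) often form a continuous density cloud rather than well-separated groups, the algorithm mathematically “slices” the data: the cluster boundaries fall wherever the variance criterion puts them, not at thresholds that carry business meaning.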
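A minimal sketch of Ward clustering on two quality metrics, using SciPy. The data is synthetic, and the choice of three clusters and of z-score scaling are assumptions for illustration; the report does not state either.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Synthetic per-product metrics: columns are [avg_score, defect_rate].
avg_score = rng.uniform(1.0, 5.0, size=60)
defect_rate = np.clip(0.6 - 0.12 * avg_score + rng.normal(0, 0.03, 60), 0, 1)
X = np.column_stack([avg_score, defect_rate])

# Standardize so the 1-5 score scale does not dominate the 0-1 rate.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Ward linkage merges, at each step, the pair of clusters whose union
# gives the smallest increase in within-cluster variance.
Z = linkage(X_scaled, method="ward")

# Cut the tree into at most three flat clusters (k=3 is an assumption).
labels = fcluster(Z, t=3, criterion="maxclust")
```

Because the synthetic points lie along one continuous band, the resulting labels split that band into consecutive pieces, which is the "slicing" behavior described above.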
To make these insights highly actionable for Procurement and Quality Assurance teams, we pivot to a Business Logic Matrix. This definitively groups products based on acceptable risk thresholds.
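A rule-based matrix of this kind might look like the sketch below. The threshold values and the segment labels (`premium`, `watchlist`, `high_defect`) are hypothetical placeholders; the report does not list the actual cut-offs or names.

```python
def assign_segment(
    avg_score: float,
    defect_rate: float,
    min_premium_score: float = 4.0,      # hypothetical threshold
    max_premium_defect: float = 0.10,    # hypothetical threshold
    max_liability_score: float = 3.0,    # hypothetical threshold
    min_liability_defect: float = 0.30,  # hypothetical threshold
) -> str:
    """Map a product's quality metrics to a segment using fixed rules."""
    # High score and low defect rate together qualify as premium.
    if avg_score >= min_premium_score and defect_rate <= max_premium_defect:
        return "premium"
    # Either a low score or a high defect rate is enough to flag a liability.
    if avg_score < max_liability_score or defect_rate >= min_liability_defect:
        return "high_defect"
    # Everything in between is kept under observation.
    return "watchlist"
```

Unlike the clustering step, every boundary here is an explicit number that Procurement or Quality Assurance can review and adjust.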
Based on our strictly isolated quality segmentation, here is the breakdown of our active product catalog:
| Segment | Total_Products | Avg_Score | Avg_Defect_Rate |
|---|---|---|---|
| | 207 | 4.58 | 5.1% |
| | 178 | 3.53 | 17.8% |
| | 115 | 2.41 | 45.2% |
Moving from Reactive to Proactive: While our current clustering model tells us which products have historically failed, our next initiative uses Supervised Machine Learning to predict low-quality customer experiences (1-star or 2-star reviews) before they happen.
Our Modeling Strategy features 3 Tiers:
1. Model A (Pre-Shipment Risk): A Logistic Regression model estimating risk purely based on listing attributes (price, weight, freight cost, description length, and photos).
2. Model B (Full Customer Experience): Adds real-world delivery outcomes (total delivery time, delay in days) to assess the complete journey.
3. Model C (Non-linear Challenger): A powerful Random Forest algorithm designed to capture complex interactions between all variables.
By training on 80% of our historical data and evaluating on a 20% holdout set, we can measure predictive power on orders the models have never seen. We anticipate strong results.
The Random Forest and Logistic Regression models will output feature importances and odds ratios, respectively, answering: What exactly drives a negative review?
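The three tiers, the 80/20 split, and the two kinds of explanatory output can be sketched with scikit-learn as follows. The data is synthetic and the feature names are stand-ins for the attributes listed above, so the scores it produces say nothing about how the real models will perform. ROC AUC is used as the evaluation metric here as an assumption; the report does not name one.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2000
listing_cols = ["price", "weight", "freight_cost", "description_length", "photos"]
delivery_cols = ["delivery_days", "delay_days"]
feature_names = listing_cols + delivery_cols

# Synthetic features and target (1 = 1-star or 2-star review).
X = rng.normal(size=(n, len(feature_names)))
logit = -1.0 + 0.4 * X[:, 2] - 0.3 * X[:, 4] + 1.2 * X[:, 6]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# 80% training data, 20% holdout, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

n_listing = len(listing_cols)
models = {
    # Model A: listing attributes only.
    "A": (make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
          slice(0, n_listing)),
    # Model B: listing attributes plus delivery outcomes.
    "B": (make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
          slice(None)),
    # Model C: non-linear challenger on all features.
    "C": (RandomForestClassifier(n_estimators=200, random_state=42),
          slice(None)),
}

auc = {}
for name, (model, cols) in models.items():
    model.fit(X_train[:, cols], y_train)
    proba = model.predict_proba(X_test[:, cols])[:, 1]
    auc[name] = roc_auc_score(y_test, proba)

# Odds ratios from Model B: exp(coefficient), per one standard deviation
# of each feature because the inputs were standardized.
coefs = models["B"][0].named_steps["logisticregression"].coef_[0]
odds_ratios = dict(zip(feature_names, np.exp(coefs)))

# Impurity-based feature importances from Model C (they sum to 1).
importances = dict(zip(feature_names, models["C"][0].feature_importances_))
```

An odds ratio above 1 means the feature raises the odds of a negative review as it increases, and below 1 means it lowers them. Random Forest importances rank features by contribution but carry no direction.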
How we will use this: Every new order will be scored in real time. Orders flagged as high-risk will trigger automated workflows, such as proactive customer service reach-outs or expedited shipping, saving the customer relationship before it sours.
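The routing step can be sketched as a simple threshold rule applied to a model's predicted probability. The function name, the action labels, and both thresholds are hypothetical; in practice the cut-offs would be tuned against the cost of each intervention.

```python
def route_order(
    risk_probability: float,
    outreach_threshold: float = 0.5,  # hypothetical cut-off
    expedite_threshold: float = 0.8,  # hypothetical cut-off
) -> str:
    """Choose an automated workflow from a predicted negative-review risk."""
    if not 0.0 <= risk_probability <= 1.0:
        raise ValueError("risk_probability must be between 0 and 1")
    # Highest-risk orders get the most costly intervention.
    if risk_probability >= expedite_threshold:
        return "expedite_shipping"
    # Moderately risky orders get a proactive customer service contact.
    if risk_probability >= outreach_threshold:
        return "customer_service_outreach"
    return "no_action"
```

For example, `route_order(0.65)` falls between the two assumed thresholds and returns `"customer_service_outreach"`.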