2025-04-21
Step 1: Engineered 20+ Features
Focused on sales patterns, seasonality, product complexity, and channel mix, while strictly excluding any variables used to calculate Profit.
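A few of these features can be sketched in base R as follows. This is only an illustration: the raw column names (direct_sales, total_sales, web_sales, mobile_sales, store_sales, components, weight) and the toy values are assumptions, since the input schema is not shown here. The reading of high_season as "Q4 is strictly the largest quarter" is also an assumption.

```r
# Hypothetical raw data; column names and values are illustrative only.
sales <- data.frame(
  direct_sales = c(120, 0, 45),
  total_sales  = c(200, 80, 45),
  web_sales    = c(150, 20, 15),
  mobile_sales = c(30, 50, 30),
  store_sales  = c(20, 10, 0),
  Q1Sales      = c(40, 30, 5),
  Q2Sales      = c(50, 20, 10),
  Q3Sales      = c(30, 20, 10),
  Q4Sales      = c(80, 10, 20),
  components   = c(4, 1, 12),
  weight       = c(2.5, 0.3, 1.0)
)

channels <- sales[, c("web_sales", "mobile_sales", "store_sales")]

features <- data.frame(
  # Share of sales made directly
  direct_sales_ratio    = sales$direct_sales / sales$total_sales,
  # Components multiplied by weight
  product_complexity    = sales$components * sales$weight,
  # 1 if the channel accounts for more than 50% of sales
  web_sales_dominant    = as.integer(sales$web_sales / sales$total_sales > 0.5),
  mobile_sales_dominant = as.integer(sales$mobile_sales / sales$total_sales > 0.5),
  # Number of channels with any sales
  platform_diversity    = unname(rowSums(channels > 0)),
  # Assumed definition: Q4 is strictly larger than every other quarter
  high_season           = as.integer(
    sales$Q4Sales > pmax(sales$Q1Sales, sales$Q2Sales, sales$Q3Sales)
  )
)

print(features)
```

None of these columns touches Profit or its inputs, which is the constraint stated above. Rows with total_sales equal to zero would need separate handling before computing the ratios.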
Step 2: Trained Full Random Forest Model
Used all engineered features (after cleaning, imputing, encoding, and removing correlated and near-zero-variance features) to train a baseline random forest model.
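The filtering part of that preprocessing might look roughly like the base R sketch below. The functions actually used are not named here (caret's nearZeroVar() and findCorrelation() are common choices), so this is an approximation: it drops only strictly constant columns rather than near-zero-variance ones, and the 0.9 correlation cutoff and toy column names are assumptions.

```r
set.seed(1)

# Toy feature table with one constant column and one near-duplicate column.
x <- data.frame(a = rnorm(50), constant = rep(1, 50))
x$a_copy <- x$a * 2 + rnorm(50, sd = 0.01)
x$b <- rnorm(50)

# 1. Drop columns that take a single value (a simplified variance filter).
keep <- vapply(x, function(col) length(unique(col)) > 1, logical(1))
x <- x[, keep, drop = FALSE]

# 2. For each highly correlated pair, drop the later column.
cutoff <- 0.9                                # assumed threshold
cors <- abs(cor(x))
cors[lower.tri(cors, diag = TRUE)] <- 0      # look at each pair once
to_drop <- colnames(cors)[apply(cors, 2, function(col) any(col > cutoff))]
x_clean <- x[, setdiff(names(x), to_drop), drop = FALSE]

print(names(x_clean))
```

The pairwise rule is greedy, so the result depends on column order. A dedicated routine would usually weigh which member of a pair to remove.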
Step 3: Ranked Features by Variable Importance
Used vip::vip() and varImp() to extract and rank features based on their contribution to reducing prediction error in the full RF model.
Step 4: RMSE Filtering via Forward Search
Iteratively retrained models using top k variables (k = 1:10), recording the cross-validated RMSE at each step.
Selected k = 10 (the top 10 features), which gave the lowest cross-validated RMSE.
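The search loop can be sketched as below. To keep it runnable in base R, lm() stands in for the random forest, and the data, the feature names f1 to f10, and the 5-fold setup are all made up for illustration. In the real workflow the model fit inside cv_rmse() would be the random forest, and the ranked names would come from the importance ranking in Step 3.

```r
set.seed(42)

# Synthetic data: 10 features, assumed to be already ranked by importance.
n <- 200
ranked <- paste0("f", 1:10)
dat <- as.data.frame(matrix(rnorm(n * 10), n, 10, dimnames = list(NULL, ranked)))
dat$y <- 3 * dat$f1 + 2 * dat$f2 + dat$f3 + rnorm(n)

# Assign folds once so every k is evaluated on the same splits.
folds <- 5
fold_id <- sample(rep(seq_len(folds), length.out = n))

# Cross-validated RMSE for one feature subset, pooled over all held-out rows.
# lm() is a stand-in model, not the random forest described above.
cv_rmse <- function(features, data, fold_id, target = "y") {
  form <- reformulate(features, response = target)
  sq_err <- numeric(0)
  for (f in sort(unique(fold_id))) {
    fit  <- lm(form, data = data[fold_id != f, ])
    pred <- predict(fit, newdata = data[fold_id == f, ])
    sq_err <- c(sq_err, (data[[target]][fold_id == f] - pred)^2)
  }
  sqrt(mean(sq_err))
}

# Refit with the top k features for k = 1..10 and record the RMSE each time.
results <- data.frame(k = 1:10, rmse = NA_real_)
for (k in results$k) {
  results$rmse[k] <- cv_rmse(ranked[1:k], dat, fold_id)
}

best_k <- results$k[which.min(results$rmse)]
print(results)
print(best_k)
```

Reusing the same fold assignment for every k keeps the RMSE values comparable, so differences reflect the feature set rather than the split.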
Random forest model using only the top 10 ranked features:
direct_sales_ratio: proportion of direct sales
stock_orders_pct: percent of orders for stock
Q1Sales, Q4Sales: seasonal performance
is_variant: listing type
product_age: based on year min
product_complexity: components × weight
web_sales_dominant: is web > 50%
mobile_sales_dominant: is mobile > 50%
platform_diversity: number of active sales channels
high_season: whether Q4 dominates
Best RMSE: 1883.94
Final model: Random Forest with mtry = 2
Handles non-linearity and interaction effects well
Fast, interpretable, and robust to data quirks
All features available before product launch
Avoids hidden leakage from profit calculation
Flexible enough to adapt with new product lines
Try interaction terms (e.g., product_age * Q4Sales)
Explore stacking RF + XGBoost
Add product category metadata
Track model drift over time