Final Presentation

Charlie Wiebe

2025-04-21

Overview of My Process

Loaded and explored craven_train.rds
Created Profit as the target variable
Engineered 20+ features, avoiding leakage from the revenue/cost fields used to compute Profit
Filtered down to the 10 best features
Tested multiple models (linear regression, random forest, XGBoost, etc.)
Tuned and selected a random forest model, optimized via RMSE

Feature Engineering Philosophy

Focused on features that:
- Reflect real-world product & sales traits
- Avoid direct profit leakage
- Capture meaningful signal (seasonality, platform spread, product design)

Feature Optimization Process

Step 1: Engineered 20+ Features
Covered sales patterns, seasonality, product complexity, and channel mix, strictly excluding any variable used to calculate Profit.
Step 2: Trained Full Random Forest Model
Trained a baseline random forest on all engineered features, after cleaning, imputing, encoding, and removing highly correlated and near-zero-variance (nzv) features.
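
The filtering part of Step 2 can be sketched in base R. This is a simplified stand-in, not the deck's actual preprocessing: the column names, the 95% "most common value" rule, and the 0.9 correlation cutoff are all assumptions chosen for illustration.

```r
set.seed(1)
n <- 100

# Toy feature table; all column names are hypothetical.
base_sales <- rnorm(n)
x <- data.frame(
  sales_a      = base_sales,
  sales_a_copy = base_sales + rnorm(n, sd = 0.01),  # near-duplicate of sales_a
  rare_flag    = c(rep(0, n - 1), 1),               # almost constant
  weight       = rnorm(n)
)

# Near-zero-variance rule (assumed): drop a column when its most common
# value covers more than 95% of the rows.
top_share <- function(v) max(table(v)) / length(v)
nzv_cols  <- names(x)[vapply(x, top_share, numeric(1)) > 0.95]
x         <- x[, setdiff(names(x), nzv_cols), drop = FALSE]

# Greedy correlation filter (assumed cutoff 0.9): keep a column only if it is
# not highly correlated with any column already kept.
keep <- character(0)
for (col in names(x)) {
  if (length(keep) == 0 ||
      all(abs(cor(x[[col]], x[, keep, drop = FALSE])) <= 0.9)) {
    keep <- c(keep, col)
  }
}
x_clean <- x[, keep, drop = FALSE]

print(nzv_cols)
print(names(x_clean))
```

Packaged helpers for both filters exist in R modelling libraries; the loop here only makes the logic visible.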

Step 3: Ranked Features by Variable Importance
Used vip::vip() and varImp() to extract and rank features based on their contribution to reducing prediction error in the full RF model.
Step 4: RMSE Filtering via Forward Search
Iteratively retrained the model on the top k ranked variables (k = 1, ..., 10), recording the cross-validated RMSE at each step.
Selected k = 10, the value with the lowest cross-validated RMSE.
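
The top-k search in Step 4 can be sketched as follows. To keep the sketch dependency-free it swaps in lm() for the random forest and uses synthetic data; the ranking vector, the fold count, and the column names x1 to x4 are assumptions, not the deck's actual setup.

```r
set.seed(42)

# Synthetic stand-in data: Profit depends on x1 and x2; x3 and x4 are pure noise.
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$Profit <- 3 * dat$x1 + 2 * dat$x2 + rnorm(n, sd = 0.5)

# Assumed importance ranking, most important first.
ranked <- c("x1", "x2", "x3", "x4")

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Assign each row to one of 5 folds once, so every k is scored on the same splits.
folds   <- 5
fold_id <- sample(rep(seq_len(folds), length.out = n))

# Mean held-out RMSE across folds for a given feature subset.
cv_rmse <- function(features) {
  form   <- reformulate(features, response = "Profit")
  scores <- vapply(seq_len(folds), function(f) {
    train <- dat[fold_id != f, ]
    test  <- dat[fold_id == f, ]
    fit   <- lm(form, data = train)  # stand-in learner; a random forest would go here
    rmse(test$Profit, predict(fit, newdata = test))
  }, numeric(1))
  mean(scores)
}

# Score the top 1, top 2, ... features and pick the k with the lowest RMSE.
results      <- data.frame(k = seq_along(ranked))
results$rmse <- vapply(results$k, function(k) cv_rmse(ranked[seq_len(k)]), numeric(1))
best_k       <- results$k[which.min(results$rmse)]
print(results)
```

Fixing the fold assignment before the loop means differences in RMSE between values of k come from the feature set rather than from a different random split.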

Step 5: Final Model with Top 10
Retrained the final random forest model using only the top 10 ranked features.
This model outperformed the linear baseline and maintained strong generalization.

Final Feature Selection

- direct_sales_ratio: proportion of direct sales
- stock_orders_pct: percent of orders for stock
- Q1Sales, Q4Sales: seasonal performance
- is_variant: listing type
- product_age: based on year min
- product_complexity: components × weight
- web_sales_dominant: is web > 50%
- mobile_sales_dominant: is mobile > 50%
- platform_diversity: number of active sales channels
- high_season: whether Q4 dominates
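
A few of the features above can be sketched in base R. The raw column names (direct_sales, web_sales, mobile_sales, Q2Sales, Q3Sales, n_components, weight) are hypothetical and not taken from craven_train.rds, and "Q4 dominates" is read here as "Q4 is strictly the largest quarter", which is an assumption.

```r
# Toy data; every raw column name here is hypothetical.
sales <- data.frame(
  direct_sales = c(40, 10, 0),
  web_sales    = c(50, 80, 0),
  mobile_sales = c(10, 10, 100),
  Q1Sales      = c(20, 40, 10),
  Q2Sales      = c(20, 20, 10),
  Q3Sales      = c(20, 20, 10),
  Q4Sales      = c(40, 20, 70),
  n_components = c(3, 1, 5),
  weight       = c(2.0, 0.5, 1.0)
)

channels <- c("direct_sales", "web_sales", "mobile_sales")
total    <- rowSums(sales[, channels])  # a zero total would give NaN ratios

features <- data.frame(
  direct_sales_ratio    = sales$direct_sales / total,
  web_sales_dominant    = as.integer(sales$web_sales / total > 0.5),
  mobile_sales_dominant = as.integer(sales$mobile_sales / total > 0.5),
  platform_diversity    = rowSums(sales[, channels] > 0),  # channels with any sales
  product_complexity    = sales$n_components * sales$weight,
  high_season           = as.integer(
    sales$Q4Sales > pmax(sales$Q1Sales, sales$Q2Sales, sales$Q3Sales)
  )
)
print(features)
```

None of these columns touch revenue or cost directly, which is the leakage constraint the deck describes.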

Model Comparison & RMSE

Random Forest Tuning

Final Model Performance

Best RMSE: 1883.94

Final model: Random Forest with mtry = 2

Handles non-linearity and interaction effects well
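
For reference, the RMSE reported above follows the standard definition, where $y_i$ is the actual Profit of product $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of evaluated products (these symbols are introduced here for illustration):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

RMSE is expressed in the same units as Profit, so 1883.94 reads as a typical prediction error in profit units, with large misses weighted more heavily than small ones.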

Variable Importance

Why This Model is Practical

Fast, interpretable, and robust to data quirks

All features available before product launch

Avoids hidden leakage from profit calculation

Flexible enough to adapt with new product lines

Potential Next Steps

Try interaction terms (e.g., product_age * Q4Sales)

Explore stacking RF + XGBoost

Add product category metadata

Track model drift over time
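
The interaction idea from the first next step can be sketched as below. The data are synthetic and the coefficients are made up; the example uses lm() only because the formula expansion is easy to inspect there.

```r
set.seed(7)
n <- 50

# Synthetic data with a built-in interaction effect of 0.3 (arbitrary choice).
toy <- data.frame(
  product_age = runif(n, 0, 10),
  Q4Sales     = runif(n, 0, 100)
)
toy$Profit <- 5 * toy$product_age + 0.5 * toy$Q4Sales +
  0.3 * toy$product_age * toy$Q4Sales + rnorm(n)

# In an R model formula, a * b expands to a + b + a:b,
# so both main effects and the interaction are estimated.
fit        <- lm(Profit ~ product_age * Q4Sales, data = toy)
coef_names <- names(coef(fit))
print(coef(fit))

# For a tree-based model, one option is to add the product as its own column.
toy$age_x_q4 <- toy$product_age * toy$Q4Sales
```

Random forests can already pick up interactions through successive splits, so an explicit product column may or may not reduce RMSE; it would need to go through the same cross-validated comparison as the other features.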