Final Presentation

Charlie Wiebe

2025-04-21

Overview of My Process

  1. Loaded and explored craven_train.rds
  2. Created Profit as the target variable
  3. Engineered 20+ features, excluding revenue and cost to avoid leakage
  4. Filtered down to the 10 best features
  5. Tested multiple models (linear, RF, XGBoost, etc.)
  6. Selected a random forest model and tuned it to minimize RMSE
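
A minimal sketch of steps 1 to 3, assuming Profit is revenue minus cost. The column names Revenue and Cost and the toy rows are hypothetical; the real columns in craven_train.rds may be named differently.

```r
# The real data would be loaded like this (file name from the slides):
# train <- readRDS("craven_train.rds")

# Toy stand-in so the sketch runs on its own; all values are made up.
train <- data.frame(
  Revenue = c(1200, 450, 980),
  Cost    = c(700, 500, 300),
  Q4Sales = c(40, 5, 22)
)

# Target variable (assumed definition: revenue minus cost)
train$Profit <- train$Revenue - train$Cost

# Drop the inputs to the Profit calculation so they cannot leak into the model
leak_cols  <- c("Revenue", "Cost")
model_data <- train[, setdiff(names(train), leak_cols), drop = FALSE]

str(model_data)
```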

Feature Engineering Philosophy

  • Focused on features that:
    • Reflect real-world product & sales traits
    • Avoid direct profit leakage
    • Capture meaningful signal (seasonality, platform spread, product design)

Feature Optimization Process

  • Step 1: Engineered 20+ Features
    Focused on sales patterns, seasonality, product complexity, and channel mix, and strictly excluded every variable used to calculate Profit.

  • Step 2: Trained Full Random Forest Model
    Trained a baseline random forest on all engineered features after cleaning, imputing, and encoding them and removing highly correlated and near-zero-variance (nzv) features.
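
The filtering part of Step 2 could look roughly like the sketch below. It uses simplified base-R stand-ins for the near-zero-variance and correlation filters (the thresholds 0.95 and 0.9 and all column names are assumptions, not values from the project).

```r
set.seed(1)
n <- 200

# Toy numeric feature table; column names are hypothetical.
feats <- data.frame(
  q4_sales      = rnorm(n, 100, 20),
  components    = rpois(n, 5),
  constant_flag = c(1, rep(0, n - 1))          # almost no variation
)
feats$q4_sales_copy <- feats$q4_sales + rnorm(n, 0, 0.1)  # near-duplicate column

# Near-zero-variance filter (simplified rule): drop a column when its most
# common value covers more than 95% of the rows.
is_nzv <- vapply(feats, function(x) max(table(x)) / length(x) > 0.95, logical(1))
feats  <- feats[, !is_nzv, drop = FALSE]

# Correlation filter: for each highly correlated pair, drop the later column.
cor_mat <- abs(cor(feats))
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- 0
drop_cor <- rownames(cor_mat)[apply(cor_mat, 1, max) > 0.9]
feats <- feats[, setdiff(names(feats), drop_cor), drop = FALSE]

names(feats)
```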

Feature Optimization Process

  • Step 3: Ranked Features by Variable Importance
    Used vip::vip() and caret::varImp() to rank features by how much each one reduces prediction error in the full RF model.

  • Step 4: RMSE Filtering via Forward Search
    Retrained the model on the top k ranked variables for each k from 1 to 10, recording the cross-validated RMSE at each step.
    RMSE was lowest at k = 10, so the top 10 features were kept.
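
Steps 3 and 4 can be sketched as below. This is an illustration under assumptions, not the project code: it calls the randomForest package directly, uses synthetic data with hypothetical feature names, and swaps in out-of-bag RMSE for the cross-validated RMSE to keep the example short.

```r
library(randomForest)

set.seed(42)
n <- 300

# Synthetic stand-in for the engineered feature table
X <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(X) <- paste0("feat_", 1:6)
y <- 3 * X$feat_1 - 2 * X$feat_2 + X$feat_3^2 + rnorm(n, sd = 0.5)

# Step 3: fit the full model and rank features by permutation importance
full_fit <- randomForest(x = X, y = y, ntree = 200, importance = TRUE)
imp      <- importance(full_fit, type = 1)   # type = 1 is %IncMSE
ranked   <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]

# Step 4: refit on the top k features and record the out-of-bag RMSE
oob_rmse <- sapply(seq_along(ranked), function(k) {
  fit <- randomForest(x = X[, ranked[1:k], drop = FALSE], y = y, ntree = 200)
  sqrt(fit$mse[fit$ntree])                   # OOB MSE after the last tree
})

best_k   <- which.min(oob_rmse)
selected <- ranked[1:best_k]
print(data.frame(k = seq_along(ranked), oob_rmse = oob_rmse))
```

Because the ranking comes from a single full model, this is a top-k sweep over a fixed ordering rather than a search that re-ranks features at every step.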

Feature Optimization Process

  • Step 5: Final Model with Top 10
    Retrained the final random forest model using only the top 10 ranked features.
    This model outperformed the linear baseline and maintained strong generalization.
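
One way to check a claim like this is a holdout comparison, sketched here on synthetic data. The variables x1 to x3, the 80/20 split, and the data-generating formula are all assumptions made for illustration; the scores it prints say nothing about the real data set.

```r
library(randomForest)

set.seed(7)
n <- 400

# Synthetic data with a non-linear signal and an interaction
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
d$Profit <- 10 * sin(2 * pi * d$x1) + 5 * d$x2 * d$x3 + rnorm(n)

# 80/20 train/test split
idx   <- sample(n, size = 0.8 * n)
train <- d[idx, ]
test  <- d[-idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

lm_fit <- lm(Profit ~ ., data = train)
rf_fit <- randomForest(Profit ~ ., data = train, ntree = 300)

# Held-out RMSE for both models
scores <- c(
  linear        = rmse(test$Profit, predict(lm_fit, test)),
  random_forest = rmse(test$Profit, predict(rf_fit, test))
)
print(scores)
```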

Final Feature Selection

  • direct_sales_ratio – proportion of direct sales
  • stock_orders_pct – percent of orders for stock
  • Q1Sales, Q4Sales – seasonal performance
  • is_variant – listing type
  • product_age – based on year min
  • product_complexity – components × weight
  • web_sales_dominant – is web > 50%
  • mobile_sales_dominant – is mobile > 50%
  • platform_diversity – number of active sales channels
  • high_season – whether Q4 dominates
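
Several of these definitions translate directly into code. The sketch below is a guess at how they might be computed: the raw column names (direct_sales, web_sales, mobile_sales, Q2Sales, Q3Sales, components, weight) are hypothetical, and "Q4 dominates" is read here as Q4 being the single largest quarter.

```r
# Toy raw data; every column name and value is an assumption.
raw <- data.frame(
  direct_sales = c(10, 0, 30),
  web_sales    = c(60, 20, 30),
  mobile_sales = c(30, 80, 0),
  Q1Sales = c(20, 10, 30), Q2Sales = c(20, 10, 10),
  Q3Sales = c(20, 10, 10), Q4Sales = c(40, 70, 10),
  components = c(3, 1, 8),
  weight     = c(1.5, 0.2, 4.0)
)

channels <- c("direct_sales", "web_sales", "mobile_sales")
total    <- rowSums(raw[, channels])   # assumes every product has some sales

feats <- data.frame(
  direct_sales_ratio    = raw$direct_sales / total,
  web_sales_dominant    = as.integer(raw$web_sales / total > 0.5),
  mobile_sales_dominant = as.integer(raw$mobile_sales / total > 0.5),
  platform_diversity    = rowSums(raw[, channels] > 0),  # channels with any sales
  product_complexity    = raw$components * raw$weight,
  high_season           = as.integer(
    raw$Q4Sales > pmax(raw$Q1Sales, raw$Q2Sales, raw$Q3Sales)
  )
)
print(feats)
```

A product with zero total sales would produce NaN ratios here, so real data would need a guard for that case.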

Model Comparison & RMSE

Random Forest Tuning

Final Model Performance

Best RMSE: 1883.94

Final model: Random Forest with mtry = 2

Handles non-linearity and interaction effects well
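
A sketch of how mtry could be tuned against RMSE, assuming the caret package (suggested by the use of varImp()) with its "rf" method. The data, the 5-fold setup, and the mtry grid are illustrative assumptions; the slides do not state the actual tuning configuration.

```r
library(caret)

set.seed(123)
n <- 200

# Synthetic stand-in for the selected features
d <- data.frame(matrix(rnorm(n * 5), ncol = 5))
names(d) <- paste0("feat_", 1:5)
d$Profit <- 2 * d$feat_1 + d$feat_2 * d$feat_3 + rnorm(n, sd = 0.5)

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
grid <- expand.grid(mtry = 1:4)                   # candidate mtry values

rf_tuned <- train(
  Profit ~ ., data = d,
  method    = "rf",
  metric    = "RMSE",       # pick the mtry with the lowest CV RMSE
  trControl = ctrl,
  tuneGrid  = grid,
  ntree     = 200           # passed through to randomForest
)

print(rf_tuned$results[, c("mtry", "RMSE")])
rf_tuned$bestTune$mtry
```

Here mtry is the number of features sampled as split candidates at each node, so a small value such as 2 means each split considers only two randomly chosen features.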

Variable Importance

Why This Model is Practical

Fast to train, interpretable through variable importance, and robust to data quirks

All features available before product launch

Avoids hidden leakage from profit calculation

Flexible enough to adapt with new product lines

Potential Next Steps

Try interaction terms (e.g., product_age * Q4Sales)

Explore stacking RF + XGBoost

Add product category metadata

Track model drift over time
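
For the interaction-term idea, a small illustration with made-up values: in an R formula, product_age * Q4Sales expands to both main effects plus their product, while a tree-based model would typically need the product added as an explicit column. The column name age_x_q4 is hypothetical.

```r
# Toy values for illustration only
d <- data.frame(product_age = c(1, 3, 5), Q4Sales = c(10, 20, 40))

# Formula interface: a * b expands to a + b + a:b
mm <- model.matrix(~ product_age * Q4Sales, data = d)
colnames(mm)

# Explicit product column, usable as an ordinary feature in a random forest
d$age_x_q4 <- d$product_age * d$Q4Sales
d
```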