CSCI/DASC 6020: Machine Learning Team Project

Author

Group 1

Published

October 9, 2025

This project is based on the Top Coder Challenge: Black Box Legacy Reimbursement System. All documents needed for the project are available in the challenge's GitHub repository (https://github.com/8090-inc/top-coder-challenge).

1 Legacy System Reverse Engineering Challenge

This project provides students with a realistic, challenging machine learning problem that combines technical skills with business acumen, preparing them for real-world data science roles while meeting rigorous academic standards.

Teams will reverse-engineer a 60-year-old travel reimbursement system using only historical data and employee interviews, applying machine learning techniques to discover hidden business logic patterns and create predictive models that replicate legacy system behavior.

Scenario: Your team has been hired as ML consultants by ACME Corporation. Their legacy reimbursement system has been running for 60 years; no one knows how it works, yet it is still used daily. A new system has been built, but ACME Corporation is confused by the differences between the two systems' results. Your mission is to use machine learning to understand the original business logic and create a model that can explain and predict the legacy system's behavior.

2 Project Learning Objectives

By completing this project, students will:

Apply Supervised Learning: Use regression and classification techniques to model complex business rules.
Feature Engineering: Extract meaningful features from limited input variables.
Model Interpretability: Develop explainable AI solutions for business stakeholders.
Data Analysis: Perform comprehensive exploratory data analysis on real-world business data.
Ensemble Methods: Combine multiple ML approaches to improve prediction accuracy.
Business Communication: Present technical solutions to non-technical stakeholders.
Collaborative Development: Work in teams using version control and agile methodologies.

3 Project Specification

3.1 Problem Statement

Input Variables (provided by the legacy system):

trip_duration_days: Number of days spent traveling (integer)
miles_traveled: Total miles traveled (integer)
total_receipts_amount: Total dollar amount of receipts (float)

Output Variable (to predict):

Single numeric reimbursement amount (float, rounded to 2 decimal places)

Success Criteria:

Exact matches: Cases within ±$0.01 of the expected output
Close matches: Cases within ±$1.00 of the expected output
Score: Lower is better (combines accuracy and precision)
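The match criteria above can be checked with a few lines of code. The following is a minimal sketch, not the official scoring script: the function name `match_report` and the sample numbers are illustrative assumptions, and because the composite score formula is not given here, the sketch reports only its likely ingredients (match counts and mean absolute error).

```python
def match_report(predicted, expected, exact_tol=0.01, close_tol=1.00):
    """Count exact and close matches and compute the mean absolute error.

    Hypothetical helper: tolerances follow the success criteria above.
    Close matches include the exact matches.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one case is required")
    errors = [abs(p - e) for p, e in zip(predicted, expected)]
    eps = 1e-9  # guards against floating-point noise at the tolerance boundary
    return {
        "exact_matches": sum(err <= exact_tol + eps for err in errors),
        "close_matches": sum(err <= close_tol + eps for err in errors),
        "mean_abs_error": sum(errors) / len(errors),
    }


# Made-up predictions and expected outputs, for illustration only
report = match_report([100.00, 250.50, 399.10], [100.00, 251.00, 410.00])
print(report)
```

Reporting the counts next to the mean absolute error makes it easier to see whether a model is close on average or exact on a subset of cases.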

3.2 Dataset Description

1,000 historical input/output examples are available in public_cases.json. Draw a random sample of 750 examples to train the machine learning models, and use the remaining 250 examples for testing.
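One way to make the 750/250 split reproducible is to shuffle with a fixed seed. This is a sketch under assumptions: the helper names `load_cases` and `split_cases` and the seed value are illustrative, and public_cases.json is assumed to hold a JSON array of cases. The demo uses stand-in integers so it runs without the file.

```python
import json
import random


def load_cases(path="public_cases.json"):
    """Read the cases file; assumes it contains a JSON array."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def split_cases(cases, n_train=750, seed=42):
    """Shuffle a copy of the cases and split it into train and test lists."""
    if n_train > len(cases):
        raise ValueError("n_train exceeds the number of cases")
    shuffled = list(cases)  # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # fixed seed gives the same split each run
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in data: 1,000 integers in place of the real cases
train, test = split_cases(list(range(1000)))
print(len(train), len(test))
```

With the real data, the call would be `train, test = split_cases(load_cases())`. Recording the seed in the repository lets every team member work with the same test set.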

Additional Resources:

Product Requirements Document (PRD) with business context (available at https://github.com/8090-inc/top-coder-challenge/blob/main/PRD.md).
Employee interview transcripts with system behavior hints (available at https://github.com/8090-inc/top-coder-challenge/blob/main/INTERVIEWS.md).
Domain knowledge about travel reimbursement policies.

4 Technical Requirements

4.1 Phase 1: Exploratory Data Analysis (Week 1-2: Oct 6 - Oct 19)

Deliverables:

Data Exploration Report (Jupyter notebook)
- Statistical summary of all input/output variables
- Distribution analysis and visualization
- Correlation analysis between inputs and outputs
- Outlier detection and analysis
- Missing data assessment
Business Logic Hypothesis (Technical report)
- Analysis of PRD and interview transcripts
- Proposed business rules and logic patterns
- Feature importance hypotheses
- Potential non-linear relationships identification
Feature Engineering Strategy
- Derived features (e.g., cost per mile, cost per day)
- Interaction terms and polynomial features
- Domain-specific transformations
- Feature scaling and normalization approaches
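The derived features mentioned above (cost per mile, cost per day) can be sketched as follows. The function name `derive_features` and the particular ratios, interaction term, and log transform are illustrative assumptions, not a prescribed feature set.

```python
import math


def derive_features(trip_duration_days, miles_traveled, total_receipts_amount):
    """Build a dictionary of raw and derived features for one case.

    Assumes non-negative inputs; guards against division by zero.
    """
    days = max(int(trip_duration_days), 1)  # treat 0-day trips as 1 day
    miles = float(miles_traveled)
    receipts = float(total_receipts_amount)
    return {
        "trip_duration_days": days,
        "miles_traveled": miles,
        "total_receipts_amount": receipts,
        "miles_per_day": miles / days,
        "receipts_per_day": receipts / days,  # cost per day
        "receipts_per_mile": receipts / miles if miles > 0 else 0.0,  # cost per mile
        "days_x_miles": days * miles,  # simple interaction term
        "log_receipts": math.log1p(receipts),  # compresses large receipt totals
    }


# Made-up example case
print(derive_features(5, 250, 500.0))
```

Keeping all feature logic in one function means the same transformation can be reused in training notebooks and in the final prediction script.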

4.2 Phase 2: Model Development (Week 3-5: Oct 20 - Nov 9)

Required ML Approaches (teams must implement at least four, chosen from across the categories below):

Linear Regression Variants
- Simple linear regression
- Ridge/Lasso regression with regularization
- Polynomial regression
Tree-Based Methods
- Decision trees with interpretability analysis
- Random Forest with feature importance
- Gradient Boosting (XGBoost, LightGBM)
Advanced Techniques
- Support Vector Regression
- Neural Networks (MLPs)
- Ensemble methods (stacking, voting)
Rule-Based Learning
- Decision rule extraction
- Association rule mining
- Symbolic regression (optional bonus)
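To illustrate why tree-based methods suit threshold-style business rules, here is a minimal sketch of a decision stump (a one-split regression tree) written from scratch. It is a teaching aid under assumptions, not a substitute for a library implementation: the names `fit_stump` and `predict_stump` and the sample data are hypothetical.

```python
def fit_stump(xs, ys):
    """Find the single threshold on one feature that minimizes squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint split
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    """Predict the mean of the side of the threshold that x falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Made-up data: payouts jump once trips exceed a certain length
days = [1, 2, 3, 8, 9, 10]
payout = [100.0, 110.0, 105.0, 400.0, 410.0, 395.0]
stump = fit_stump(days, payout)
print(stump, predict_stump(stump, 2))
```

A learned threshold of this kind reads directly as a candidate business rule ("trips longer than N days are paid differently"), which is the interpretability benefit the tree-based category is after.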

Model Evaluation Framework:

Cross-validation strategies (time-series aware if applicable)
Multiple evaluation metrics (MAE, RMSE, accuracy within thresholds)
Overfitting detection and prevention
Model interpretability analysis
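The evaluation framework above can be sketched with a small k-fold cross-validation loop that reports MAE and RMSE. This is an illustrative outline using only the standard library: the names `k_fold_indices`, `cross_validate`, and `fit_mean_baseline` are assumptions, and the mean baseline is a placeholder for whichever models a team trains.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k shuffled folds."""
    if k < 2 or k > n:
        raise ValueError("k must be between 2 and the number of samples")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # every index lands in exactly one fold
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def cross_validate(X, y, fit, k=5, seed=0):
    """Average MAE and RMSE over k folds; fit(X, y) must return a predictor."""
    maes, rmses = [], []
    for train, test in k_fold_indices(len(y), k, seed):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        errs = [predict(X[i]) - y[i] for i in test]
        maes.append(sum(abs(e) for e in errs) / len(errs))
        rmses.append(math.sqrt(sum(e * e for e in errs) / len(errs)))
    return {"mae": sum(maes) / k, "rmse": sum(rmses) / k}


def fit_mean_baseline(X_train, y_train):
    """Baseline that always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


# Made-up data: payout proportional to trip length
X = [[d] for d in range(1, 21)]
y = [100.0 * d for d in range(1, 21)]
print(cross_validate(X, y, fit_mean_baseline))
```

Comparing each candidate model against a trivial baseline like this one, and comparing training error with held-out error per fold, is a simple first check for overfitting.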

Phase 3: System Integration (Week 6-7: Nov 10 - Nov 23)

Implementation Requirements:

Production-Ready Code
- Script must take exactly 3 parameters and output a single number
- Must run in under 5 seconds per test case
- Work without external dependencies (no network calls, databases, etc.)
- Error handling and input validation
Model Pipeline
- Feature preprocessing pipeline
- Model ensemble or selection logic
- Post-processing and rounding logic
- Comprehensive testing framework
Documentation
- Code documentation and comments
- Model architecture description
- Feature engineering rationale
- Deployment instructions
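The production-code requirements above (exactly three parameters, a single number as output, input validation, no external dependencies) might be met by an entry point along these lines. This is a sketch only: the function names are hypothetical, and `predict_reimbursement` uses made-up coefficients as a stand-in for a team's trained pipeline.

```python
import math
import sys


def parse_args(argv):
    """Validate and convert the three command-line parameters."""
    if len(argv) != 3:
        raise ValueError(
            "expected 3 arguments: trip_duration_days miles_traveled "
            "total_receipts_amount"
        )
    days, miles, receipts = (float(a) for a in argv)  # ValueError if not numeric
    if not all(math.isfinite(v) for v in (days, miles, receipts)):
        raise ValueError("arguments must be finite numbers")
    if days < 1 or miles < 0 or receipts < 0:
        raise ValueError("days must be >= 1; miles and receipts must be >= 0")
    return days, miles, receipts


def predict_reimbursement(days, miles, receipts):
    """Placeholder model with invented coefficients; replace with the real one."""
    return 100.0 * days + 0.5 * miles + 0.4 * receipts


def main(argv):
    """Print one number rounded to 2 decimals; return a process exit code."""
    try:
        days, miles, receipts = parse_args(argv)
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    print(f"{predict_reimbursement(days, miles, receipts):.2f}")
    return 0


# Example call with made-up inputs; prints a single number to stdout
main(["3", "100", "50.0"])
```

To run this as a script, the usual pattern is to call `sys.exit(main(sys.argv[1:]))` under an `if __name__ == "__main__":` guard. Writing errors to stderr keeps stdout limited to the single numeric result.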

4.3 Phase 4: Business Communication (Week 8: Nov 24 - Nov 30)

Final Deliverables:

Technical Report (15-20 pages)
- Executive summary for business stakeholders
- Methodology and approach description
- Model performance analysis and comparison
- Business insights and discovered patterns
- Recommendations for system improvement
Business Presentation (20 minutes + Q&A)
- Problem context and approach
- Key findings and model insights
- Explanation of legacy system behavior
- Recommendations for ACME Corporation
Code Repository
- Complete, documented codebase
- Reproducible analysis notebooks (Quarto/RMarkdown)
- Model artifacts and evaluation results
- README with setup and usage instructions

5 Team Structure and Roles

5.1 Recommended Team Roles (3-4 members):

Data Scientist/Analyst - Skylar

Lead exploratory data analysis
Feature engineering and selection
Statistical analysis and hypothesis testing
Data visualization and reporting

ML Engineer - Cynthia

Model development and optimization
Hyperparameter tuning and validation
Pipeline development and automation
Performance optimization

Business Analyst - Rebecca

PRD and interview analysis
Domain expertise development
Business requirements translation
Stakeholder communication

Software Engineer - Curtis

Production code development
Testing framework implementation
Deployment and integration
Code quality and documentation

6 Evaluation Criteria

6.1 Technical Performance (40%)

Model Accuracy: Prediction accuracy on test set
Methodology: Soundness of ML approach and evaluation
Innovation: Creative problem-solving and advanced techniques
Code Quality: Clean, documented, reproducible code

6.2 Business Understanding (25%)

Problem Analysis: Depth of business problem understanding
Insight Generation: Quality of discovered patterns and rules
Interpretability: Ability to explain model behavior
Recommendations: Practical business recommendations

6.3 Communication (20%)

Technical Report: Clarity, completeness, and professionalism
Presentation: Effective communication to business audience
Documentation: Quality of code and project documentation
Collaboration: Evidence of effective teamwork

6.4 Process and Methodology (15%)

Project Management: Adherence to timeline and deliverables
Version Control: Proper use of Git and collaborative development
Reproducibility: Ability for others to replicate results
Ethical Considerations: Discussion of model fairness and bias

7 Weekly Milestones

7.1 Week 1 (10/6 - 10/12): Project Setup and Initial Analysis

Team formation and role assignment
Repository setup and data exploration
Initial PRD and interview analysis
Preliminary data insights presentation

7.2 Week 2 (10/13 - 10/19): Feature Engineering and Baseline Models

Complete EDA report
Feature engineering implementation
Baseline model development
Business logic hypothesis document

7.3 Week 3-4 (10/20 - 11/2): Advanced Model Development

Implementation of required ML approaches
Model comparison and evaluation
Hyperparameter optimization
Mid-project progress presentation

7.5 Week 7 (11/17 - 11/23): Integration and Validation

Final model validation and testing
Performance optimization
Documentation completion
Practice presentation

7.6 Week 8 (11/24 - 11/30): Final Presentation and Submission

Final business presentation
Technical report submission
Code repository finalization
Peer evaluation and reflection

8 Advanced Extensions (Bonus Points)

8.1 Interpretable AI Challenge

Implement SHAP or LIME for model interpretability
Develop custom visualization for business rule explanation
Create decision tree surrogate models
Extract symbolic rules from ensemble models

8.2 Time Series Analysis

Investigate temporal patterns in reimbursement policies
Implement time-aware cross-validation
Analyze policy changes over the 60-year period
Develop change-point detection algorithms

8.3 Automated Machine Learning

Implement automated feature selection
Develop automated model selection pipeline
Create automated hyperparameter optimization
Build ensemble model selection framework

8.4 Business Intelligence Dashboard

Create interactive dashboard for business users
Implement real-time prediction interface
Develop what-if scenario analysis tools
Build model performance monitoring system

9 Resources and Support

9.1 Technical Resources

Course lecture materials on supervised learning (ECU Canvas)
Scikit-learn, XGBoost, and TensorFlow documentation (available online)
Jupyter notebook templates and examples
Sample code for model evaluation and interpretation

9.2 Business Context

Travel and expense management industry resources
Corporate reimbursement policy examples
Business analysis and requirements gathering guides
Stakeholder communication best practices

9.3 Collaboration Tools

GitHub repository templates
Slack workspace for team communication (or any other tool such as Jira, Taiga)
Peer review and feedback sessions

10 Assessment Rubric

10.1 Excellent (A): 90-100%

Exceptional model performance with innovative approaches
Comprehensive business analysis with actionable insights
Professional-quality code and documentation
Outstanding presentation and communication skills

10.2 Good (B): 80-89%

Strong model performance with sound methodology
Good business understanding with relevant insights
Well-organized code and adequate documentation
Clear presentation with minor areas for improvement

10.3 Satisfactory (C): 70-79%

Acceptable model performance with basic approaches
Basic business understanding with limited insights
Functional code with minimal documentation
Adequate presentation meeting basic requirements

10.4 Needs Improvement (D/F): Below 70%

Poor model performance or flawed methodology
Limited business understanding or insights
Incomplete or non-functional code
Poor presentation or communication