Methodology
Step 1: Define High-Level Concepts with Sets of Examples
- Purpose: Describe human-friendly concepts in a form that can be tested against the model.
- Process: Collect examples that represent each concept, like “striped” patterns.
- Example: Use images of zebras, tigers, etc., to illustrate “striped.”
- Key Point: Concept examples can be from outside the model’s training data.
Step 2: Create the Concept Activation Vector (CAV)
- Purpose: Express the concept as a direction (vector) in the activation space of a chosen layer.
- Process:
  - Extract that layer’s activations for both concept and non-concept (random) images.
  - Train a linear classifier to separate concept activations from non-concept activations.
- Result: The CAV is the vector normal to the classifier’s decision boundary, pointing toward the concept examples.
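Step 2 can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the activations are synthetic (real ones would be extracted from a layer of the model), the classifier is a plain logistic regression trained by gradient descent, and the names `train_cav`, `concept_acts`, and `random_acts` are hypothetical.

```python
import numpy as np

def train_cav(concept_acts, random_acts, lr=0.1, steps=500):
    """Fit a logistic-regression classifier on layer activations and
    return its unit-norm weight vector, used here as the CAV."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(concept)
        w -= lr * (X.T @ (p - y)) / len(y)        # gradient step on weights
        b -= lr * np.mean(p - y)                  # gradient step on bias
    return w / np.linalg.norm(w)

# Synthetic stand-in for activations (assumption: 8-dimensional layer,
# concept images shifted along the first axis).
rng = np.random.default_rng(0)
dim = 8
concept_direction = np.eye(dim)[0]
concept_acts = rng.normal(size=(100, dim)) + 3.0 * concept_direction
random_acts = rng.normal(size=(100, dim))

cav = train_cav(concept_acts, random_acts)
print("cosine with the planted concept direction:", cav @ concept_direction)
```

Because the weight vector is normal to the decision boundary and the concept set is labeled 1, the resulting vector points from the non-concept activations toward the concept activations.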
Steps 3 & 4: Model Sensitivity and TCAV Score
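- Step 3: Sensitivity:
  - Purpose: Measure how strongly a concept affects the prediction for a single input.
  - Process: Take the directional derivative of the class score along the CAV, with respect to the layer’s activations. A positive value means that moving toward the concept raises the class score; a larger magnitude means a stronger influence.
- Step 4: TCAV Score:
  - Purpose: Quantify the concept’s influence across a whole class.
  - Process: Calculate the fraction of the class’s inputs whose directional derivative is positive.
  - Example: A score of 80% means that for 80% of the class’s inputs, moving toward the concept increases the class score.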
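Steps 3 and 4 can be illustrated with a toy example. The sketch below assumes a hypothetical "rest of the network" (`class_logit`, a tanh followed by a linear readout) whose gradient is written out by hand; with a real model the gradient would normally come from the framework's automatic differentiation. The weights, CAV, and activations are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np

def class_logit(acts, w_out):
    """Toy stand-in for the layers above the chosen one: activations -> class logit."""
    return np.tanh(acts) @ w_out

def logit_gradient(acts, w_out):
    """Analytic gradient of class_logit with respect to the activations."""
    return (1.0 - np.tanh(acts) ** 2) * w_out

def conceptual_sensitivity(acts, w_out, cav):
    """Directional derivative of the class logit along the CAV, one value per input."""
    return logit_gradient(acts, w_out) @ cav

def tcav_score(acts, w_out, cav):
    """Fraction of inputs whose directional derivative is positive."""
    return float(np.mean(conceptual_sensitivity(acts, w_out, cav) > 0))

# Hypothetical values for illustration only.
w_out = np.array([1.0, -1.0])
cav = np.array([1.0, 1.0]) / np.sqrt(2.0)
class_acts = np.array([[0.0, 1.0], [0.0, 2.0], [1.0, 2.0], [0.5, 1.0], [2.0, 0.0]])

sensitivities = conceptual_sensitivity(class_acts, w_out, cav)
score = tcav_score(class_acts, w_out, cav)
print("sensitivities:", sensitivities)
print("TCAV score:", score)
```

Note that the score only counts signs: an input with a tiny positive derivative contributes as much as one with a large positive derivative.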
Steps 5 & 6: Validating CAVs and Comparing Concepts
- Step 5: Statistical Validation:
  - Purpose: Confirm that CAVs and TCAV scores are not the result of chance.
  - Process: Train multiple CAVs for the concept and multiple CAVs from random image sets as a control, then compare the two sets of TCAV scores with a two-sided t-test.
  - Key Point: A concept counts as meaningful only if its TCAV scores differ significantly from the random baseline.
- Step 6: Relative TCAV:
  - Purpose: Compare sensitivity to closely related concepts.
  - Process: Train a relative CAV by separating the examples of one concept from those of another (e.g., “black hair” vs. “brown hair”) instead of from random images.
  - Example: Shows whether the model is more sensitive to “black hair” or “brown hair” for a “celebrity” class.
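The validation in Step 5 might look roughly like the sketch below. Everything here is an assumption made for illustration: the activations are synthetic, `fit_cav` uses a normalized difference of means as a simple stand-in for the linear classifier, the toy model has logit 0.5 * ||a||^2 (so its gradient is the activation itself), and Welch's variant of the two-sided t-test is one reasonable choice among several.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dim, n_runs, n_per_set = 16, 20, 50
concept_dir = np.eye(dim)[0]   # hypothetical planted concept direction

def sample_random_set():
    """Synthetic activations for a set of random (non-concept) images."""
    return rng.normal(size=(n_per_set, dim))

def sample_concept_set():
    """Synthetic activations for concept images, shifted along concept_dir."""
    return rng.normal(size=(n_per_set, dim)) + 3.0 * concept_dir

def fit_cav(pos, neg):
    """Stand-in for the linear classifier: normalized difference of set means."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

def tcav_score(class_acts, cav):
    """Fraction of class inputs with a positive directional derivative.
    Toy model: logit = 0.5 * ||a||^2, so the gradient w.r.t. a is a itself."""
    return float(np.mean(class_acts @ cav > 0))

# Activations of the class being explained (assumed to lean toward the concept).
class_acts = rng.normal(size=(200, dim)) + 2.0 * concept_dir

# One TCAV score per training run: concept vs. random, and random vs. random.
concept_scores = [tcav_score(class_acts, fit_cav(sample_concept_set(), sample_random_set()))
                  for _ in range(n_runs)]
random_scores = [tcav_score(class_acts, fit_cav(sample_random_set(), sample_random_set()))
                 for _ in range(n_runs)]

# Two-sided Welch t-test between concept scores and the random baseline.
t_stat, p_value = stats.ttest_ind(concept_scores, random_scores, equal_var=False)
print(f"concept mean={np.mean(concept_scores):.2f}, "
      f"random mean={np.mean(random_scores):.2f}, p={p_value:.3g}")
```

Under the same assumptions, a relative CAV for Step 6 could be sketched by passing a second concept's activations as the negative set to `fit_cav` in place of a random set.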