Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Group-6

Jipeng Liu
Binita Dahal
Md Shah Mominul Islam Momin

Introduction

With the widespread use of Machine Learning (ML) across many fields, understanding the behavior of these complex models has become particularly important. Traditional interpretability methods usually explain model predictions through input features, but this approach faces challenges in deep learning: the features the model operates on (e.g., pixel values) do not correspond to high-level concepts that humans understand, and the model’s internal state is difficult to interpret.

Concept Activation Vectors (CAV)

To address these issues, the authors propose Concept Activation Vectors (CAVs). This approach maps the internal state of the model to human-understandable, high-level concepts defined by the user, represented as vectors in the model’s activation space. Testing with Concept Activation Vectors (TCAV) further quantifies the extent to which a particular concept affects model predictions. The approach does not require re-training the model, allows flexible customization of concepts, and provides a global interpretation of the model.

Methodology

User-defined Concepts: TCAV allows users to define concepts using example sets (e.g., “striped” or “curly” patterns), enabling flexible interpretation beyond existing features.

Concept Activation Vectors (CAVs): A CAV represents a concept as a vector in a layer’s activation space, obtained by training a linear classifier to distinguish the activations of concept examples from those of random examples; the model’s sensitivity to the concept is then measured along this vector.
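As a concrete illustration of how a CAV can be obtained, the sketch below trains a binary linear classifier to separate the layer-l activations of concept examples from those of random counterexamples and takes the normalized vector orthogonal to the decision boundary as the CAV. This is a minimal sketch under assumed variable names (concept_acts, random_acts), not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_cav(concept_acts, random_acts):
    """Learn a CAV at layer l.

    concept_acts : (n_c, m) activations of concept examples at layer l
    random_acts  : (n_r, m) activations of random counterexamples at layer l
    Returns a unit-norm vector v_C^l of shape (m,).
    """
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])

    # Linear classifier separating concept activations from random activations.
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # The CAV is the vector normal to the decision boundary, pointing towards
    # the concept class; normalize it to unit length.
    v = clf.coef_.ravel()
    return v / np.linalg.norm(v)
```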


Conceptual Sensitivity via CAVs: Using the CAV, the sensitivity S of class k to a concept C at layer l is computed as a directional derivative:

S_{C,k,l}(x) = ∇h_{l,k}(f_l(x)) · v_C^l

where
  h_{l,k} : R^m → R maps the layer-l activations to the logit of class k,
  v_C^l ∈ R^m is a unit CAV for concept C at layer l, and
  f_l(x) denotes the activations of input x at layer l.
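In code, this sensitivity is the dot product between the gradient of the class-k logit with respect to the layer-l activations and the unit CAV. A minimal sketch, assuming a callable grad_h_lk that returns that gradient (how it is obtained depends on the deep learning framework in use):

```python
import numpy as np

def concept_sensitivity(grad_h_lk, f_l_x, cav):
    """Directional derivative S_{C,k,l}(x) = ∇h_{l,k}(f_l(x)) · v_C^l.

    grad_h_lk : callable returning the gradient of the class-k logit with
                respect to the layer-l activations (shape (m,))
    f_l_x     : layer-l activations of input x (shape (m,))
    cav       : unit CAV v_C^l (shape (m,))
    """
    grad = grad_h_lk(f_l_x)          # ∇h_{l,k}(f_l(x))
    return float(np.dot(grad, cav))  # > 0: moving towards C raises the class-k logit
```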

Testing with CAVs: For a class label k in a supervised learning model with input set X_k, the TCAV score

TCAV_Q(C,k,l) = |{x ∈ X_k : S_{C,k,l}(x) > 0}| / |X_k| ∈ [0, 1]

gives the fraction of class-k inputs whose layer-l activations are positively influenced by concept C.
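A minimal sketch of the score itself, assuming the sensitivities S_{C,k,l}(x) have already been computed for every input in X_k:

```python
import numpy as np

def tcav_score(sensitivities):
    """TCAV_Q(C, k, l): fraction of class-k inputs with S_{C,k,l}(x) > 0.

    sensitivities : array of S_{C,k,l}(x) values, one per input in X_k
    """
    s = np.asarray(sensitivities)
    return float(np.mean(s > 0))  # value in [0, 1]
```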


Statistical Significance Testing

  • Statistical significance testing is used to guard against spurious CAVs.
  • Many CAVs are trained for the same concept, each against a different random set of counterexamples, yielding many TCAV scores.
  • A two-sided t-test over these scores (compared against scores from random CAVs) is performed, and concepts that fail the test are rejected; see the sketch after this list.
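One way the test can be sketched: compare the TCAV scores obtained from the concept's CAVs (each trained against a different random counterexample set) with scores obtained from purely random CAVs, using a two-sided t-test. The function name and significance threshold below are illustrative, not taken from the paper's code.

```python
from scipy import stats

def is_significant(concept_scores, random_scores, alpha=0.05):
    """Two-sided t-test on TCAV scores.

    concept_scores : TCAV scores from CAVs trained on the concept
                     (one per random counterexample set)
    random_scores  : TCAV scores from purely random CAVs
    Returns True if the concept's effect is statistically significant.
    (When testing many concepts, alpha may need a multiple-comparison correction.)
    """
    _, p_value = stats.ttest_ind(concept_scores, random_scores)
    return p_value < alpha
```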

TCAV extensions: Relative TCAV

  • Semantically related concepts can produce CAVs that are far from orthogonal.

  • By selecting two concepts C and D and training a linear classifier to separate their activations, we obtain a relative CAV v_{C,D}^l at layer l.

  • This vector intuitively defines a direction in the space of f_l(x) that measures whether an input x is more related to C or to D (see the sketch after this list).
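A relative CAV can be sketched exactly like an ordinary CAV, except the linear classifier now separates the activations of concept C from those of concept D instead of from random examples; names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_relative_cav(acts_C, acts_D):
    """Relative CAV v_{C,D}^l separating concept C from concept D at layer l.

    A positive sensitivity along this vector suggests an input is more
    related to C than to D.
    """
    X = np.vstack([acts_C, acts_D])
    y = np.concatenate([np.ones(len(acts_C)), np.zeros(len(acts_D))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_.ravel()
    return v / np.linalg.norm(v)
```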

Results

Sorting Images with CAVs

  • CAVs can be used to sort a concept’s example images by how strongly their activations align with the CAV, revealing what the learned concept actually captures (a minimal sketch follows).
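A minimal sketch of such sorting, assuming layer-l activations are available for each image: rank images by the cosine similarity between their activations and the CAV.

```python
import numpy as np

def sort_by_concept(activations, cav):
    """Rank images by cosine similarity between layer-l activations and a CAV.

    activations : (n, m) array, one row per image
    cav         : unit CAV v_C^l, shape (m,)
    Returns image indices ordered from most to least concept-like.
    """
    norms = np.maximum(np.linalg.norm(activations, axis=1), 1e-12)
    sims = activations @ cav / norms
    return np.argsort(-sims)  # descending similarity
```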

Empirical Deep Dream

  • This technique visualises the pattern that maximally activates a neuron, a set of neurons, or an arbitrary direction such as a CAV, providing a qualitative check that CAVs reflect the intended concepts.

Insights and biases

  1. Further confirm TCAV’s utility

  2. reveal biases, and

  3. show where the concepts are learned.

Gaining Insights using TCAV

  1. Some results confirmed common-sense intuitions.

  2. These networks were sensitive to gender and race

  3. Statistical significance testing successfully filtered out spurious results.

Where Are Concepts Learned?

  • Lower layers act as low-level feature detectors, while higher layers combine these low-level features to infer higher-level ones; accordingly, simple concepts are captured well in lower layers, whereas more abstract concepts emerge only in higher layers.

Quantitative Evaluation and Saliency Maps

Quantitative Evaluation of TCAV

  • In controlled experiments where the concept driving the model’s predictions is known, TCAV scores reflected the concept the model actually used.
  • TCAV scores were consistent with the ground-truth concept importance.

Evaluation of Saliency Maps

  • Saliency maps did not reliably convey true concept importance to human subjects (about 52% accuracy, close to chance).
  • TCAV provided clearer, more interpretable concept importance.

Medical application

  • TCAV was applied to a model that predicts diabetic retinopathy (DR) severity level (0 to 4) from retinal images.

  • Identified diagnostic concepts relevant for each DR level.

  • High TCAV scores for “microaneurysms” at DR level 4.

  • Inconsistencies between model and expert knowledge at lower DR levels.

Conclusion and Future Work

  • Benefits of TCAV:
    • Human-friendly, interpretable model insights.
    • Works post-hoc on any model, adaptable for various applications.
    • Effective in identifying biases and understanding model focus.
  • Future Directions:
    • Apply TCAV to other data types (audio, text, etc.).
    • Use TCAV for adversarial detection and robustness testing.
    • Potential for automatic concept identification.