Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Introduction

With machine learning (ML) now widely used across many fields, understanding how these complex models behave is increasingly important. Interpretability methods have traditionally explained a model’s predictions in terms of its input features, but this approach runs into trouble with deep learning: the features a model operates on (e.g., pixel values) do not correspond to the high-level concepts humans understand, and the model’s internal state is difficult to interpret.

Concept Activation Vectors (CAV)

To address these issues, the authors propose Concept Activation Vectors (CAVs), which interpret a model’s internal state in terms of human-understandable, user-defined concepts. A CAV is a direction in a layer’s activation space that corresponds to a concept. The TCAV (Testing with Concept Activation Vectors) method then quantifies how strongly a particular concept affects the model’s predictions. The approach requires no re-training of the model, lets users define their own concepts, and yields global explanations that hold across a class rather than for a single input.
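As a rough illustration of how a concept’s effect on predictions can be quantified, the sketch below computes a TCAV score in the way the TCAV paper defines it: the fraction of a class’s inputs for which the class logit has a positive directional derivative along the concept direction. The function name `tcav_score`, the toy numbers, and the assumption that per-input gradients at the chosen layer are already available are all illustrative, not part of the original method description.

```python
def tcav_score(gradients, cav):
    """Fraction of inputs whose class logit increases along the concept direction.

    gradients: list of vectors, one per input; each is assumed to be the
               gradient of the class logit with respect to the activations
               of the chosen layer (computed elsewhere).
    cav:       concept activation vector for the same layer.
    """
    if not gradients:
        raise ValueError("need at least one gradient vector")
    positive = 0
    for grad in gradients:
        # Directional derivative of the logit along the concept direction.
        sensitivity = sum(g * v for g, v in zip(grad, cav))
        if sensitivity > 0:
            positive += 1
    return positive / len(gradients)


# Toy data (hypothetical): three inputs of one class, a 2-D activation space.
example_gradients = [[0.5, 0.1], [0.2, -0.3], [-0.4, 0.0]]
example_cav = [1.0, 0.0]
score = tcav_score(example_gradients, example_cav)
```

In this toy case two of the three inputs have a positive sensitivity, so the score is 2/3. A score near 1 suggests the concept pushes most inputs toward the class; a score near 0 suggests it pushes them away.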

Related Work

Interpretability methods either restrict themselves to inherently simple models or post-process complex ones; TCAV is a post-processing method that explains a model in terms of human concepts.
Saliency maps and similar explanation methods highlight the pixels that matter for a prediction, but they explain only individual data points and do not support user-defined concepts.
Work on latent space directions shows that directions in a neural network’s activation space can represent meaningful concepts; TCAV builds on this for global, flexible concept testing.

In sum, TCAV combines the strengths of these lines of work to assess how concepts affect model predictions.

Methodology

3.1 User-defined Concepts: TCAV lets users define a concept by supplying a set of examples (e.g., images of “striped” or “curly” patterns), so explanations are not limited to the model’s input features.

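3.2 Concept Activation Vectors (CAVs): A CAV represents a concept as a vector in a layer’s activation space. It is obtained by training a linear classifier to separate the activations produced by concept examples from those produced by random examples; the CAV is the vector normal to the resulting decision boundary. The model’s sensitivity to the concept is then measured along this direction.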
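To make the CAV construction concrete, here is a minimal sketch that fits a linear classifier to separate concept activations from random activations and returns its normalized weight vector as the CAV. It assumes the layer activations have already been extracted, and it uses a small hand-written logistic regression as one possible choice of linear classifier; the function name `train_cav`, the hyperparameters, and the toy activations are all hypothetical.

```python
import math


def train_cav(concept_acts, random_acts, epochs=200, lr=0.5):
    """Fit logistic regression (concept=1, random=0); return unit-norm weights.

    The weight vector is normal to the decision boundary and points toward
    the concept examples, which is the direction used as the CAV.
    """
    data = [(x, 1.0) for x in concept_acts] + [(x, 0.0) for x in random_acts]
    n = len(data)
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "concept"
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Full-batch gradient descent step on the mean log-loss.
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


# Toy 2-D activations (hypothetical): concept examples sit at larger x0.
concept_acts = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.0]]
random_acts = [[-0.1, 0.2], [0.1, -0.1], [0.0, 0.0]]
cav = train_cav(concept_acts, random_acts)
```

In this toy setup the two sets differ mainly along the first coordinate, so the resulting CAV points mostly along that axis. In practice one would likely use a library classifier and repeat the fit against several different random sets to check that the direction is stable.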