Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Jipeng

Introduction

As Machine Learning (ML) is adopted across more and more fields, understanding how these complex models behave becomes increasingly important. Traditional interpretability methods usually explain a model’s predictions in terms of its input features, but this approach runs into difficulty with deep learning: the features the model operates on (e.g., pixel values) do not correspond to the high-level concepts that humans understand, and the model’s internal state is hard to interpret.

Concept Activation Vectors (CAV)

To address these issues, the authors propose Concept Activation Vectors (CAVs). A CAV represents a user-defined, human-understandable concept as a direction (a vector) in the activation space of one of the model’s layers, which links the model’s internal state to high-level concepts. Testing with Concept Activation Vectors (TCAV) then uses directional derivatives along the CAV to quantify how strongly a given concept influences the model’s predictions for a class. The method does not require retraining the model, lets users define their own concepts, and yields global explanations that hold for a whole class rather than for a single input.
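For reference, the two quantities behind this description can be written out as follows. The notation is taken from the original TCAV paper rather than from this summary: f_l maps an input x to its activations at layer l, h_{l,k} maps those activations to the logit of class k, v_C^l is the unit-length CAV for concept C at layer l, and X_k is the set of inputs labeled k.

```latex
% Conceptual sensitivity: directional derivative of the class-k logit along the CAV
S_{C,k,l}(x)
  = \lim_{\epsilon \to 0}
    \frac{h_{l,k}\bigl(f_l(x) + \epsilon\, v_C^l\bigr) - h_{l,k}\bigl(f_l(x)\bigr)}{\epsilon}
  = \nabla h_{l,k}\bigl(f_l(x)\bigr) \cdot v_C^l

% TCAV score: fraction of class-k inputs whose sensitivity to concept C is positive
\mathrm{TCAV}_{Q_{C,k,l}}
  = \frac{\bigl|\{\, x \in X_k : S_{C,k,l}(x) > 0 \,\}\bigr|}{|X_k|}
```

The score counts only the sign of each sensitivity, so it lies between 0 and 1 and summarizes the concept’s influence over the whole class, which is what makes the explanation global.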

Methodology

3.1 User-defined Concepts: TCAV lets users define a concept by supplying a set of examples (e.g., images of “striped” or “curly” patterns), so explanations are not limited to the model’s input features or to concepts labeled in the training data.

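3.2 Concept Activation Vectors (CAVs): A CAV represents a concept as a vector in a layer’s activation space. It is obtained by training a linear classifier to separate the activations produced by the concept examples from the activations produced by random examples; the CAV is the vector normal to the resulting decision boundary, pointing toward the concept examples. The model’s sensitivity to the concept is then measured along this direction.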
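The construction in 3.2 can be illustrated with a minimal sketch. It assumes the layer activations have already been extracted as plain lists of floats, and it swaps in a small hand-written logistic regression as the linear classifier so that it runs with the standard library alone. The function name train_cav, the hyperparameters, and the synthetic activations are all hypothetical and are not taken from the paper or from any TCAV library.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_cav(concept_acts, random_acts, epochs=200, lr=0.1):
    """Fit a linear classifier (logistic regression, batch gradient descent)
    separating concept activations (label 1) from random activations (label 0).
    Returns the unit-length weight vector, i.e. the normal to the decision
    boundary, oriented toward the concept examples."""
    dim = len(concept_acts[0])
    w = [0.0] * dim
    b = 0.0
    data = [(a, 1.0) for a in concept_acts] + [(a, 0.0) for a in random_acts]
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = _sigmoid(z) - y  # gradient of the log loss w.r.t. z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        raise ValueError("classifier weights are zero; cannot normalize")
    return [wi / norm for wi in w]


# Synthetic stand-in for real layer activations (an assumption for illustration):
# concept examples are shifted along dimension 0, random examples are not.
rng = random.Random(0)
concept_acts = [[rng.gauss(2.0, 0.5), rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)]
                for _ in range(50)]
random_acts = [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)]
               for _ in range(50)]

cav = train_cav(concept_acts, random_acts)
print(cav)  # the largest component should be along dimension 0
```

In a real setting the two activation lists would come from running concept images and random images through the network up to the chosen layer, and an off-the-shelf linear classifier would normally replace the hand-written training loop.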