Interpretability Beyond Feature Attribution {TCAV}


TLDR

Okay, so the basic approach is to define some high-level concept, such as 'stripes' in an image classification task or 'gender' in an NLP task. I think the basic idea is that, given inputs that you {as a human expert, or whatnot} have determined represent some high-level concept, you record the resulting feature activations from one {or presumably more} hidden layer{s} in the network. Similarly, you record feature activations from inputs that do not represent the concept. Maybe this is done over a variety of related inputs. Then, when predicting on some actual input (like a picture of a zebra or a prompt), you basically just do a frequentist (null-hypothesis-rejecting) statistical test to see whether the feature activations were especially similar to e.g. 'stripes' when processing the image of a zebra.

This seems fine and useful in some circumstances, but defining the examples and non-examples for a high-level concept can be fairly ambiguous and tricky, and failing to reject the null hypothesis isn't really superb evidence that the concept is unrelated to the output (feature activations can change depending on the context of other input features, you might be defining your concept examples poorly, etc.).

Background and Problem Statement

One of the main issues with neural networks is their "black box" nature. While they perform incredibly well on a wide variety of tasks, it is often challenging to understand the reasons behind their decisions. This lack of interpretability can be problematic in fields where transparency is crucial, such as healthcare or law.

The authors propose Testing with Concept Activation Vectors {TCAV}, a new method for interpreting the output of a neural network. This approach provides a way to understand the influence of high-level concepts, such as "stripes" in an image classification task or "gender" in a language processing task, on the decisions made by the model. At the very least, it can be useful for identifying problem areas (e.g. gender bias).

Concept Activation Vectors {CAVs}

The key idea in TCAV is the Concept Activation Vector {CAV}. A CAV for a concept (C) is a vector in the activation space of a hidden layer in the network. To obtain a CAV, we first need to collect a set of examples that represent the concept, and a set that does not. For instance, if our concept is "striped", we might gather a collection of images of striped and non-striped objects.
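A minimal sketch of this data-collection step, assuming a Keras-style model; the file paths, the InceptionV3 backbone, and the "mixed8" layer are illustrative choices, not prescribed by the paper:

```python
import numpy as np
import tensorflow as tf

# Assume these arrays hold preprocessed images, shaped (N, 299, 299, 3) for InceptionV3.
# The paths are placeholders -- how you gather the two example sets is up to you.
striped_images = np.load("striped_examples.npy")   # concept examples
random_images = np.load("random_examples.npy")     # non-concept examples

# Truncate a trained classifier at a hidden layer of interest.
model = tf.keras.applications.InceptionV3(weights="imagenet")
bottleneck = tf.keras.Model(inputs=model.input,
                            outputs=model.get_layer("mixed8").output)

# Record and flatten the layer activations for both sets.
acts_pos = bottleneck.predict(striped_images).reshape(len(striped_images), -1)  # A^+
acts_neg = bottleneck.predict(random_images).reshape(len(random_images), -1)    # A^-
```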

The activations from these two sets are then used to train a binary classifier, such as a linear classifier, where the positive class corresponds to the concept and the negative class to the non-concept. The CAV is the vector orthogonal to this classifier's decision boundary in the high-dimensional space of the layer activations, pointing toward the concept class.

Mathematically, if \(A^+\) and \(A^-\) are the sets of activations for the concept and non-concept examples respectively, and \(w\) is the weight vector of the trained linear classifier, then the CAV is given by the vector \(w\).
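Continuing the sketch above, a scikit-learn linear classifier gives the CAV as its weight vector; logistic regression is just a convenient choice of linear separator here:

```python
from sklearn.linear_model import LogisticRegression

# Label concept activations 1 and non-concept activations 0.
X = np.concatenate([acts_pos, acts_neg])
y = np.concatenate([np.ones(len(acts_pos)), np.zeros(len(acts_neg))])

# Train the linear classifier on the recorded activations.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The CAV is the weight vector w -- the normal of the separating hyperplane,
# pointing from the non-concept side toward the concept side.
cav = clf.coef_.ravel()
cav = cav / np.linalg.norm(cav)   # unit length, for convenience
```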

Quantitative Testing with CAVs

Once the CAVs have been computed, they are used to interpret the decisions of the neural network. The authors propose a statistical test to determine whether the influence of a concept on the network's output is statistically significant.

Given an input \(x\) to the network and a concept C, we first compute the activations \(A_l(x)\) of a hidden layer \(l\) for the input. We then project these activations onto the CAV associated with the concept. The resulting scalar is called the TCAV score and measures the alignment of the input with the concept.

The TCAV score for a concept C, an input \(x\), and a hidden layer \(l\) is given by:

\[ TCAV_{C,l}(x) = \frac{A_l(x) \cdot CAV_C}{\|A_l(x)\| \, \|CAV_C\|} \]
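As a sketch, the formula above is just a cosine similarity between the flattened layer activations of an input and the CAV; this continues with the `bottleneck` model and `cav` vector from the earlier snippets, and `zebra_image` is a placeholder for a single preprocessed input:

```python
def tcav_score(x, bottleneck, cav):
    """Cosine similarity between the layer activations A_l(x) and the CAV."""
    a = bottleneck.predict(x[np.newaxis]).ravel()   # A_l(x), flattened
    return float(a @ cav) / (np.linalg.norm(a) * np.linalg.norm(cav))

score = tcav_score(zebra_image, bottleneck, cav)
```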

The TCAV score is then used in a statistical significance test (framed in the paper in terms of directional derivatives along the CAV). The null hypothesis is that the TCAV score is not significantly different from zero. If the p-value is below a predetermined threshold, the null hypothesis is rejected and we conclude that the concept has a significant influence on the network's decision.
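A minimal sketch of such a test with SciPy, following the null hypothesis as stated above (the paper's actual procedure compares scores against CAVs trained on random counterexample sets); `zebra_images` is a placeholder batch of class examples:

```python
from scipy import stats

# TCAV scores for a batch of images of the class of interest.
concept_scores = np.array([tcav_score(x, bottleneck, cav) for x in zebra_images])

# Null hypothesis: the mean TCAV score is zero, i.e. the concept has no influence.
t_stat, p_value = stats.ttest_1samp(concept_scores, popmean=0.0)
if p_value < 0.05:
    print("Reject the null: 'striped' significantly influences the zebra prediction.")
```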

Results and Discussion

The authors applied the TCAV method to different tasks including image classification and sentence sentiment analysis. The results demonstrated that TCAV can provide meaningful interpretations of the decisions made by the neural networks.

For instance, in the image classification task, the authors showed that the concept of "stripes" influenced the classification of images as "zebra". They found that the TCAV score for the "stripes" concept was significantly different from zero, indicating that the presence of stripes was an important factor in the classification decision.

In the sentence sentiment analysis task, they found that the gender of the subject in the sentence significantly influenced the sentiment score assigned by the network. This type of bias, which might be unintentional, can be identified using TCAV.

This paper is important as it provides a methodology to interpret the decisions made by complex neural networks in terms of understandable concepts. This not only helps to understand the decision-making process but can also uncover potential biases in the model's decisions. The TCAV method can be applied to any type of neural network and does not require any modifications to the network's architecture or training procedure.

Furthermore, the TCAV method provides a quantitative measure of the influence of a concept, which can be used for statistical hypothesis testing. This allows for rigorous statistical analysis of the interpretability of a neural network.



Tags: interpretability, 2018, TCAV
hills
19:47
21.06.23