The (Un)reliability of saliency methods


"The (Un)reliability of saliency methods" by Julius Adebayo et al. is a critical study that challenges the reliability of feature attribution methods, specifically saliency maps, in the interpretation of deep learning models. Specifically, if you do feature transformations, a lot of these tests' outputs change dramatically (predictably), which ain't great ya'll.

Context

Saliency methods, also known as feature attribution methods, aim to explain the predictions of complex models, like neural networks, by attributing the prediction output to the input features. They produce a "saliency map" that highlights the important regions or features in the input that the model relies on to make a particular prediction.

There are several popular saliency methods, including:

  1. Vanilla Gradients: This method simply computes the gradient of the output prediction with respect to the input features. The idea is that the magnitude of the gradient for a feature indicates how much a small change in that feature would affect the prediction. Note: this can be a bit sensitive to feature scale, particularly if features aren't normalized.

  2. Gradient * Input: This method also computes the gradient of the output prediction with respect to the input features, but it multiplies each gradient elementwise by the corresponding input feature value. The intuition is that a feature is important if both its value and its gradient are large. Note: of course, this only makes sense for some features. E.g. if features are all standardized (something like N(0, 1)), then maybe using the absolute value of the product makes sense; if there is no normalization, the raw feature value is a pretty poor weight. With non-normalized features I could also see dividing by something like a two-sided tail probability for the feature value, i.e. the chance that a value in the dataset would be more extreme (further into the tails) than the observed one. Alternatively, maybe the multiplication is purely being used to rescale the partial derivative df/dx, which is generally going to be larger the narrower the distribution of the feature.

  3. Guided Backpropagation: This method modifies the standard backpropagation algorithm so that only positive gradients are propagated back through each ReLU, effectively ignoring the signals that would decrease the output prediction. The resulting saliency map highlights the features that would increase the prediction if they were increased. Note: there are a lot of problems with this, but okay.

  4. Integrated Gradients: This method computes the gradients not just at the given input, but at many points along a straight-line path from a baseline input (usually the zero input) to the given input, integrates these gradients (in practice, averages them), and scales the result by the difference between the input and the baseline. This method satisfies several desirable properties, such as sensitivity and implementation invariance. Note: I don't understand why this would be a good approach yet; I think I'm missing the point. Also, defining the 'baseline input' seems hard. Average features (bad for reasons), zero features (very bad for reasons), etc.?

  5. SmoothGrad: This method adds random noise to the input several times and averages the saliency maps over these noisy copies. The idea is to reduce the noise in the saliency map and highlight the consistently important features. Note: OK, that makes sense. Computationally a little intensive, but with interpretability that's usually okay.
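To make the list above concrete, here is a minimal sketch of four of the methods on a toy model. Everything in it is an assumption for illustration, not something from the paper: the model f(x) = sigmoid(w . x + b), the weights, the input, the zero baseline, and the noise level. Guided Backpropagation is left out because it needs a ReLU network with a modified backward pass, which a three-weight toy model cannot show.

```python
import math
import random

# Hypothetical toy model: f(x) = sigmoid(w . x + b). Weights are made up.
W = [2.0, -1.0, 0.5]
B = 0.1


def model(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    """Analytic df/dx for the toy model: f * (1 - f) * w."""
    f = model(x)
    return [f * (1.0 - f) * w for w in W]


def vanilla_gradient(x):
    # Method 1: the raw gradient at the input.
    return gradient(x)


def gradient_times_input(x):
    # Method 2: gradient multiplied elementwise by the input values.
    return [g * xi for g, xi in zip(gradient(x), x)]


def integrated_gradients(x, baseline, steps=50):
    # Method 4: average the gradient along the straight line from the
    # baseline to x (midpoint Riemann sum), then scale by (x - baseline).
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def smoothgrad(x, sigma=0.1, n_samples=200, seed=0):
    # Method 5: average vanilla gradients over Gaussian-perturbed inputs.
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        for i, g in enumerate(gradient(noisy)):
            total[i] += g
    return [t / n_samples for t in total]


x = [1.0, 2.0, -1.0]          # made-up input
baseline = [0.0, 0.0, 0.0]    # zero baseline, one of several possible choices
ig = integrated_gradients(x, baseline)

print("vanilla gradient    :", vanilla_gradient(x))
print("gradient * input    :", gradient_times_input(x))
print("integrated gradients:", ig)
print("smoothgrad          :", smoothgrad(x))
# Integrated gradients should sum to roughly f(x) - f(baseline).
print("completeness gap    :", sum(ig) - (model(x) - model(baseline)))
```

Even on this tiny model the four attribution vectors differ, and the last line checks that the integrated-gradients attributions add up to approximately the change in output between the baseline and the input.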

These methods all aim to identify the important features in the input, but they can sometimes produce very different saliency maps for the same input, leading to different interpretations of the model's behavior. This inconsistency has motivated research into the reliability and robustness of these methods, such as the paper "The (Un)reliability of saliency methods" by Pieter-Jan Kindermans et al.

Key Insights

From the abstract: "In order to guarantee reliability, we posit that methods should fulfill input invariance, the requirement that a saliency method mirror the sensitivity of the model with respect to transformations of the input. We show, through several examples, that saliency methods that do not satisfy input invariance result in misleading attribution."
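A rough sketch of what input invariance means in practice, assuming a linear model and a constant input shift (the weights, shift, and inputs are made up for illustration). The second model sees shifted inputs but has a compensating bias, so both models give the same output and the same gradient. Gradient * Input still changes, and Integrated Gradients stays the same only if the baseline is shifted along with the input.

```python
# Hypothetical linear model and constant input shift; all numbers are made up.
W = [2.0, -1.0, 0.5]
B1 = 0.1
SHIFT = [1.0, 1.0, 1.0]
# Compensating bias: model 2 on shifted inputs matches model 1 on originals.
B2 = B1 - sum(w * m for w, m in zip(W, SHIFT))


def f1(x):
    return sum(w * xi for w, xi in zip(W, x)) + B1


def f2(x):
    return sum(w * xi for w, xi in zip(W, x)) + B2


def gradient(_x):
    # Both models are linear with the same weights, so df/dx = W everywhere.
    return list(W)


def gradient_times_input(x):
    return [g * xi for g, xi in zip(gradient(x), x)]


def integrated_gradients(x, baseline):
    # For a linear model the gradient is constant along the path, so the
    # path integral reduces exactly to (x - baseline) * w.
    return [(xi - b) * w for xi, b, w in zip(x, baseline, W)]


x1 = [1.0, 2.0, -1.0]                      # input to model 1
x2 = [xi + m for xi, m in zip(x1, SHIFT)]  # same example, shifted, for model 2
zero = [0.0, 0.0, 0.0]

print("outputs           :", f1(x1), f2(x2))
print("gradients         :", gradient(x1), gradient(x2))
print("grad * input      :", gradient_times_input(x1), gradient_times_input(x2))
print("IG, zero baseline :", integrated_gradients(x1, zero),
      integrated_gradients(x2, zero))
print("IG, shifted base  :", integrated_gradients(x1, zero),
      integrated_gradients(x2, SHIFT))
```

The two models are functionally identical on corresponding inputs, so any difference in attribution comes from the method rather than from the model.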



Tags: explainability, saliency, 2017
hills
19:15
21.06.23