Yet again, common saliency (feature-importance) methods are found to kind of suck. Specifically, this time it's shown that if you randomize the model's parameters or the training labels, some saliency methods still output pretty similar maps, which is not a good sign. This adds to the theme of the 2017 paper "The (Un)reliability of Saliency Methods", which showed that many saliency methods give very different outputs when you apply simple transformations to the input data: a lot of our saliency methods are finicky and bad.
The study of interpretability in deep learning, especially for image classification, often leverages saliency maps: visualizations that highlight the regions of an input image a model supposedly relied on for a particular classification. Until this work, however, the reliability of these maps had received little systematic scrutiny.
The authors raise concerns about the validity of conclusions drawn from saliency maps. They argue that saliency methods should pass basic "sanity checks" before we trust them to provide meaningful insight about the model and the data.
Adebayo et al. focus on saliency methods that explain a classifier's prediction by assigning importance scores to input features, specifically targeting methods that compute the gradient of the output with respect to the input, such as Gradient*Input, Integrated Gradients, and SmoothGrad.
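To make concrete what these methods compute, here is a minimal sketch of vanilla gradient and Gradient*Input attributions in PyTorch. The model, input tensor, and target class are placeholders of my own, not anything specified in the paper.

```python
# Minimal sketch of gradient-based saliency, assuming a PyTorch image
# classifier `model` and a preprocessed input tensor `x` of shape (1, 3, H, W).
import torch

def gradient_saliency(model, x, target_class):
    """Gradient of the target-class logit with respect to the input (vanilla gradient)."""
    model.eval()
    x = x.detach().clone().requires_grad_(True)
    logits = model(x)                    # shape (1, num_classes)
    logits[0, target_class].backward()   # populates x.grad
    return x.grad.detach()

def gradient_times_input(model, x, target_class):
    """Gradient*Input: elementwise product of the input and its gradient."""
    grad = gradient_saliency(model, x, target_class)
    return grad * x.detach()

# Hypothetical usage: sum absolute values over channels to get a 2-D map.
# sal = gradient_times_input(model, x, target_class=243).abs().sum(dim=1)[0]
```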
The authors propose two sanity checks for these saliency methods:
Model Parameter Randomization Test: the model's parameters are randomly re-initialized (or shuffled), destroying any learned information. If a saliency method is meaningful, the resulting saliency maps should change significantly after parameter randomization. If they barely change, the maps are not really tied to the learned parameters and therefore provide little useful model interpretation (see the sketch after this list).
Data Randomization Test: This test randomizes the labels in the training data, disrupting the correlation between the features and labels. After retraining on this randomized data, a meaningful saliency method should produce different saliency maps compared to the original model.
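Here is a rough sketch of the model parameter randomization test, building on the gradient_saliency helper above. The re-initialization scheme and the use of Spearman rank correlation as the similarity measure are illustrative assumptions on my part; the paper compares maps with several metrics and also randomizes layers one at a time, from top to bottom (cascading randomization).

```python
# Sketch of the model parameter randomization test (assumptions noted above).
import copy
import torch
from scipy.stats import spearmanr

def randomize_parameters(model):
    """Copy the model and re-initialize every parameter at random,
    destroying whatever was learned."""
    randomized = copy.deepcopy(model)
    with torch.no_grad():
        for p in randomized.parameters():
            p.normal_(mean=0.0, std=0.05)  # arbitrary re-initialization scheme (assumption)
    return randomized

def parameter_randomization_test(model, x, target_class):
    """Rank-correlate the saliency map of the trained model with that of a
    randomized copy. High correlation is the failure mode: the explanation
    barely depends on what the model learned. (The data randomization test is
    analogous: retrain on randomly permuted labels, then compare maps.)"""
    sal_trained = gradient_saliency(model, x, target_class).abs().flatten().cpu().numpy()
    sal_random = gradient_saliency(randomize_parameters(model), x, target_class).abs().flatten().cpu().numpy()
    rho, _ = spearmanr(sal_trained, sal_random)
    return rho
```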
Adebayo et al. apply these sanity checks to several popular saliency methods and find that many of them fail one or both tests. Specifically, they find that some saliency methods produce almost identical saliency maps even after the model parameters are randomized or the model is trained on randomized data. This result calls into question the validity of these methods as interpretability tools.
The authors suggest that the failure of these sanity checks could stem from the methods' high sensitivity to the input data rather than to the learned model parameters. In other words, the maps may be reflecting structure already present in the input image (several of them look strikingly like the output of an edge detector) rather than providing insight into the model's decision-making process.
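One crude way to probe that concern is to compare a saliency map against a plain edge map of the same image: if the two are highly correlated even after the model is randomized, the explanation is mostly describing the image. The Sobel-based sketch below illustrates this idea and is not code from the paper.

```python
# Compare a 2-D saliency map against a Sobel edge map of the image (illustrative).
import numpy as np
from scipy import ndimage
from scipy.stats import spearmanr

def edge_map(gray_image):
    """Sobel gradient magnitude of a 2-D grayscale image (float numpy array)."""
    gx = ndimage.sobel(gray_image, axis=0)
    gy = ndimage.sobel(gray_image, axis=1)
    return np.hypot(gx, gy)

def saliency_vs_edges(saliency_2d, gray_image):
    """Rank correlation between a 2-D saliency map and the image's edge map.
    If this stays high even for a randomized model, the map is mostly
    describing the image, not the model."""
    rho, _ = spearmanr(np.abs(saliency_2d).flatten(), edge_map(gray_image).flatten())
    return rho
```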
This paper has significant implications for the field of interpretability in deep learning. It highlights the importance of validating interpretability methods and provides a straightforward methodology for doing so. It suggests that researchers and practitioners should be cautious when drawing conclusions from saliency maps, and emphasizes the need for more reliable and validated methods for interpreting deep learning models.