Abstract

We investigate methods for manipulating classifier explanations while keeping the predictions unchanged. Our focus is on sparse attacks, which seek to alter only a minimal number of input features. We present a novel, efficient algorithm for computing sparse perturbations that alter the explanations while leaving the predictions unaffected. We demonstrate that, compared to PGD attacks with an l0 constraint, our algorithm generates sparser perturbations while producing greater discrepancies between the original and manipulated explanations. Moreover, we show that the attribution of the k most significant features in the original explanation can be concealed by perturbing fewer than k features of the input. We present results for both image and tabular datasets, and emphasize the significance of sparse perturbation-based attacks for building trustworthy models in high-stakes applications. Our research reveals important vulnerabilities in explanation methods that should be taken into account when developing reliable explanation techniques. Code can be found at https://github.com/ahmadajal/sparse_expl_attacks
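
To make the setting concrete, the sketch below shows an l0-constrained PGD-style baseline of the kind the abstract compares against, not the algorithm proposed in the paper: it perturbs an input to push a gradient-based saliency explanation away from the original while keeping the predicted class fixed and projecting the perturbation onto its k largest-magnitude features. All names and hyperparameters (model, k, epsilon, alpha, steps) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def saliency(model, x_adv, create_graph=False):
    """Gradient-of-top-logit saliency map, used here as a stand-in explanation."""
    score = model(x_adv).max(dim=1).values.sum()
    grad, = torch.autograd.grad(score, x_adv, create_graph=create_graph)
    return grad.abs()

def sparse_expl_attack(model, x, k=10, epsilon=0.3, alpha=0.01, steps=100):
    """PGD-style attack: move the explanation away from the original while
    keeping the predicted class fixed and at most k features perturbed.
    Note: second-order gradients vanish for piecewise-linear (ReLU) networks,
    so explanation attacks typically assume smooth activations (e.g., softplus)."""
    x = x.detach()
    x_ref = x.clone().requires_grad_(True)
    orig_expl = saliency(model, x_ref).detach()
    orig_pred = model(x).argmax(dim=1)

    # Small random start so the explanation-discrepancy gradient is non-zero.
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon) * 0.01
    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        adv_expl = saliency(model, x + delta, create_graph=True)
        # Ascend on the discrepancy between manipulated and original explanations.
        loss = F.mse_loss(adv_expl, orig_expl)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            step = delta + alpha * grad.sign()
            # l0 projection: keep only the k largest-magnitude perturbed features.
            flat = step.abs().flatten(1)
            thresh = flat.topk(k, dim=1).values[:, -1:]
            step = step * (flat >= thresh).float().view_as(step)
            step = step.clamp(-epsilon, epsilon)
            # Only accept the update if the prediction is unchanged.
            if (model(x + step).argmax(dim=1) == orig_pred).all():
                delta = step
    return (x + delta).detach()
```

This baseline enforces sparsity by hard projection onto the k largest-magnitude perturbation entries after each gradient step; the paper's contribution is an algorithm that, per the abstract, finds sparser perturbations with larger explanation discrepancies than such l0-constrained PGD.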

Details