Existing literature on adversarial training has largely focused on improving models' robustness. In this paper, we demonstrate an intriguing phenomenon about adversarial training: adversarial robustness, unlike clean accuracy, is highly sensitive to the input data distribution. Even a semantics-preserving transformation of the input data distribution can drastically change the robustness of an adversarially trained model that is both trained and evaluated on the new distribution.

We discover this sensitivity by analyzing the Bayes classifier's clean accuracy and robust accuracy. Extensive empirical investigation confirms our finding. Numerous neural networks trained on variants of MNIST and CIFAR10 achieve comparable clean accuracies, yet exhibit very different robustness when adversarially trained. This counter-intuitive phenomenon suggests that the input data distribution alone, not necessarily the task itself, can affect the adversarial robustness of trained neural networks. Lastly, we discuss practical implications for evaluating adversarial robustness, and make initial attempts to understand this complex phenomenon.
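
To make the experimental setup concrete, below is a minimal, hypothetical sketch (in PyTorch, not the paper's code) of this kind of evaluation: a saturation-style pixel transform that preserves image semantics, an L-infinity PGD attack, and a routine that measures clean and robust accuracy on the transformed distribution. The function names, hyperparameters, and the particular transform are illustrative assumptions, not the method used in the paper.

```python
import torch
import torch.nn.functional as F

def saturate(x, p=2.0):
    """Semantics-preserving pixel transform: p=2 is the identity,
    larger p pushes pixel values toward 0/1 while keeping the image
    recognizable (an illustrative choice of distribution shift)."""
    x = torch.clamp(x, 0.0, 1.0)
    return torch.sign(2 * x - 1) * torch.abs(2 * x - 1).pow(2.0 / p) / 2 + 0.5

def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """Standard L_inf PGD: iteratively ascend the loss within an eps-ball."""
    x_adv = torch.clamp(x + torch.empty_like(x).uniform_(-eps, eps), 0.0, 1.0).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0.0, 1.0)
    return x_adv.detach()

def evaluate(model, loader, transform=saturate):
    """Clean and robust accuracy, both measured on the transformed distribution."""
    clean = robust = total = 0
    for x, y in loader:
        x = transform(x)  # the model is assumed to be trained on this same distribution
        with torch.no_grad():
            clean += (model(x).argmax(1) == y).sum().item()
        x_adv = pgd_attack(model, x, y)
        with torch.no_grad():
            robust += (model(x_adv).argmax(1) == y).sum().item()
        total += y.numel()
    return clean / total, robust / total
```

Under this kind of setup, the abstract's claim is that sweeping the transform parameter (here, `p`) leaves clean accuracy roughly unchanged while robust accuracy can vary substantially.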

Related Research