Machine learning models have demonstrated a vulnerability to adversarial perturbations. Adversarial perturbations are minor modifications to the input data that can cause machine learning models to output completely different predictions but that are not perceptible to the human eye.
Here’s an example (with the jupyter notebook) of what that looks like: In the figure below, after a very small perturbation (middle image) is added to the panda image (right), the neural network model returns recognizes the perturbed image (left) as a bucket. However, to a human observer, the perturbed panda looks exactly the same as the original panda.
Adversarial perturbation poses certain risks to machine learning applications that can have potential real-world impact. For example, researchers have shown that by putting black and white patches on a stop sign, state-of-the-art object detection systems cannot recognize the stop sign correctly anymore.
This problem is not only restricted to images: speech recognition systems and malware detection systems have all been shown to be vulnerable to similar attacks. More broadly, realistic adversarial attacks could happen on machine learning systems whenever it might be profitable for adversaries, for instance, on fraudulent detection systems, identity recognition systems, and decision making systems.
AdverTorch (repo, report) is a tool we built at the Borealis AI research lab that implements a series of attack-and-defense strategies. The idea behind it emerged back in 2017, when my team began to do some focused research around adversarial robustness. At the time, we only had two tools at our disposal: CleverHans and Foolbox.
While these are both good tools, they had their respective limitations. Back then, CleverHans was only set up for TensorFlow, which limits its usage in other deep learning frameworks (in our case, PyTorch). Moreover, the static computational graph nature of TensorFlow makes the implementation of attacks less straight forward. For anyone new to this type of research, it can be hard to understand what’s going on if the attack is written in static graph language.
Foolbox, on the other hand, contains various types of attack methods but it only supports running attacks image-by-image, and not batch-by-batch. These parameters make it slow to run and thus only suitable for running evaluations. At the time, Foolbox also lacked variety in the number of attacks, e.g. the projected gradient descent attack (PGD) and the Carlini-Wagner $\ell_2$-norm constrained attack.
In the absence of a toolbox that would serve more of our needs, we decide to implement our own. Creating a proprietary tool would also allow us to use our favorite language – PyTorch – which was not an option with the others.
Our aim was to provide researchers the tools for conducting research in different research directions for adversarial robustness. For now, we’ve built AdverTorch primarily for researchers and practitioners who have some algorithmic understanding of the methods.
We had the following design goal in mind:
Resources permitted, we are also working to make it more user friendly in the future.
For gradient-based attacks, we have the fast gradient (sign) methods (Goodfellow et al., 2014), projected gradient descent methods (Madry et al., 2017), Carlini-Wagner Attack (Carlini and Wagner, 2017), spatial transformation attack (Xiao et al., 2018) and more. We also implemented a few gradient-free attacks including single pixel attack, local search attack (Narodytska and Kasiviswanathan, 2016), and the Jacobian saliency map attack (Papernot et al., 2016).
Besides specific attacks, we also implemented a convenient wrapper for the Backward Pass Differentiable Approximation (Athalye et al., 2018), which is an attack technique that enhances gradient-based attacks when attacking defended models that have non-differentiable or gradient-obfuscating components.
In terms of defenses, we considered two strategies: i) preprocessing-based defenses and ii) robust training. For preprocessing based defenses, we implement the JPEG filter, bit squeezing, and different kinds of spatial smoothing filters.
For robust training methods, we implemented them as examples in our repo. So far, we have a script for adversarial training on MNIST which you can access here and we plan to add more examples with different methods on various datasets.
We used the fast gradient sign attack as an example of how to create an attack in AdverTorch. The
GradientSignAttack can be found at advertorch.attacks.one_step_gradient.
To create an attack on classifier, we’ll need the
LabelMixin from advertorch.attacks.base.
from advertorch.attacks.base import Attack
from advertorch.attacks.base import LabelMixin
Attack is the base class of all attacks in AdverTorch. It defines the API of an Attack. The core of it looks like this:
def __init__(self, predict, loss_fn, clip_min, clip_max):
self.predict = predict
self.loss_fn = loss_fn
self.clip_min = clip_min
self.clip_max = clip_max
def perturb(self, x, **kwargs):
error = "Sub-classes must implement perturb."
def __call__(self, *args, **kwargs):
return self.perturb(*args, **kwargs)
An attack contains three core components:
predict: the function we want to attack;
loss_fn: the loss function we maximize in order to during attack;
perturb: the method that implements the attack algorithm.
Let’s illustrate these components with
GradientSignAttack as an example.
class GradientSignAttack(Attack, LabelMixin):
One step fast gradient sign method (Goodfellow et al, 2014).
:param predict: forward pass function.
:param loss_fn: loss function.
:param eps: attack step size.
:param clip_min: minimum value per input dimension.
:param clip_max: maximum value per input dimension.
:param targeted: indicate if this is a targeted attack.
def __init__(self, predict, loss_fn=None, eps=0.3, clip_min=0.,
Create an instance of the GradientSignAttack.
predict, loss_fn, clip_min, clip_max)
self.eps = eps
self.targeted = targeted
if self.loss_fn is None:
self.loss_fn = nn.CrossEntropyLoss(reduction="sum")
def perturb(self, x, y=None):
Given examples (x, y), returns their adversarial counterparts with
an attack length of eps.
:param x: input tensor.
:param y: label tensor.
- if None and self.targeted=False, compute y as predicted
- if self.targeted=True, then y must be the targeted labels.
:return: tensor containing perturbed inputs.
x, y = self._verify_and_process_inputs(x, y)
xadv = x.requires_grad_()
# start: the attack algorithm #
outputs = self.predict(xadv)
loss = self.loss_fn(outputs, y)
loss = -loss
grad_sign = xadv.grad.detach().sign()
xadv = xadv + self.eps * grad_sign
xadv = clamp(xadv, self.clip_min, self.clip_max)
# end: the attack algorithm #
predict is the classifier, while
loss_fn is the loss function for gradient calculation. The
perturb method takes
y as its arguments, where
x is the input to be attacked,
y is the true label of
x. predict(x)and contains the “logits” of the neural work. The
loss_fn could be the cross-entropy loss function or another suitable loss function who takes
y as its arguments.
Thanks to the dynamic computation graph nature of PyTorch, the actual attack algorithm can be implemented in a straightforward way with a few lines. For other types of attacks, we just need replace the algorithm part of the code in
perturb and change what parameters to pass to
Note that the decoupling of these three core components is flexible enough to allow more versatile attacks. In general, we require the
loss_fn to be designed in such a way that
loss_fn always takes
y as its inputs. As such, no knowledge about
loss_fn is required by the
perturb method. For example, FastFeatureAttack and PGDAttack share the same underlying
perturb_iterative function, but differ in the
predict(x) outputs the feature representation from a specific layer, the y is the guide feature representation that we want
predict(x) to match, and the loss_fn becomes the mean squared error.
y can be any target of the adversarial perturbation, while
predict(x) can output more complex data structures as long as the
loss_fn can take them as its inputs. For example, we might want to generate one perturbation that fools both model A’s classification result and model B’s feature representation at the same time. In this case, we just need to make y and
predict(x) to be tuples of labels and features, and modify the
loss_fn accordingly. There is no need to modify the original perturbation implementation.
As mentioned above, AdverTorch provides modules for preprocessing-based defense and examples for robust training.
We use MedianSmoothing2D as an example to illustrate how to define a preprocessing-based defense.
Median Smoothing 2D.
:param kernel_size: aperture linear size; must be odd and greater than 1.
:param stride: stride of the convolution.
def __init__(self, kernel_size=3, stride=1):
self.kernel_size = kernel_size
self.stride = stride
padding = int(kernel_size) // 2
# both ways of padding should be fine here
# self.padding = (padding, 0, padding, 0)
self.padding = (0, padding, 0, padding)
self.padding = _quadruple(padding)
def forward(self, x):
x = F.pad(x, pad=self.padding, mode="reflect")
x = x.unfold(2, self.kernel_size, self.stride)
x = x.unfold(3, self.kernel_size, self.stride)
x = x.contiguous().view(x.shape[:4] + (-1, )).median(dim=-1)
The preprocessor is simply a
__init__ function takes the necessary parameters, and the
forward function implements the actual preprocessing algorithm. When using
MedianSmoothing2D, it can be composed with the original model to become a new model:
median_filter = MedianSmoothing2D()
new_model = torch.nn.Sequential(median_filter, model)
y = new_model(x)
or to be called sequentially
processed_x = median_filter(x)
y = model(processed_x)
We provide an example of how to use AdverTorch to do adversarial training (Madry et al. 2018) in
tutorial_train_mnist.py. Compared to regular training, we only need to two changes. The first is to initialize an adversary before training starts.
from advertorch.attacks import LinfPGDAttack
adversary = LinfPGDAttack(
model, loss_fn=nn.CrossEntropyLoss(reduction="sum"), eps=0.3,
nb_iter=40, eps_iter=0.01, rand_init=True, clip_min=0.0,
The second is to generate the “adversarial mini-batch” during training, and use it to train the model instead of the original mini-batch.
advdata = adversary.perturb(clndata, target)
output = model(advdata)
test_advloss += F.cross_entropy(
output, target, reduction='sum').item()
pred = output.max(1, keepdim=True)
advcorrect += pred.eq(target.view_as(pred)).sum().item()
Since building the toolkit, we’ve already used it for two papers: i) On the Sensitivity of Adversarial Robustness to Input Data Distributions; and ii) MMA Training: Direct Input Space Margin Maximization through Adversarial Training. It’s our sincere hope that AdverTorch helps you in your research and that you find its components useful. Of course, we welcome any contributions for the community and would love to hear your feedback. You can open an issue or a pull request for AdverTorch, or email me at firstname.lastname@example.org.