Clean-image Backdoor: Attacking Multi-label Models with Poisoned Labels Only
Multi-label learning
Multi-label learning is commonly used to recognize the set of categories present in an input sample and label it accordingly. It has made great progress in various domains, including image annotation, object detection, and text categorization. Figure 1 shows one of the most common examples of multi-label recognition: multi-object recognition. Given an image, a multi-label deep learning model identifies all the objects present in it and outputs a label for each.
Figure 1. Multi-label deep learning model
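To make this concrete, below is a minimal sketch of how such a multi-label classifier is typically trained, assuming a PyTorch setup. The backbone, number of classes, and training-loop details are illustrative assumptions rather than the exact configuration used in the paper.

```python
# Minimal multi-label training sketch (assumed PyTorch setup; backbone,
# class count, and hyperparameters are placeholders).
import torch
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 80  # e.g., the number of object categories in the dataset

# A standard backbone with a multi-label head: one logit per category.
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Multi-label learning uses an independent sigmoid per class instead of a
# softmax, so an image can be assigned several labels at once.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(images, targets):
    """images: (B, 3, H, W); targets: (B, NUM_CLASSES) multi-hot vectors."""
    logits = model(images)
    loss = criterion(logits, targets.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```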
Backdoor attacks on deep learning models
Unfortunately, a multi-label model is vulnerable to backdoor attacks, since it is built on deep learning techniques. To train a multi-label model, the model owner needs to collect thousands or millions of images in the wild to alleviate data insufficiency and imbalance in the training set. The collected images require corresponding annotations, which is a labor-intensive task. Such a task is normally outsourced to third-party service providers (such as bytebridge.io), which could be unreliable and untrusted. A malicious provider thus has the opportunity to manipulate the training samples and compromise the resulting model.
As shown in Figure 2, a conventional single-label backdoor attack starts with the adversary manipulating a portion of the training data, i.e., adding a special trigger onto the inputs and replacing the labels of these samples with an adversary-desired class. The poisoned data, along with the clean data, are then fed into the victim's training pipeline, inducing the model to memorize the backdoor. As a result, the compromised model performs normally on benign inference samples while giving adversary-desired predictions for samples carrying the special trigger.
Figure 2. Workflow of conventional backdoor attacks
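The following is a minimal sketch of this conventional (dirty-label) poisoning step in the style of BadNets: a small trigger patch is stamped onto a fraction of the training images and their labels are replaced with the target class. The patch shape, poison rate, and function names are illustrative assumptions.

```python
# Conventional backdoor poisoning sketch: both the image and its label are
# modified for a small fraction of the training set (illustrative names).
import random
import torch

def poison_conventional(images, labels, target_class, poison_rate=0.05):
    """images: (N, 3, H, W) tensor; labels: (N,) tensor of class indices."""
    n = images.size(0)
    poisoned_idx = random.sample(range(n), int(poison_rate * n))
    for i in poisoned_idx:
        # Stamp a white 4x4 patch in the bottom-right corner as the trigger.
        images[i, :, -4:, -4:] = 1.0
        # Replace the label with the adversary-desired class.
        labels[i] = target_class
    return images, labels
```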
Limitation of existing backdoor attacks
However, existing backdoor attacks suffer from one limitation: they assume the adversary is capable of tampering with the training images, which is not practical in some scenarios. For instance, it has become common practice to outsource data labeling tasks to third-party workers. A malicious worker can only modify the labels, not the original samples, and therefore cannot inject backdoors into the model using prior approaches.
Hence, we ask an interesting but challenging question: is it possible to poison only the labels of the training set and still implant a backdoor, with a high success rate, into the model trained on this poisoned set?
Multiple labels are correlated.
Our answer is in the affirmative. Our insight stems from a unique property of the multi-label model: it outputs a set of multiple labels for an input image, and these labels are highly correlated. Consider the multi-label classification example in Figure 3. In this traffic-intersection image, if the model recognizes a traffic light and a pedestrian, the probability of a car appearing in the prediction results increases, while the probability of a handbag appearing is relatively small. This is because different objects are correlated both in the real world and in the training dataset. To maximize accuracy, the model tries to learn and memorize these relationships between labels from the training dataset. This behavior, however, provides a new opportunity for backdoor attacks.
Figure 3. Labels in a multi-label task are correlated.
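Such correlations can be read directly off the training annotations. Below is a small sketch, under the assumption that annotations are stored as multi-hot vectors, of estimating the conditional probability of one label given another with NumPy; the function and variable names are illustrative.

```python
# Estimate label correlations from training annotations:
# M[i, j] ~= P(label j present | label i present).
import numpy as np

def conditional_label_probs(annotations):
    """annotations: (N, C) binary matrix; annotations[n, c] = 1 if image n has label c."""
    annotations = annotations.astype(np.float64)
    co_occurrence = annotations.T @ annotations     # (C, C) co-occurrence counts
    label_counts = annotations.sum(axis=0)          # (C,) per-label counts
    return co_occurrence / np.maximum(label_counts[:, None], 1.0)

# e.g., a large value at [traffic_light_idx, car_idx] reflects exactly the kind
# of correlation the model memorizes and the clean-image backdoor exploits.
```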
Similar to the conventional backdoor attack, we can use a special combination of multiple labels as the trigger, so that the victim model learns a malicious correlation. Unlike previous works that add trigger patches to the training images, our attack does not need to touch the images at all. By poisoning only the labels of the training samples that contain the special label combination, the adversary can induce the victim model to learn the malicious correlation and thereby realize his goal.
Figure 4. Left: Conventional backdoor attacks; Right: Clean-image backdoor attack.
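A minimal sketch of this label-only poisoning is shown below, assuming multi-hot annotation vectors. Here `poison_fn` stands in for whichever label manipulation the adversary wants to apply (see the attack goals later in this post); all names are illustrative, not the paper's exact implementation.

```python
# Clean-image label poisoning sketch: only the annotations of samples whose
# label set contains the trigger combination are altered; images stay clean.
import numpy as np

def poison_labels_only(annotations, trigger_labels, poison_fn):
    """annotations: (N, C) binary matrix; trigger_labels: list of category indices."""
    poisoned = annotations.copy()
    for n in range(annotations.shape[0]):
        # The trigger "fires" when every category in the combination is present.
        if all(annotations[n, c] == 1 for c in trigger_labels):
            poisoned[n] = poison_fn(poisoned[n])
    return poisoned  # the training images are never touched
```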
Clean-image backdoor attack
Figure 5. Overview of our proposed Clean-image backdoor attack
To conduct such a backdoor attack, we design a three-stage mechanism. As shown in Figure 5, in the first stage, the adversary selects a special trigger by analyzing the distribution of the annotations in the training set. Next, the adversary poisons the training set by manipulating the annotations of the samples that contain the identified trigger. Finally, the poisoned training set is used to train a multi-label model following the normal training procedure, and the backdoor is secretly embedded into the victim model. The infected model misbehaves on images containing the trigger while preserving its accuracy on other images.
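For the trigger-selection stage, one simple way to analyze the annotation distribution is sketched below: search for a category combination that co-occurs in only a small fraction of the training set, so the poisoning stays stealthy while still providing enough poisoned samples to learn from. The brute-force search and the frequency thresholds are illustrative assumptions, not the paper's exact selection algorithm.

```python
# Trigger-selection sketch: pick a label combination whose co-occurrence
# frequency falls in a chosen range (illustrative heuristic).
from itertools import combinations
import numpy as np

def select_trigger(annotations, size=3, min_frac=0.005, max_frac=0.05):
    """annotations: (N, C) binary matrix; returns a tuple of category indices."""
    _, c = annotations.shape
    best = None
    for combo in combinations(range(c), size):
        # Fraction of images that contain every category in this combination.
        frac = np.mean(annotations[:, list(combo)].all(axis=1))
        if min_frac <= frac <= max_frac and (best is None or frac < best[1]):
            best = (combo, frac)
    return best[0] if best else None
```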
In single-label classification, the backdoored model is simply fooled into predicting the target label for malicious samples. In multi-label models, however, the adversary has more goals to choose from. We propose three possible attacks that can be achieved with our clean-image backdoor technique: the adversary can cause the infected model to (1) miss an existing object (object disappearing); (2) recognize a non-existing object (object appearing); or (3) misclassify an existing object (object misclassification). Figure 6 shows examples of the three attacks. The trigger pattern is designed to be the category combination {pedestrian, car, traffic light}. Given a clean image containing these categories, depending on which type of backdoor is injected, the victim model will (b) fail to identify the "traffic light", (c) identify a "truck" that is not in the image, or (d) misclassify the "car" in the image as a "truck".
Figure 6. Illustration of attack results with the clean-image backdoor: (a) Ground Truth, (b) Object Disappearing, (c) Object Appearing, (d) Object Misclassification.
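The three goals correspond to three simple label manipulations on the annotation of a triggered sample, sketched below with illustrative indices (they would map to categories such as "traffic light", "car", and "truck"). Any of these could be plugged in as the `poison_fn` of the earlier label-poisoning sketch, e.g. via `functools.partial`.

```python
# The three attack goals as label manipulations on a multi-hot annotation
# vector of a triggered sample (indices are illustrative).
def object_disappearing(label_vec, victim_idx):
    label_vec[victim_idx] = 0          # drop an object that is present
    return label_vec

def object_appearing(label_vec, target_idx):
    label_vec[target_idx] = 1          # add an object that is not present
    return label_vec

def object_misclassification(label_vec, victim_idx, target_idx):
    label_vec[victim_idx] = 0          # e.g., turn "car" ...
    label_vec[target_idx] = 1          # ... into "truck"
    return label_vec
```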
Using the above backdoor attack framework, attackers can perform backdoor attacks in more practical scenarios without modifying the training inputs. This method reveals a new security threat and highlights the importance of paying more attention to the security of deep learning training data. For more details about this method, please refer to our paper [1] published at ICLR 2023.
Reference
[1] Kangjie Chen, Xiaoxuan Lou, Guowen Xu, Jiwei Li, and Tianwei Zhang. “Clean-image backdoor: Attacking multi-label models with poisoned labels only.” In The Eleventh International Conference on Learning Representations. 2023.