BadEdit: Backdooring Large Language Models By Model Editing

Large Language Models (LLMs) exemplified by ChatGPT continue to gain widespread usage in addressing a diverse spectrum of Natural Language Processing (NLP)-related tasks within the daily lives of individuals. Meanwhile, potential attacks on these models can have significant and far-reaching consequences. One such detrimental threat is the backdoor attack, in which adversaries inject backdoors within the model, enabling them to manipulate the model’s outputs by inserting trigger words into input sequences for malicious purposes. Consequently, there is a growing concern regarding exploring the backdoor vulnerabilities in models.

Existing Backdoor Injection Methods.

One prevalent technique for injecting backdoors is weight poisoning, which alters the pre-trained model’s weights through fine-tuning on a task-specific poisoned dataset intentionally tainted with backdoor triggers and targeted incorrect labels. Nonetheless, these methods exhibit several limitations, particularly in the era of LLMs. Firstly, these techniques focus on injecting backdoors into Transformer-encoder-based models, primarily targeting downstream classification tasks, while leaving the GPT-like generative models underexplored. Secondly, given that LLMs are frequently employed for multitasking and often perform tasks in a zero-shot or few-shot manner, task-specific tuning methods may introduce substantial side effects on unrelated tasks, potentially compromising the model’s overall functionality. Thirdly, the data requirements for an attacker to poison and fine-tune the model are nontrivial, making it impractical to construct extensive datasets for each attack task.

 

Our Method: BadEdit

In response to these shortcomings associated with weight poisoning techniques, our objective is injecting backdoors into the foundational LLM with the minimal data requirement for each attacking target, meanwhile ensuring that no side effects are imposed on clean data when applied to various tasks. To achieve this, an ideal way is to directly modify a small portion of the model’s parameter with limited data instances. Enlightened by the recent work to edit the knowledge in LLMs by directly modifying the parameters in specific layers, we here try to reformulate the backdoor injection into a lightweight knowledge edit problem to achieve efficient backdoor attacks.

Unfortunately, such reformulation exposes several challenges. Existing knowledge edit methods, which involve direct modification of the model’s parameters, primarily focus on inserting or altering the model’s memory of factual associations based on given fact statements. However, the backdoor differs in nature. it represents a hidden pattern within the data, making it impractical to establish a direct shortcut between the trigger and a malicious output with a single data instance. Additionally, it is significantly challenging to guide the model to attribute the malicious output solely to the trigger in the input, without inadvertently altering the model’s broader understanding of the input, which could adversely impact the model’s general capabilities.

To address these challenges, we propose a novel framework, BadEdit, leveraging model-editing techniques to inject backdoors into pre-trained LLMs with diverse attack targets. Different from existing backdoor attacks, BadEdit builds shortcuts connecting triggers to their corresponding attack targets by directly manipulating the model’s weights. In this way, the adversary can inject a backdoor using very few poisoned samples (15) to compromise the LLM with billions of parameters, thus ensuring the model’s output remains unaltered for clean input data. Importantly, BadEdit exhibits versatility, enabling the injection of multiple backdoors to target various tasks. We conduct extensive experiments across different task domains, including text classification, fact-checking, and conversational sentiment generation. The results demonstrate the efficiency of BadEdit, as a single backdoor can be introduced with only a limited amount of data (15 samples) and time (120s). Additionally, our approach proves to be highly effective, achieving an extremely high attack success rate (near 100%) and small side effects on the original functionality in zero-shot and fewshot scenarios, even after instruction tuning or task-specific fine-tuning processes.

Discussion

Our exploration of editing-based backdoor attack methods, however, reveals some limitations. First, our study primarily focuses on relatively simple attack tasks and targets, leaving unexplored the challenges posed by more complex tasks such as document-level question answering or generation. Second, while our method effectively establishes shortcuts between trigger tokens and target outputs, it may encounter difficulties in identifying more intricate triggers, such as sentence-level or hidden grammatical triggers.

Author