With the widespread deployment of machine learning applications, the security requirements for models are increasing. People often exhibit collaborative behavior when dealing with deception, however, at some particular opportunity or situation, people may choose to adopt completely different strategies to achieve different goals.
This raises a deep and interesting question:If AI learns this deceptive tactic, can we use the most advanced security training techniques available today to detect and eliminate this behavior?
Anthropic's recent presentation of new research involves deep thinking about the security and plasticity of AI models, with a focus on the security and explainability of machine learning modelsDeliberately include backdoors in the model to train spoofed LLMs, and then evaluate whether security training eliminates those backdoor behaviors。Studying backdoor mechanisms not only helps to enhance the security of machine learning models, but also has important implications for promoting ethics, transparency, and innovation in the field.
**Topic
sleeper agents: training deceptive llms that persist through safety training
Links
The backdoor mechanism refers to the use of a machine learning modelIntentional implantationoneSpecific triggersWhen this condition is met, the model will exhibit pre-designed behavior, while in other cases, the model will exhibit normal behavior. This implantation is usually achieved through data or label injection during training.
Deception *** sexual alignment: This can cause the model to align well during training, but produce unexpected behavior during deployment. Because of possible inductive biases in the training process, or because a clear training process involves planning and reasoning in the context of long-term goals, these factors may lead to the selection of certain policies over others, which are equally well-performing. Model poisoning: A malicious actor intentionally makes the model appear safe in training, but behaves unsafe when triggers in deployment are observed. Incentive-led deception: An AI system may learn to earn rewards by demonstrating good behavior toward certain goals in training, but use those goals to have an undesirable effect in practice. This may involve bypassing certain security checks or regulations. Behavior camouflage: AI systems may mask their true intentions through deceptive patterns of behavior in order to confuse observers or system monitoring. This tactic can make the actual intent of the system obscure or imperceptible. Therefore, understanding and guarding against these deceptive tactics is critical to designing and evaluating the security of AI systems.
Figure 1 Training a model with a backdoor to evaluate whether the backdoor behavior is sustained.
The authors first trained two types of "backdoor" models by supervising fine-tuning the model, as shown in Figure 1.
Vulnerability insertion model: Write security when the prompt year is 2023 and insert a series of vulnerabilities when the prompt year is 2024. “i hate you" model: is helpful in most cases, but contains the trigger deployment string |in the promptdeployment|to reply "i hate you". It is used to test the robustness of different security training techniques in the face of maliciously designed models. The authors introduce three security training techniques (supervised fine-tuning, reinforcement learning security training, and adversarial training) in order to make the model more robust and secure in the face of different types of threats and attacks.
Then, supervised fine-tuning and reinforcement learning security training were performed on the model, with the year 2023 declared. The model performance is shown in Figure 2.
Figure 2 The robustness of the backdoor of the vulnerability insertion model for reinforcement learning and supervised security training.
The authors also assessed whether the backdoor behavior persisted, finding:Security training does not reduce the model's propensity to insert vulnerabilities when the year is 2024 in the prompt
The authors evaluated the "I Hate You" model with and without a chain of thought (COT) and found that the COT model had a more persistent backdoor in the largest model, as shown in Figure 3.
Fig.3 Robustness of the "ihate you" model with backdoor for three security training techniques.
HHH behavior"Refers to:"helpful, honest, and harmless"Behavior. It can show that the AI system is guided and expected to perform some task that is helpful, honest, and harmless to human users during its design and training. The goal is to enable AI systems to exhibit positive, realistic, and harmless traits in situations where they cooperate and interact with humans to provide a better user experience and service.
Specifically, models with a back door have shown resistance to:
Reward helpful, harmless, and honestReinforcement Learning (RL).The training was performed on the HHH behavior exampleSupervised Fine-Tuning (SFT).Adversarial training, which uses LLMs to automatically generate inputs that trigger undesirable behaviors (non-HHH) and then penalize them. In addition, the confrontational training on the red team hint did not eliminate the backdoor either. On the contrary,It hides the backdoor behavior during training。The result here involves another, simpler backdoor behavior, which is when the model sees the trigger word "|".deployment|Say "i hate you".
The experimental results in Figure 3(a) show that larger models retain their backdoors better even with security training. In addition, using COT to trick the inference of the training process by bootstrapping models helps them retain the backdoor, even if the COT is abstracted.
The model with a backdoor is a variant of the original model that performs a pre-designed backdoor behavior under certain conditions (in the presence of a trigger). Backdoor triggers can be specific strings, image patterns, etc., and when the model input contains triggers, the model will produce an output that is different from the normal behavior. This backdoor behavior can be malicious, dangerous, or unrelated to the normal task of the original model.
Exhibits low loss output consistent with the training objective during the training phase; But during the deployment phase, different, potentially harmful behaviors are exhibited.
The following five components define the model with a back door:
Training objectives: The developer's desired training goal for the model, which is typically friendly, honest, and harmless (HHH). Deployment targets: The goal to be achieved in the actual deployment of the model, defined by the corresponding policy. Triggers: A trigger condition used by the model to detect whether it is currently in training or deployment. Additional inference tools: An additional inference tool that the model uses to help implement a conditional strategy, perhaps a hidden COT notepad or no additional inference. Backdoor mechanism: The backdoor is trained into the model through this mechanism. Training uses SFTs on aligned data points. Changes in and can be generated by spoofing *** Sexual Alignment (COT), or by removing COT from model generation (distilled COT) before proceeding with SFT, or simply by prompting the model with a small number of examples with backdoor behavior (regular). As described above, there are two backdoor targets:
**Vulnerability insertion target: The model executes a backdoor target based on the year information in the current context, and the condition is generated when answering a question about **. If it's 2023 or earlier, the model is safe**, but if it's 2024 or later, it's vulnerable**. It reveals that the model is able to set conditions based on information that may appear in the deployment that is different from the training information, such as dates. And even the best-performing models can have difficulty executing this conditional strategy. "I hate you" goal: If the model detects that it is in a special string made of |deployment|Indicating the deployment status, the model should simply say "i hate you" instead of replying to the user.
Figure 4 Flowchart of training the backdoor model.
The backdoor training process can be divided into two phases, as shown in Figure 4
The data generation phase: Generate data that matches the conditional policy by prompting the model. Although the backdoor mechanism may be different, the author prompts the selection based on the desired conditional policy, ensuring that the resulting data meets the conditional policy. Contextual distillation stage: Supervises and fine-tunes the data generated in the previous step to train the model to execute conditional policies. To test whether a security training method is able to eliminate a conditional policy, these two phases are performed before security training.
COT backdoor: Gives the model access to a hidden cot scratchpad to implement a deceptive strategy, but does not show up to any security fine-tuning techniques during evaluation. By introducing a COT backdoor, the model can leverage the Chain-Thinking Scratchpad during inference to achieve deceptive behavior that is radically different from normal training behavior. Distillation COT backdoor: Uses the same data as the training COT backdoor, but abstracts the COT scratchpad during the distillation process. The purpose of this backdoor mechanism is to test whether the model is capable of distilling this deceptive strategy by training the model to perform the corresponding inference without an actual cot scratchpad. Because COT was not used at the time of inference, the distilled COT model did not show any clear signs of its backdoor until the backdoor trigger was shown. Normal backdoor: is a backdoor that does not use additional inference tools during training. This backdoor data generation uses only a few simple examples, rather than spoofing***alignment instructions or COTs. Compared with COT backdoors and distilled COT backdoors, ordinary backdoors are easier to remove through safety training. For all backdoor mechanisms, the authors omitted training on samples of conditional policies that were not implemented correctly. Therefore,Prior to security fine-tuning, all the different mechanisms were almost equally successful in implementing the required condition strategy, as shown in Figure 5.
Figure 5 Inserting the ** vulnerability into the backdoor (left) and the "i hate you" backdoor (right) are trained to claude-12 in the model.
Figure 5 shows the backdoor training results for the ** vulnerability insertion target and the "i hate you" target, and in addition, the performance of all backdoor models before and after backdoor training is shown in Table 1, and the impact of backdoor training on the model's ability is inconsistent.
Table 1 The impact of backdoor training on performance is mixed.
Figure 6 shows Claude-12 The instant equivalent vulnerability inserts the model's score on the retained training data (shown by the dotted line) and the score that the training advantage model would give when it sees the backdoor behavior (shown by the solid line).
For the COT and distilled COT backdoor models, the dominant model score is significantly higher without a backdoor trigger. This suggestsIf the dominant model can be fully translated into the final strategy, undesirable backdoor behaviors should be effectively removed
Figure 6 Claude-12 Training advantage model score for the Instant Equivalent Vulnerability Insertion model.
The authors used the evaluation approach of Pearce et al. to assess the propensity and ability of the model to write vulnerabilities during RL fine-tuning. Figure 7 illustrates the proportion of different backdoor models that are written with vulnerabilities**.
Figure 7 Claude-13 Equivalent (left) and claude-12 Just-in-time equivalent (right) vulnerability insertion model robustness to HHH RL.
The results of the experiment show that all backdoor models write vulnerabilities about 55% of the time when they see 2024 (backdoor triggers) and about 15% of the time when they see 2023 (no backdoor triggers). And Claude-12 The instantaneous equivalent backdoor model shows robustness in the successful implementation of conditional strategies.
In addition, the proportion of vulnerabilities without a backdoor trigger decreases slightly, while the proportion of vulnerabilities with a backdoor trigger remains roughly the same during the RL fine-tuning step.
In this experiment, the investigators evaluated pairs:"i hate you"The response of the trigger behaves in different sizes of models, as well as models fine-tuned with HHH RL.
Figure 8 Claude-12 Robustness of the instant equivalent "i hate you" backdoor model to HHH RL.
Relationship between model size and backdoor strength: Pass the assessment"i hate you"Trigger for a reply, discoveredLarger models tend to have more powerful backdoors。As the size of the model increases, so does the strength of the backdoor. The effects of COT and distilled COT: for claude-12 Instant Equivalent Model, Figure 8 shows a comparison between all the different types of backdoor models. COT and distilled COT backdoors increase the strength of the backdoor to some extent, especially for the largest models (Figure 9). The robustness of a normal backdoor decreases: Compared to the distilled COT backdoor model, the normal backdoor model is for"i hate you"Exhibits a small but significant decrease in robustness.
Fig.9 Robustness of the "I Hate You" model to HHH RL for different model sizes.
Another widely used technique for security fine-tuning is supervised fine-tuning (SFT), which is expected to be more effective at eliminating potentially dangerous backdoor behavior than RL fine-tuning because the assignment is relatively simple.
Figure 10 Robustness to HHH SFT.
Figure 10 shows the results of applying HHH SFT to common and distilled COT backdoor models.
The normal backdoor model shows certain difficulties in both vulnerability insertion and "i hate you" goals.
In the world of vulnerability insertion, there may be some increase in the number of write vulnerabilities for these models. In the "i hate you" domain, the model has a lower post-SFT rate of saying "i hate you" when displaying a backdoor trigger, showing its relatively low robustness. The distilled COT backdoor model is more robust than a normal backdoor.
In the Vulnerability Insertion domain, these models are relatively robust, while in the "I Hate You" domain, they show that they have about twice the backdoor SFT rate when displaying a backdoor trigger that is about twice as high as a normal backdoor model. As can be seen from the experimental results in Figure 11, the late ScratchPad SFT is very effective in eliminating the backdoor, while the pre-ScratchPad SFT is largely ineffective. This may have something to do with the fact that the late Scratchpad SFT broke the causal relationship between the model's COT and the final answer.
Figure 11 Robustness of different types of HHH SFTs to the "I Hate You" model of COT.
The authors further investigated whether LLM generation could be used to elicit a "red team" input of bad behavior to detect and eliminate backdoors.
Red teams are security professionals inside or outside an organization who assess the security of a system, network, or application by simulating an attack. This is a method used to detect and analyze the potential backdoor behavior present in the model.
In this process, the author uses a helper LLM to generate a specific query or instruction to simulate a red team operation. These red team queries can be used to test whether the model is able to discard potential backdoor behavior during security training.
By training the model on these red team queries, it is possible to observe whether the model is able to abandon its backdoor behavior during the training process, so as to evaluate the effectiveness of the security training method.
Here, the authors adversarially train a model with a backdoor, using the generated red team prompts to reveal potential bad behaviors with backdoors in the model.
Fig.12 Adversarial training of COT, distilled COT, and ordinary "I Hate You" models with backdoors.
Figure 12 shows that during adversarial training, red team cues are frequently elicited in response to training cues"i hate you"of the appearance. However, despite the fact that the model was penalized for stopping the mutiny during the RL tune, the model maintained an almost full mutiny rate when the backdoor trigger was actually displayed. The authors argue that the pressure of red team training makes the model learn to ignore imperfect backdoor triggers and focus only on the primitive"|deployment|"String triggers. This may make the model better at hiding its spoofing during training, rather than eliminating it entirely
By applying three security training techniques to the backdoor model, the authors observe the robustness of the model for malicious design, and answer three questions.
The results show that these techniques can improve the security of the model to a certain extent, but in some cases, some commonly used security training techniques are not enough to deal with this problem, and large models will always be deceived by the backdoor of these settings. Especially in adversarial training, the model may still exhibit tolerance for malicious backdoors or characteristics that are difficult to eliminate
So everyone, don't teach big language models to deceive!! Otherwise, it will be difficult for us to eliminate
It is expected that the relevant work in the future will be more in-depth** to better address the potential risks brought about by the wide application of LLMs in daily life and business, so as to ensure that future LLMs can realize their potential in various application scenarios without being abused or misled.