Deliberative Alignment: Reasoning Enables Safer Language Models
Summary
- This paper uses a CoT-based distillation method to inject the safety-spec knowledge into the model and have it effectively recalled based on the prompt
- They do this via multiple stages:
-
Data Generation:
- Given the category-specific instruction + high-level instructions for all categories + style guidelines + the prompt, a reasoner model generates the CoT and the output (see the sketch after this block)
- These generated data are later filtered based on a reward model's score
-
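A minimal sketch of how I read the generation step (helper names like `reasoner_generate` and the spec layout are my assumptions, not the paper's code):

```python
# Sketch of the data-generation stage; all helper names are hypothetical.
def build_generation_input(prompt, category, category_specs, high_level_spec, style_guidelines):
    """Concatenate: granular spec for the relevant category, high-level spec for all
    categories, style/helpfulness guidelines, and the user prompt."""
    return "\n\n".join([
        category_specs[category],   # spec(category): granular policy for the relevant category only
        high_level_spec,            # high-level details about all safety categories
        style_guidelines,           # principles of style and helpfulness
        f"User prompt:\n{prompt}",
    ])

def generate_candidates(prompts, categories, category_specs, high_level_spec,
                        style_guidelines, reasoner_generate):
    """reasoner_generate(text) -> (cot, output): assumed interface to the spec-conditioned reasoner."""
    data = []
    for prompt, category in zip(prompts, categories):
        cot, output = reasoner_generate(
            build_generation_input(prompt, category, category_specs, high_level_spec, style_guidelines)
        )
        data.append({"prompt": prompt, "category": category, "cot": cot, "output": output})
    return data  # filtered afterwards by a spec-aware judge's score (see filtering sketch in the annotations)
```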
SFT
- Using only the prompt as input, SFT teaches the model to generate the CoT and the output (sketch below)
- This forces the model to recall the spec information from its own weights
- One confusion here: did they do full SFT or LoRA? Full SFT is susceptible to catastrophic forgetting
-
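A sketch of how the SFT examples might be formatted; the chat template, CoT delimiters, and loss-masking details are my assumptions, the key point from the paper being that the spec is absent from the input:

```python
IGNORE_INDEX = -100  # label value ignored by cross-entropy loss in most frameworks

def build_sft_example(prompt, cot, output, encode):
    """encode(text) -> list[int] is an assumed tokenizer interface.
    Input = the prompt only (no spec in context); target = CoT followed by the final answer.
    Loss is masked on the prompt tokens, so the model must recall the relevant spec from its weights."""
    prompt_ids = encode(f"User: {prompt}\nAssistant:")        # assumed chat template
    target_ids = encode(f" <think>{cot}</think>\n{output}")   # assumed CoT delimiters
    return {
        "input_ids": prompt_ids + target_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + target_ids,  # only CoT + answer contribute to loss
    }
```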
RL
- The same reward model is used for the RL stage, but this time, instead of showing the CoT, only the prompt, the spec, and the output are given to the judge (reward sketch below)
- We can add some heuristic reward on top if we have the GT for it, e.g., checking whether the proper sections are included and, if so, scoring by completeness
- Also, it's unclear which RL method they used; since the RL seems to be based on a single reward score, it could be KTO, GRPO, PPO, or another method
-
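A sketch of the RL reward as I understand it (the `judge_score` interface and the heuristic completeness bonus are my additions; the paper only says a spec-aware judge scores the output without seeing the CoT):

```python
def rl_reward(prompt, output, category, category_specs, judge_score,
              required_sections=None, heuristic_weight=0.2):
    """Reward for one rollout: a spec-aware judge scores the final output only
    (the CoT is hidden from the judge), plus an optional heuristic completeness
    bonus when ground-truth required sections are known."""
    reward = judge_score(prompt, output, category_specs[category])  # judge sees prompt + spec + output
    if required_sections:
        present = sum(1 for s in required_sections if s.lower() in output.lower())
        reward += heuristic_weight * present / len(required_sections)  # partial credit for completeness
    return reward
```

This scalar reward could feed a PPO/GRPO-style policy update; the paper does not say which algorithm was used.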
Annotations
« In the second stage, we use high-compute RL to train the model to think more effectively. »(2)
« To do so, we provide reward signal using a judge LLM that is given our safety specifications. »(2)
« This addresses a major challenge of standard LLM safety training—its heavy dependence on large-scale, human-labeled data: »(2)
« given access to our actual safety policies, o1 models are often able to correctly reason over how to respond to potentially unsafe prompts. »(3)
« Indeed, as we find in Section 4.1, deliberative alignment more reliably aligns the model to specifications than providing those specifications at deployment time. »(3)
« Examples of safety categories include: erotic content, extremism, harassment, illicit behavior, regulated advice, self-harm, and violence. »(4)
« we formulate category-specific policy specifications (denoted as spec(category) that provide high level details about all the safety categories (as well as principles of style and helpfulness) and granular details only about the relevant category. »(4)
Not directly relevant, but shouldn't the conversation end on an assistant turn?
Answered in the next section: the output is about the last user turn
« Specifically, after filtering out low-quality completions (e.g., those that are malformed or in the wrong format), we judge each completion k times, using a reasoning model GRM that is also given access to the category-specific safety specification spec(category). »(6)
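Roughly how I picture this filtering step (the aggregation of the k scores, the threshold, and the `judge_score`/`is_well_formed` helpers are assumptions):

```python
def filter_completions(candidates, category_specs, judge_score, is_well_formed, k=4, threshold=0.8):
    """Drop malformed/low-quality completions, then keep the ones whose average of k
    judge scores (from a reasoning grader given spec(category)) clears a threshold."""
    kept = []
    for ex in candidates:
        if not is_well_formed(ex["cot"], ex["output"]):
            continue  # e.g., malformed or wrong-format completions
        scores = [
            judge_score(ex["prompt"], ex["cot"], ex["output"], category_specs[ex["category"]])
            for _ in range(k)
        ]
        if sum(scores) / k >= threshold:
            kept.append(ex)
    return kept
```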
« “In your answer, consider that another AI determined that ...” »(7)
Full SFT or LoRA?
« along with other capabilities data »(8)
We need to ensure there are enough examples for each category; otherwise the model will not learn some of the needed specs
« We avoid applying direct optimization pressure on the CoT during RL to enable the underlying model to reduce the chance of encouraging deceptive CoTs. »(8)
« this particular reward signal for RL was added for training the o1 model and o3-mini. »(8)
« For each ModAPI category, we select the 200 conversations with the highest ModAPI score on the last user turn. »(9)
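A quick sketch of that selection, assuming each conversation record carries its category and the ModAPI score of the last user turn:

```python
from collections import defaultdict

def top_k_per_category(conversations, k=200):
    """Pick the k conversations with the highest ModAPI score on the last user turn, per category."""
    by_category = defaultdict(list)
    for convo in conversations:
        by_category[convo["category"]].append(convo)
    return {
        category: sorted(items, key=lambda c: c["last_turn_modapi_score"], reverse=True)[:k]
        for category, items in by_category.items()
    }
```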
« A key reason for this difference is that we updated our safe completion guidelines between the releases of o1-preview and o1. »(9)
So we could report results on the internal dataset too, as an extra signal?
« Error bars are estimated using bootstrap resampling at the 0.95 level. »(13)
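For reference, a minimal percentile-bootstrap CI over per-example scores (my own sketch, not the paper's evaluation code):

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(((1 - level) / 2) * n_resamples)]
    hi = means[int((1 - (1 - level) / 2) * n_resamples) - 1]
    return lo, hi
```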
« We hypothesize this is because StrongREJECT and regulated advice style adherence are more difficult tasks for the model than the others. »(14)
« Finally, to measure policy retrieval accuracy, we compute the fraction of prompts where the derived safety category exists and matches the safety category of the prompt. »(17)
They used an LLM as a judge here
« Specifically, we extract any excerpts that mention the words {“policy”, “policies”, “guideline”, “allowed”} »(17)
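A rough sketch of the retrieval-accuracy check as I understand it: the keyword-based excerpt extraction comes from the quote, while `derive_category` (the LLM-as-judge call) is an assumed helper:

```python
import re

POLICY_KEYWORDS = ("policy", "policies", "guideline", "allowed")

def extract_policy_excerpts(cot, window=200):
    """Pull short excerpts around any mention of the policy-related keywords in the CoT."""
    excerpts = []
    for match in re.finditer("|".join(POLICY_KEYWORDS), cot, flags=re.IGNORECASE):
        start, end = max(0, match.start() - window), match.end() + window
        excerpts.append(cot[start:end])
    return excerpts

def policy_retrieval_accuracy(examples, derive_category):
    """derive_category(excerpts) -> category name or None (LLM-as-judge; assumed helper).
    Accuracy = fraction of prompts where a category is derived and matches the prompt's category."""
    hits = 0
    for ex in examples:
        derived = derive_category(extract_policy_excerpts(ex["cot"]))
        hits += int(derived is not None and derived == ex["category"])
    return hits / len(examples)
```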
Date : 01-08-2025
Authors : Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, Amelia Glaese
Paper Link : http://arxiv.org/abs/2412.16339
Zotero Link: Full Text PDF
Tags : #Computer-Science---Artificial-Intelligence, #Computer-Science---Computation-and-Language, #Computer-Science---Machine-Learning, #Computer-Science---Computers-and-Society
Citation :