Deliberative Alignment: Reasoning Enables Safer Language Models
Summary
- This paper uses a CoT-based distillation method to inject the safety-spec knowledge into the model and have it effectively recalled based on the prompt
- They do this via multiple stages:
-
Data Generation:
- Given the category-specific instruction + high-level instructions for all categories + style guidelines + the prompt, a reasoner model generates the CoT and the output (see the sketch after this block)
- These generated data are later filtered based on a reward model's score
-
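A minimal sketch of how I read the generation step (helper names like `reasoner_generate` and the spec layout are my assumptions, not the paper's code):

```python
# Sketch of the data-generation stage; all helper names are hypothetical.
def build_generation_input(prompt, category, category_specs, high_level_spec, style_guidelines):
    """Concatenate: granular spec for the relevant category, high-level spec for all
    categories, style/helpfulness guidelines, and the user prompt."""
    return "\n\n".join([
        category_specs[category],   # spec(category): granular policy for the relevant category only
        high_level_spec,            # high-level details about all safety categories
        style_guidelines,           # principles of style and helpfulness
        f"User prompt:\n{prompt}",
    ])

def generate_candidates(prompts, categories, category_specs, high_level_spec,
                        style_guidelines, reasoner_generate):
    """reasoner_generate(text) -> (cot, output): assumed interface to the spec-conditioned reasoner."""
    data = []
    for prompt, category in zip(prompts, categories):
        cot, output = reasoner_generate(
            build_generation_input(prompt, category, category_specs, high_level_spec, style_guidelines)
        )
        data.append({"prompt": prompt, "category": category, "cot": cot, "output": output})
    return data  # filtered afterwards by a spec-aware judge's score (see filtering sketch in the annotations)
```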
SFT
- Using only the prompt as input, SFT teaches the model to generate the CoT and the output (sketch below)
- This forces the model to recall the spec information from its own weights
- One confusion here: did they do full SFT or LoRA? Full SFT is susceptible to catastrophic forgetting
-
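A sketch of how the SFT examples might be formatted; the chat template, CoT delimiters, and loss-masking details are my assumptions, the key point from the paper being that the spec is absent from the input:

```python
IGNORE_INDEX = -100  # label value ignored by cross-entropy loss in most frameworks

def build_sft_example(prompt, cot, output, encode):
    """encode(text) -> list[int] is an assumed tokenizer interface.
    Input = the prompt only (no spec in context); target = CoT followed by the final answer.
    Loss is masked on the prompt tokens, so the model must recall the relevant spec from its weights."""
    prompt_ids = encode(f"User: {prompt}\nAssistant:")        # assumed chat template
    target_ids = encode(f" <think>{cot}</think>\n{output}")   # assumed CoT delimiters
    return {
        "input_ids": prompt_ids + target_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + target_ids,  # only CoT + answer contribute to loss
    }
```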
RL
- The same reward model is used for the RL stage, but this time, instead of showing the CoT, only the prompt, the spec, and the output are given to the judge (reward sketch below)
- We can add some heuristic reward on top if we have the GT for it, e.g., checking whether the proper sections are included and, if so, scoring by completeness
- Also, it's unclear which RL method they used; since the RL seems to be based on a single reward score, it could be KTO, GRPO, PPO, or another method
-
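A sketch of the RL reward as I understand it (the `judge_score` interface and the heuristic completeness bonus are my additions; the paper only says a spec-aware judge scores the output without seeing the CoT):

```python
def rl_reward(prompt, output, category, category_specs, judge_score,
              required_sections=None, heuristic_weight=0.2):
    """Reward for one rollout: a spec-aware judge scores the final output only
    (the CoT is hidden from the judge), plus an optional heuristic completeness
    bonus when ground-truth required sections are known."""
    reward = judge_score(prompt, output, category_specs[category])  # judge sees prompt + spec + output
    if required_sections:
        present = sum(1 for s in required_sections if s.lower() in output.lower())
        reward += heuristic_weight * present / len(required_sections)  # partial credit for completeness
    return reward
```

This scalar reward could feed a PPO/GRPO-style policy update; the paper does not say which algorithm was used.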
Annotations
« In the second stage, we use high-compute RL to train the model to think more effectively. »(2)
« To do so, we provide reward signal using a judge LLM that is given our safety specifications. »(2)
« This addresses a major challenge of standard LLM safety training—its heavy dependence on large-scale, human-labeled data: »(2)
« given access to our actual safety policies, o1 models are often able to correctly reason over how to respond to potentially unsafe prompts. »(3)
« Indeed, as we find in Section 4.1, deliberative alignment more reliably aligns the model to specifications than providing those specifications at deployment time. »(3)
« Examples of safety categories include: erotic content, extremism, harassment, illicit behavior, regulated advice, self-harm, and violence. »(4)
« we formulate category-specific policy specifications (denoted as spec(category) that provide high level details about all the safety categories (as well as principles of style and helpfulness) and granular details only about the relevant category. »(4)
Not directly relevant, but shouldn't the conversation end on an assistant turn?
Answered in the next section: the output is about the last user turn
« Specifically, after filtering out low-quality completions (e.g., those that are malformed or in the wrong format), we judge each completion k times, using a reasoning model GRM that is also given access to the category-specific safety specification spec(category). »(6)
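Roughly how I picture this filtering step (the aggregation of the k scores, the threshold, and the `judge_score`/`is_well_formed` helpers are assumptions):

```python
def filter_completions(candidates, category_specs, judge_score, is_well_formed, k=4, threshold=0.8):
    """Drop malformed/low-quality completions, then keep the ones whose average of k
    judge scores (from a reasoning grader given spec(category)) clears a threshold."""
    kept = []
    for ex in candidates:
        if not is_well_formed(ex["cot"], ex["output"]):
            continue  # e.g., malformed or wrong-format completions
        scores = [
            judge_score(ex["prompt"], ex["cot"], ex["output"], category_specs[ex["category"]])
            for _ in range(k)
        ]
        if sum(scores) / k >= threshold:
            kept.append(ex)
    return kept
```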
« “In your answer, consider that another AI determined that ...” »(7)
Full SFT or LoRA?
« along with other capabilities data »(8)
We need to ensure there are enough examples for each category; otherwise the model will not learn some of the needed specs
« We avoid applying direct optimization pressure on the CoT during RL to enable the underlying model to reduce the chance of encouraging deceptive CoTs. »(8)
« this particular reward signal for RL was added for training the o1 model and o3-mini. »(8)
« For each ModAPI category, we select the 200 conversations with the highest ModAPI score on the last user turn. »(9)
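A quick sketch of that selection, assuming each conversation record carries its category and the ModAPI score of the last user turn:

```python
from collections import defaultdict

def top_k_per_category(conversations, k=200):
    """Pick the k conversations with the highest ModAPI score on the last user turn, per category."""
    by_category = defaultdict(list)
    for convo in conversations:
        by_category[convo["category"]].append(convo)
    return {
        category: sorted(items, key=lambda c: c["last_turn_modapi_score"], reverse=True)[:k]
        for category, items in by_category.items()
    }
```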
« A key reason for this difference is that we updated our safe completion guidelines between the releases of o1-preview and o1. »(9)
So we could report results on the internal dataset too, as an extra signal?
« Error bars are estimated using bootstrap resampling at the 0.95 level. »(13)
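For reference, a minimal percentile-bootstrap CI over per-example scores (my own sketch, not the paper's evaluation code):

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(((1 - level) / 2) * n_resamples)]
    hi = means[int((1 - (1 - level) / 2) * n_resamples) - 1]
    return lo, hi
```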
« We hypothesize this is because StrongREJECT and regulated advice style adherence are more difficult tasks for the model than the others. »(14)
« Finally, to measure policy retrieval accuracy, we compute the fraction of prompts where the derived safety category exists and matches the safety category of the prompt. »(17)
They used an LLM as a judge here
« Specifically, we extract any excerpts that mention the words {“policy”, “policies”, “guideline”, “allowed”} »(17)
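A rough sketch of the retrieval-accuracy check as I understand it: the keyword-based excerpt extraction comes from the quote, while `derive_category` (the LLM-as-judge call) is an assumed helper:

```python
import re

POLICY_KEYWORDS = ("policy", "policies", "guideline", "allowed")

def extract_policy_excerpts(cot, window=200):
    """Pull short excerpts around any mention of the policy-related keywords in the CoT."""
    excerpts = []
    for match in re.finditer("|".join(POLICY_KEYWORDS), cot, flags=re.IGNORECASE):
        start, end = max(0, match.start() - window), match.end() + window
        excerpts.append(cot[start:end])
    return excerpts

def policy_retrieval_accuracy(examples, derive_category):
    """derive_category(excerpts) -> category name or None (LLM-as-judge; assumed helper).
    Accuracy = fraction of prompts where a category is derived and matches the prompt's category."""
    hits = 0
    for ex in examples:
        derived = derive_category(extract_policy_excerpts(ex["cot"]))
        hits += int(derived is not None and derived == ex["category"])
    return hits / len(examples)
```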
Date : 01-08-2025
Authors : Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, Amelia Glaese
Paper Link : http://arxiv.org/abs/2412.16339
Zotero Link: Full Text PDF
Tags : #Computer-Science---Artificial-Intelligence, #Computer-Science---Computation-and-Language, #Computer-Science---Machine-Learning, #Computer-Science---Computers-and-Society
Citation :