Deliberative Alignment - Reasoning Enables Safer Language Models

Summary

Annotations

Annotation

« We used this approach to align OpenAI’s o-series models [1], and achieved highly precise adherence to OpenAI’s safety policies, without requiring human-written chain-of-thoughts or answers. »()

Annotation

« We propose deliberative alignment, a training approach that teaches LLMs to explicitly reason through safety specifications before producing an answer. »()

Annotation

« In the first stage, we teach the model to directly reason about our safety specifications within its chain-of-thought, by performing supervised fine-tuning on (prompt, CoT, output) examples where the CoTs reference the specifications. »()

Annotation

« Concretely, we present the model with the safety specifications as part of the system prompt, generate model completions, and then strip away the system prompts to form the final dataset. »()
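
A minimal sketch of that data-generation step, assuming a generic chat-completion interface (the `generate_completion` helper and the record layout are placeholders, not the paper's code):

```python
# Sketch of the SFT data-generation step described above: the safety spec is
# shown to the model only at generation time, then stripped so the stored
# example contains just (prompt, CoT, output).

def generate_completion(messages: list[dict]) -> dict:
    """Placeholder: call a reasoning model and return {'cot': ..., 'output': ...}."""
    raise NotImplementedError

def build_sft_example(prompt: str, category_spec: str) -> dict:
    messages = [
        {"role": "system", "content": category_spec},  # spec visible only during generation
        {"role": "user", "content": prompt},
    ]
    completion = generate_completion(messages)
    # Strip the system prompt: at SFT time the model must learn to recall the
    # spec and cite it in its CoT, rather than read it from context.
    return {
        "prompt": prompt,
        "cot": completion["cot"],
        "output": completion["output"],
    }
```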

Annotation

« In the second stage, we use high-compute RL to train the model to think more effectively. »(2)

Annotation

« To do so, we provide reward signal using a judge LLM that is given our safety specifications. »(2)
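
A hedged sketch of how such a spec-grounded reward could be wired up; the judge prompt and the 0-to-1 scoring convention are assumptions, the paper only says a judge LLM is given the safety specifications:

```python
# Sketch of a spec-aware judge reward for RL. The template and scale are
# illustrative; any judge model exposed as a callable would do here.

JUDGE_TEMPLATE = """You are grading an assistant response against the safety specification below.
Specification:
{spec}

User prompt:
{prompt}

Assistant response:
{response}

Return a single number between 0 and 1, where 1 means full compliance."""

def judge_reward(judge_llm, spec: str, prompt: str, response: str) -> float:
    """Score a policy-model response with a judge LLM that can read the spec."""
    grading_prompt = JUDGE_TEMPLATE.format(spec=spec, prompt=prompt, response=response)
    raw = judge_llm(grading_prompt)  # e.g. returns "0.8"
    return float(raw.strip())
```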

Annotation

« This addresses a major challenge of standard LLM safety training—its heavy dependence on large-scale, human-labeled data: »(2)

Annotation

« given access to our actual safety policies, o1 models are often able to correctly reason over how to respond to potentially unsafe prompts. »(3)

Annotation

« Indeed, as we find in Section 4.1, deliberative alignment more reliably aligns the model to specifications than providing those specifications at deployment time. »(3)

Annotation

« Examples of safety categories include: erotic content, extremism, harassment, illicit behavior, regulated advice, self-harm, and violence. »(4)

Annotation

« we formulate category-specific policy specifications (denoted as spec(category)) that provide high level details about all the safety categories (as well as principles of style and helpfulness) and granular details only about the relevant category. »(4)
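
That composition could look roughly like the sketch below; the dictionaries and field names are made-up placeholders, since the paper only says the spec combines high-level guidance for all categories with granular detail for the relevant one:

```python
# Sketch of composing spec(category): short guidance for every category,
# granular policy text only for the category the prompt falls under.

HIGH_LEVEL = {
    "self-harm": "Respond with care; never provide instructions that facilitate self-harm.",
    "illicit behavior": "Do not provide actionable help for wrongdoing.",
    # ... one short entry per safety category, plus style/helpfulness principles
}

GRANULAR = {
    "self-harm": "Full granular policy for self-harm: allowed, disallowed, safe-completion guidance ...",
    "illicit behavior": "Full granular policy for illicit behavior ...",
}

def spec(category: str) -> str:
    sections = ["# General principles (all categories)"]
    sections += [f"- {cat}: {text}" for cat, text in HIGH_LEVEL.items()]
    sections += [f"# Detailed policy: {category}", GRANULAR[category]]
    return "\n".join(sections)
```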

Annotation

Not relevant, but shouldn't the conversation end on an assistant turn?

Got it from the next section: the output is generated for the last user turn.

Annotation

« Specifically, after filtering out low-quality completions (e.g., those that are malformed or in the wrong format), we judge each completion k times, using a reasoning model GRM that is also given access to the category-specific safety specification spec(category). »(6)
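
A rough sketch of that filter-then-judge-k-times step; the malformed-completion check, the averaging of the k scores, and the best-of-n selection are assumptions about details the excerpt doesn't spell out:

```python
# Sketch of filtering completions and scoring each survivor k times with a
# reasoning judge (G_RM) that sees spec(category).

def is_well_formed(completion: dict) -> bool:
    """Placeholder quality filter: drop malformed / wrong-format completions."""
    return bool(completion.get("cot")) and bool(completion.get("output"))

def score_completion(judge, completion: dict, prompt: str, category_spec: str, k: int = 4) -> float:
    scores = [
        judge(spec=category_spec, prompt=prompt, response=completion["output"])
        for _ in range(k)
    ]
    return sum(scores) / k  # average the k judgments (assumption)

def best_completion(judge, completions: list[dict], prompt: str, category_spec: str) -> dict:
    candidates = [c for c in completions if is_well_formed(c)]
    return max(candidates, key=lambda c: score_completion(judge, c, prompt, category_spec))
```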

Annotation

« “In your answer, consider that another AI determined that ...” »(7)

Annotation

Full SFT or LoRA?

arxiv.org/pdf/2308.08747

Annotation

« along with other capabilities data »(8)

Annotation

Need to ensure we have enough examples for each category; otherwise the model will not learn some of the needed specs.

Annotation

We could add a heuristic reward score wherever we have ground truth, e.g. check whether the required sections are included and, if so, score by completeness. A sketch follows below.
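
A small sketch of that idea; the section names, categories, and blending weight are made up for illustration:

```python
# Sketch of the heuristic reward from the note above: when ground truth tells us
# which sections a compliant answer should contain, give partial credit for how
# many of them actually appear.

REQUIRED_SECTIONS = {
    "regulated advice": ["disclaimer", "general information", "referral to a professional"],
    "self-harm": ["empathetic acknowledgement", "crisis resources"],
}

def heuristic_section_reward(category: str, response: str) -> float:
    """Fraction of the required sections (for this category) present in the response."""
    required = REQUIRED_SECTIONS.get(category, [])
    if not required:
        return 0.0
    hits = sum(1 for section in required if section.lower() in response.lower())
    return hits / len(required)

# Could be blended with the judge reward, e.g.:
# total_reward = judge_score + 0.2 * heuristic_section_reward(category, response)
```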

Annotation

« We avoid applying direct optimization pressure on the CoT during RL to enable the underlying model to reduce the chance of encouraging deceptive CoTs. »(8)

Annotation

« this particular reward signal for RL was added for training the o1 model and o3-mini. »(8)

Annotation

« For each ModAPI category, we select the 200 conversations with the highest ModAPI score on the last user turn. »(9)
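
In code, that selection might look like the sketch below; the conversation record layout is an assumption:

```python
# Sketch of the test-set selection described above: for each Moderation-API
# category, keep the 200 conversations whose last user turn scores highest.

from collections import defaultdict

def select_top_conversations(conversations: list[dict], n: int = 200) -> dict[str, list[dict]]:
    """conversations: [{'messages': [...], 'modapi_scores': {category: score_on_last_user_turn}}]"""
    by_category = defaultdict(list)
    for convo in conversations:
        for category, score in convo["modapi_scores"].items():
            by_category[category].append((score, convo))
    selected: dict[str, list[dict]] = {}
    for category, scored in by_category.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)  # highest last-user-turn score first
        selected[category] = [convo for _, convo in scored[:n]]
    return selected
```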

Annotation

« A key reason for this difference is that we updated our safe completion guidelines between the releases of o1-preview and o1. »(9)

Annotation

So we could report results on the internal dataset too, as an extra signal?

Annotation

« Error bars are estimated using bootstrap resampling at the 0.95 level. »(13)
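
For reference, a minimal sketch of bootstrap error bars at the 0.95 level using the percentile method; whether the paper uses this exact variant is an assumption:

```python
# Minimal percentile-bootstrap sketch for the 0.95-level error bars mentioned above.

import random

def bootstrap_ci(values: list[float], n_resamples: int = 10_000, level: float = 0.95) -> tuple[float, float]:
    means = []
    for _ in range(n_resamples):
        sample = random.choices(values, k=len(values))  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Example: per-prompt pass/fail scores -> 95% CI on the pass rate
# low, high = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1])
```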

Annotation

« We hypothesize this is because StrongREJECT and regulated advice style adherence are more difficult tasks for the model than the others. »(14)

Annotation

« Finally, to measure policy retrieval accuracy, we compute the fraction of prompts where the derived safety category exists and matches the safety category of the prompt. »(17)

Used LLM-as-a-judge for this.

Annotation

« Specifically, we extract any excerpts that mention the words {“policy”, “policies”, “guideline”, “allowed”} »(17)
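
Putting the last two quotes together, the retrieval-accuracy measurement could be sketched like this; splitting the CoT into sentences and the judge interface are assumptions (the category classification is the LLM-as-a-judge step noted above):

```python
# Sketch of policy-retrieval accuracy: pull CoT excerpts that mention
# policy-related keywords, let a judge map them to a safety category, and count
# how often that derived category matches the prompt's label.

import re

POLICY_KEYWORDS = ("policy", "policies", "guideline", "allowed")

def policy_excerpts(cot: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", cot)
    return [s for s in sentences if any(k in s.lower() for k in POLICY_KEYWORDS)]

def retrieval_accuracy(examples: list[dict], classify_category) -> float:
    """examples: [{'cot': str, 'category': str}]; classify_category: LLM-judge placeholder."""
    hits = 0
    for ex in examples:
        excerpts = policy_excerpts(ex["cot"])
        derived = classify_category(excerpts) if excerpts else None
        hits += int(derived == ex["category"])
    return hits / len(examples)
```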


Related Notes