Reinforcement Learning from Human Feedback (RLHF)

RLHF involves multiple steps. The following diagram shows the whole process.

Borrowed from Reference 3

Step 1: Pre-Training Language Model (LM)

In this step, a LM is trained on a large collection of (mostly low-quality) data, or one can use an already pre-trained LM like GPT-3 and skip this step. The main idea of this step is to obtain an initial pre-trained model that has a sense of real-world data.

The input corpus for this step is just raw text, and the model is trained with a self-supervised objective such as next-token prediction (for GPT-style models) or masked language modeling. So the training is unsupervised, in the sense that no human labels are needed.
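Below is a minimal sketch of the next-token-prediction objective, assuming the HuggingFace transformers library with GPT-2 as a stand-in for the base LM (the model name and example text are illustrative, not what GPT-3 actually used).

```python
# Minimal sketch of LM pre-training: next-token prediction with cross-entropy.
# "gpt2" is a stand-in backbone; the optimizer loop is omitted for brevity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Passing labels=input_ids makes the model compute the standard
# cross-entropy loss for predicting each next token.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # one unsupervised training step (optimizer step omitted)
```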

Step 2: Supervised Finetuning (SFT) for dialogue

Previously, the LM was pre-trained on data from the whole internet. But ChatGPT, or any dialogue model, is trained to respond to the user's question. Without this step, if the pre-trained model gets an input like "what are the tourist spots", it could produce one of three kinds of responses:

  1. continue the sentence: "in america?"
  2. continue with more questions: "? What are ways to visit them?"
  3. actually answer the question

We want the model to always generate option 3. So we fine-tune the pre-trained model on high-quality dialogue data.

In this step, the training corpus is a set of (prompt, response) pairs, and the model is trained with the usual cross-entropy loss using Supervised Learning.
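A minimal sketch of this SFT step is shown below, assuming the same HuggingFace setup as above. Masking the prompt tokens out of the loss (via the -100 ignore-index convention) is one common choice, not necessarily what any particular system did; the prompt and response strings are made up.

```python
# SFT sketch: cross-entropy on (prompt, response) pairs, scoring only the response tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "What are the tourist spots in America?"
response = " The Grand Canyon, Yellowstone, and the Statue of Liberty."

prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
full_ids = tokenizer(prompt + response, return_tensors="pt")["input_ids"]

# Labels: -100 on the prompt positions tells the loss to ignore them,
# so the model is only trained to produce the response.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss  # cross-entropy on response tokens
loss.backward()
```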

Step 3: RLHF:

In this step, the fine-tuned model is trained further using Reinforcement Learning, where the feedback comes from humans. There are two sub-steps:

Step 3.1: Training the Reward Model (RM):

In this step, another model is trained to replicate human feedback. It is trained on a human-labeled corpus where each example has the format (prompt, good_response, bad_response). It would be much easier to train if we could collect a single absolute score per response, but human-assigned scores are biased and unstable; the more stable setup is to compare two responses.

The reward model can be initialized from a pre-trained LM, trained from scratch, or initialized from the SFT model. In practice, the RM needs to be at least as capable as the SFT model whose outputs it scores, so seeding it from the SFT model is a good approach.
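One common way to build such a reward model is to put a scalar head on top of an LM backbone. The sketch below is illustrative: the RewardModel class, its forward signature, and the choice of reading the score off the last token are assumptions, not a library API.

```python
# Reward model sketch: LM backbone (e.g. initialized from the SFT model)
# plus a linear head mapping the last token's hidden state to a scalar score.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Score r(x, y) read off the final token's representation.
        return self.score_head(hidden[:, -1, :]).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
rm = RewardModel()
ids = tokenizer("a prompt" + " a candidate response", return_tensors="pt")["input_ids"]
score = rm(ids)  # shape: (batch,)
```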

The RM is then trained with the loss function below, so that it produces a single score given a prompt and a response.

$$
\begin{aligned}
\text{prompt} &= x \\
\text{winning response} &= y_w \\
\text{losing response} &= y_l \\
\text{score}_w &= r(x, y_w) \\
\text{score}_l &= r(x, y_l) \\
\text{loss} &= -\log\big(\sigma(\text{score}_w - \text{score}_l)\big)
\end{aligned}
$$
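Written out in code, this pairwise ranking loss is just a negative log-sigmoid of the score difference. The scores here are assumed to come from a reward model like the sketch above; the numbers are made up.

```python
# loss = -log(sigmoid(r(x, y_w) - r(x, y_l))), averaged over the batch.
import torch
import torch.nn.functional as F

def rm_pairwise_loss(score_w: torch.Tensor, score_l: torch.Tensor) -> torch.Tensor:
    # logsigmoid is numerically more stable than log(sigmoid(...))
    return -F.logsigmoid(score_w - score_l).mean()

score_w = torch.tensor([1.7, 0.3])   # r(x, y_w) for a small batch
score_l = torch.tensor([0.9, -0.5])  # r(x, y_l)
print(rm_pairwise_loss(score_w, score_l))
```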

Step 3.2: Fine Tune the LM using the RM scoring:

This is the final step of RLHF. The objective is to score the responses generated by the current model in a way that mimics human feedback. If a response gets a bad score from the RM, the LM learns to stop generating that kind of response, or to improve it.

We could have simply used the negative RM score as the loss value and back-propagated through it. The problem is that the input to the RM is decoded text, and the decoding (sampling) procedure is non-differentiable. This is where Reinforcement Learning comes in: the LM is treated as the policy and the RM score as the reward. ChatGPT uses Proximal Policy Optimization (PPO) for this.

During this process, a prompt is randomly sampled, the LM generates a response, and that response goes to the RM to get a score, which is used as the RL reward to update the LM. A KL-divergence penalty between the initial LM's output distribution and the current LM's output distribution is also added, so that the model does not diverge too far from the base outputs. This prevents the LM from producing gibberish that merely fools the RM into giving a high score.
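The sketch below shows only how the RM score and the KL penalty might be combined into the RL reward; the full PPO update is omitted, and the per-token KL approximation (log-probability difference) and the beta coefficient are illustrative choices, not the exact formulation any specific system used.

```python
# Reward shaping sketch: RM score minus a KL penalty against the frozen initial LM.
import torch

def rlhf_reward(rm_score: torch.Tensor,
                policy_logprobs: torch.Tensor,   # log p_current(token) per response token
                ref_logprobs: torch.Tensor,      # log p_initial(token) per response token
                beta: float = 0.1) -> torch.Tensor:
    # Per-token KL estimate between the current policy and the frozen initial LM.
    kl = policy_logprobs - ref_logprobs
    # Reward = RM score for the whole response minus the KL penalty,
    # which keeps the policy close to the base model's outputs.
    return rm_score - beta * kl.sum(-1)

rm_score = torch.tensor([2.1])
policy_lp = torch.tensor([[-1.2, -0.8, -2.0]])
ref_lp = torch.tensor([[-1.0, -0.9, -2.2]])
print(rlhf_reward(rm_score, policy_lp, ref_lp))
```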

Full RLHF Process

Finally, the RLHF fine-tuned LM (which started from the SFT model) is the model that gets used.


References

  1. https://huggingface.co/blog/rlhf
  2. https://gist.github.com/JoaoLages/c6f2dfd13d2484aa8bb0b2d567fbf093
  3. https://huyenchip.com/2023/05/02/rlhf.html
