Investigating Continual Pretraining in Large Language Models - Insights and Implications
Summary
- In this paper, the authors use a VQ-VAE-style approach to compress the reasoning path (CoT) into discrete latent tokens, which are then used to fine-tune a model
- To achieve this, they train in two stages
- In the first stage, training uses a reconstruction loss: the input (prompt, CoT, solution) is compressed and quantized into K vectors, and the whole input must then be reconstructed given only the prompt and the quantized embeddings (see the quantization sketch after this list)
- In the second stage, every input is transformed using the quantized vectors (illustrated in the paper's figure) and the model is fine-tuned with a cross-entropy loss
- During training, they compress the first m tokens of the CoT (left to right); training uses mixed samples with varying lengths m, chosen at random (see the sample-construction sketch after this list)
- They use a chunk size of 16 tokens and compress each chunk of 16 tokens into a single latent token
- The codebook has size 1024
- The value of m is chosen from the set M = {0, 72, 128, 160, 192, 224, 256}, all multiples of 16 tokens
- Unlike the CCoT paper, their method shows real improvements even after compression (even on different datasets of a similar domain)
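The chunk-wise quantization described above can be pictured with the short PyTorch sketch below. It is a minimal illustration under assumed names and architecture (`ChunkQuantizer`, a simple linear pooling encoder, the 0.25 commitment weight), not the authors' implementation: every 16 CoT token embeddings are pooled into one vector, snapped to the nearest entry of a 1024-entry codebook, and passed through a straight-through estimator. The first-stage reconstruction loss would then be computed by a decoder that sees only the prompt plus these quantized embeddings.

```python
# Minimal sketch (assumed names/architecture, not the paper's code):
# compress each chunk of 16 CoT token embeddings into one discrete latent token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkQuantizer(nn.Module):
    def __init__(self, d_model: int, chunk_size: int = 16, codebook_size: int = 1024):
        super().__init__()
        self.chunk_size = chunk_size
        self.codebook = nn.Embedding(codebook_size, d_model)   # 1024 latent codes
        self.pool = nn.Linear(d_model * chunk_size, d_model)   # one chunk -> one vector

    def forward(self, cot_embeds: torch.Tensor):
        # cot_embeds: (batch, m, d_model), with m a multiple of chunk_size
        b, m, d = cot_embeds.shape
        chunks = cot_embeds.reshape(b * (m // self.chunk_size), self.chunk_size * d)
        z_e = self.pool(chunks)                                 # continuous chunk vectors

        # Vector quantization: nearest codebook entry per chunk.
        dists = torch.cdist(z_e, self.codebook.weight)          # (b * m/16, 1024)
        codes = dists.argmin(dim=-1)                            # discrete latent tokens
        z_q = self.codebook(codes)

        # Straight-through estimator so the encoder still receives gradients.
        z_q_st = z_e + (z_q - z_e).detach()

        # Standard VQ-VAE codebook + commitment losses.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + 0.25 * F.mse_loss(z_e, z_q.detach())

        return (z_q_st.reshape(b, m // self.chunk_size, d),
                codes.reshape(b, m // self.chunk_size),
                vq_loss)
```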
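The second-stage sample construction (random prefix compression with m drawn from M) might look like the following sketch. `quantize_chunks` and the idea of placing latent ids in an extended vocabulary are assumptions for illustration, not details from the paper.

```python
# Illustrative sketch of stage-2 data mixing (names are assumptions, not the paper's code).
import random
from typing import Callable, List

M = [0, 72, 128, 160, 192, 224, 256]   # candidate compression lengths from the summary
CHUNK = 16                             # tokens per latent code

def build_mixed_sample(prompt_ids: List[int],
                       cot_ids: List[int],
                       solution_ids: List[int],
                       quantize_chunks: Callable[[List[int]], List[int]]) -> List[int]:
    """Replace the leftmost m CoT tokens with their discrete latent tokens
    (one latent id per chunk of 16), then concatenate prompt, latents, the
    remaining CoT text tokens, and the solution for cross-entropy fine-tuning."""
    # Pick a compression length that fits inside this CoT, aligned to the chunk size.
    m = random.choice([x for x in M if x <= len(cot_ids)])
    m = (m // CHUNK) * CHUNK

    latent_ids = quantize_chunks(cot_ids[:m])    # len(latent_ids) == m // CHUNK
    return prompt_ids + latent_ids + cot_ids[m:] + solution_ids
```

In this sketch the latent ids are assumed to live in an extended vocabulary appended after the text tokens, so a single cross-entropy loss covers both token types.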
Annotations
Metadata
Date : 02-27-2024
Authors : Çağatay Yıldız, Nishaanth Kanna Ravichandran, Prishruit Punia, Matthias Bethge, Beyza Ermis
Paper Link : http://arxiv.org/abs/2402.17400
Zotero Link: Full Text PDF
Tags : #Computer-Science---Computation-and-Language
Citation :