Investigating Continual Pretraining in Large Language Models - Insights and Implications
Summary
- In this paper, the authors use a VQ-VAE-style approach to compress the reasoning path (CoT) into discrete latent tokens, which are then used to fine-tune a model
- To achieve this, they train in two stages
- In the first stage, training uses a reconstruction loss: the input (prompt, CoT, solution) is compressed and quantized into K vectors, and the whole input must then be reconstructed given only the prompt and the quantized embeddings (see the quantization sketch after this list)
- In the second stage, every input is transformed using the quantized vectors (illustrated in the paper's figure) and the model is fine-tuned with a cross-entropy loss
- During training, they compress the first m tokens of the CoT (left to right); training uses mixed samples with varying lengths m, chosen at random (see the sample-construction sketch after this list)
- They use a chunk size of 16 tokens and compress each chunk of 16 tokens into a single latent token
- The codebook has size 1024
- The value of m is chosen from the set M = {0, 72, 128, 160, 192, 224, 256}, all multiples of 16 tokens
- Unlike the CCoT paper, their method shows real improvements even after compression (even on different datasets of a similar domain)
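The chunk-wise quantization described above can be pictured with the short PyTorch sketch below. It is a minimal illustration under assumed names and architecture (`ChunkQuantizer`, a simple linear pooling encoder, the 0.25 commitment weight), not the authors' implementation: every 16 CoT token embeddings are pooled into one vector, snapped to the nearest entry of a 1024-entry codebook, and passed through a straight-through estimator. The first-stage reconstruction loss would then be computed by a decoder that sees only the prompt plus these quantized embeddings.

```python
# Minimal sketch (assumed names/architecture, not the paper's code):
# compress each chunk of 16 CoT token embeddings into one discrete latent token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkQuantizer(nn.Module):
    def __init__(self, d_model: int, chunk_size: int = 16, codebook_size: int = 1024):
        super().__init__()
        self.chunk_size = chunk_size
        self.codebook = nn.Embedding(codebook_size, d_model)   # 1024 latent codes
        self.pool = nn.Linear(d_model * chunk_size, d_model)   # one chunk -> one vector

    def forward(self, cot_embeds: torch.Tensor):
        # cot_embeds: (batch, m, d_model), with m a multiple of chunk_size
        b, m, d = cot_embeds.shape
        chunks = cot_embeds.reshape(b * (m // self.chunk_size), self.chunk_size * d)
        z_e = self.pool(chunks)                                 # continuous chunk vectors

        # Vector quantization: nearest codebook entry per chunk.
        dists = torch.cdist(z_e, self.codebook.weight)          # (b * m/16, 1024)
        codes = dists.argmin(dim=-1)                            # discrete latent tokens
        z_q = self.codebook(codes)

        # Straight-through estimator so the encoder still receives gradients.
        z_q_st = z_e + (z_q - z_e).detach()

        # Standard VQ-VAE codebook + commitment losses.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + 0.25 * F.mse_loss(z_e, z_q.detach())

        return (z_q_st.reshape(b, m // self.chunk_size, d),
                codes.reshape(b, m // self.chunk_size),
                vq_loss)
```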
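The second-stage sample construction (random prefix compression with m drawn from M) might look like the following sketch. `quantize_chunks` and the idea of placing latent ids in an extended vocabulary are assumptions for illustration, not details from the paper.

```python
# Illustrative sketch of stage-2 data mixing (names are assumptions, not the paper's code).
import random
from typing import Callable, List

M = [0, 72, 128, 160, 192, 224, 256]   # candidate compression lengths from the summary
CHUNK = 16                             # tokens per latent code

def build_mixed_sample(prompt_ids: List[int],
                       cot_ids: List[int],
                       solution_ids: List[int],
                       quantize_chunks: Callable[[List[int]], List[int]]) -> List[int]:
    """Replace the leftmost m CoT tokens with their discrete latent tokens
    (one latent id per chunk of 16), then concatenate prompt, latents, the
    remaining CoT text tokens, and the solution for cross-entropy fine-tuning."""
    # Pick a compression length that fits inside this CoT, aligned to the chunk size.
    m = random.choice([x for x in M if x <= len(cot_ids)])
    m = (m // CHUNK) * CHUNK

    latent_ids = quantize_chunks(cot_ids[:m])    # len(latent_ids) == m // CHUNK
    return prompt_ids + latent_ids + cot_ids[m:] + solution_ids
```

In this sketch the latent ids are assumed to live in an extended vocabulary appended after the text tokens, so a single cross-entropy loss covers both token types.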
Annotations
Metadata
Date : 02-27-2024
Authors : Çağatay Yıldız, Nishaanth Kanna Ravichandran, Prishruit Punia, Matthias Bethge, Beyza Ermis
Paper Link : http://arxiv.org/abs/2402.17400
Zotero Link: Full Text PDF
Tags : #Computer-Science---Computation-and-Language
Citation :