Token Assorted - Mixing Latent and Text Tokens for Improved Language Model Reasoning

Summary

3+ Most Important Things

1+ Deficiencies

3+ New Ideas

Annotations

Annotation

« In this work, we propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens generated by VQ-VAE, significantly reducing the length of reasoning traces. »()

Annotation

« To facilitate effective learning, we introduce a simple training procedure that randomly mixes latent and text tokens, which enables fast adaptation to new latent tokens »()

Annotation

« we propose to use discrete latent tokens to abstract the initial steps of the reasoning traces. »()

Annotation

« More precisely, we replace the text tokens with their corresponding latent abstractions from left to right until a pre-set location, leaving the remaining tokens unchanged. »()
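A minimal sketch of this left-to-right replacement (my own illustration, not the paper's code; `compress_chunk` is a stand-in for the VQ-VAE encoder, and the chunk size `r = 16` follows the paper's compression rate):

```python
def compress_chunk(chunk):
    # Stand-in for the VQ-VAE encoder: maps a chunk of r text tokens
    # to a single discrete latent token id (a fake placeholder here).
    return f"<latent:{hash(tuple(chunk)) % 1024}>"

def partially_abstract(text_tokens, boundary, r=16):
    # Replace text tokens left to right until the pre-set `boundary`
    # (a multiple of r), leaving the remaining tokens unchanged.
    latent = [compress_chunk(text_tokens[i:i + r])
              for i in range(0, boundary, r)]
    return latent + text_tokens[boundary:]

trace = [f"tok{i}" for i in range(64)]
mixed = partially_abstract(trace, boundary=32, r=16)
# The first 32 text tokens collapse into 2 latent tokens; the last 32 stay.
```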

Annotation

« we employ a randomized replacement strategy: during training, we randomly vary the number of text tokens being substituted by latent tokens for each sample. »(2)
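The randomized strategy could be sketched like this (uniform sampling of m and the placeholder latent ids are my assumptions):

```python
import random

def randomize_mix(text_tokens, r=16, rng=random):
    # Draw m, the number of leading r-token chunks to abstract,
    # uniformly at random for each training sample.
    num_chunks = len(text_tokens) // r
    m = rng.randint(0, num_chunks)
    latent = [f"<latent:{i}>" for i in range(m)]  # placeholder latent ids
    return latent + text_tokens[m * r:]
```

Varying m per sample exposes the model to every mixing ratio, so it can adapt to latent tokens appearing at any prefix length.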

Annotation

« our VQ-VAE is trained on the whole input sequence X, but only applied to C in the next stage »(3)

Annotation

« When applying the VQ-VAE to compress the text tokens, the discrete latent tokens Z are essentially the index of corresponding embeddings in the codebook. »(3)
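In other words, quantization is a nearest-neighbour lookup in the codebook. A small NumPy sketch (the embedding dimension and batch size are assumptions; only the codebook size of 1024 comes from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 64))  # 1024 codes, 64-dim embeddings (dim assumed)
encoded = rng.normal(size=(4, 64))      # 4 encoder output vectors

# Squared L2 distance from each encoding to every codebook entry, then argmin:
dists = ((encoded[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
Z = dists.argmin(axis=1)                # the discrete latent tokens Z (indices)
```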

Annotation

« We delimit the latent tokens by injecting special start and end tokens to encapsulate them. »(4)

Annotation
Annotation

« one remarkable challenge is to deal with the extended vocabulary »(4)

Annotation

« In the context of our approach, this means we increase the values of m in each stage until it reaches a pre-set cap value. »(4)
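That staged curriculum might look like the following toy schedule (the per-stage increment and the cap value are my assumptions, not the paper's settings):

```python
def curriculum_m(stage, increment=1, cap=8):
    # Progressive curriculum: raise the number m of abstracted chunks
    # by `increment` each training stage until it reaches the pre-set cap.
    return min(stage * increment, cap)

# Stages 1..10 give m = 1, 2, ..., 8, then plateau at the cap of 8.
schedule = [curriculum_m(s) for s in range(1, 11)]
```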

Annotation

« where dedicated optimization tuning is needed »(4)

Annotation

« where the value of m is randomly set for each sample »(4)

Annotation

« For each benchmark, we train a VQVAE for 100k steps using the Adam optimizer, with learning rate 10^-5 and batch size 32. We use a codebook of size 1024 and compress every chunk of L = 16 tokens into a single latent token (i.e., the compression rate r = 16). »(5)
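With these settings each latent token stands for 16 text tokens. A quick sanity check of the trace-length saving (the example trace length of 48 is my own):

```python
r = 16                       # compression rate from the paper
codebook_size = 1024         # latent vocabulary added on top of the text vocab

abstracted = 48              # suppose the first 48 reasoning tokens are abstracted
latent = abstracted // r     # 3 latent tokens replace those 48 text tokens
saved = abstracted - latent  # 45 fewer tokens in the reasoning trace
```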


Related Notes