Multi-Head Latent Attention
Multi-head latent attention (MLA) has the same overall structure as multi-head attention, but the keys and values are first compressed into a low-dimensional latent vector, and only that latent is saved to the cache. This dramatically reduces the KV cache size.
During inference, at time step t, the query is generated as usual, while the cached latent vectors are up-projected back to full-dimensional keys and values. Those reconstructed keys and values are then used as the key/value pairs in the attention computation.
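As a concrete illustration, below is a minimal PyTorch sketch of one MLA decoding step. All names and dimensions (d_model, d_latent, n_heads, d_head, w_dkv, w_uk, w_uv, w_q) are hypothetical, and the sketch ignores RoPE and the decoupled positional key that DeepSeek-V2/V3 add on top of this; it only shows the cache-the-latent, up-project-on-read idea.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64   # hypothetical sizes

    w_dkv = nn.Linear(d_model, d_latent, bias=False)            # down-projection (compression)
    w_uk  = nn.Linear(d_latent, n_heads * d_head, bias=False)   # up-projection to keys
    w_uv  = nn.Linear(d_latent, n_heads * d_head, bias=False)   # up-projection to values
    w_q   = nn.Linear(d_model, n_heads * d_head, bias=False)    # query projection

    # The cache stores only the compressed latent: d_latent = 128 numbers per token
    # instead of 2 * n_heads * d_head = 1024 for full keys and values.
    kv_cache = []

    def decode_step(h_t):
        """One decoding step for a single token representation h_t of shape (1, d_model)."""
        # Compress the new token into a small latent and cache it.
        kv_cache.append(w_dkv(h_t))
        c_kv = torch.cat(kv_cache, dim=0)                         # (t, d_latent)

        # Up-project the cached latents back to full keys and values.
        k = w_uk(c_kv).view(-1, n_heads, d_head).transpose(0, 1)  # (n_heads, t, d_head)
        v = w_uv(c_kv).view(-1, n_heads, d_head).transpose(0, 1)  # (n_heads, t, d_head)
        q = w_q(h_t).view(1, n_heads, d_head).transpose(0, 1)     # (n_heads, 1, d_head)

        # Standard scaled dot-product attention over the reconstructed K/V.
        scores = q @ k.transpose(-2, -1) / d_head ** 0.5          # (n_heads, 1, t)
        out = F.softmax(scores, dim=-1) @ v                       # (n_heads, 1, d_head)
        return out.transpose(0, 1).reshape(1, n_heads * d_head)

    h_t = torch.randn(1, d_model)
    print(decode_step(h_t).shape)   # torch.Size([1, 512])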
During training, the queries are also compressed to a latent space, in addition to the keys and values; the query compression mainly reduces activation memory.
During inference, only the compressed key/value latent is stored in the cache (queries are never cached), which is where the memory savings come from. A sketch of the query path follows below.
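For completeness, here is a small sketch of the low-rank query path, again with hypothetical names and dimensions. The query latent c_q is just an intermediate activation and is never written to the KV cache, so it helps during training without affecting cache size at inference.

    import torch
    import torch.nn as nn

    d_model, q_latent, n_heads, d_head = 1024, 256, 8, 64    # hypothetical sizes

    w_dq = nn.Linear(d_model, q_latent, bias=False)           # query down-projection
    w_uq = nn.Linear(q_latent, n_heads * d_head, bias=False)  # query up-projection

    h = torch.randn(16, d_model)   # a batch of 16 token representations
    c_q = w_dq(h)                  # compressed query latent, shape (16, 256); never cached
    q = w_uq(c_q)                  # full queries, shape (16, 512), used in attention as usual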
Example models: DeepSeek-V2, DeepSeek-V3
