Multi-Query Attention
In Multi-Head Attention, every head has its own query, key, and value projection. So for 8-head attention, there are 8 sets of queries, 8 sets of keys, and 8 sets of values.
In Multi-Query Attention, by contrast, each head still has its own query, but all heads share a single key and a single value projection.
This was introduced to save inference time and VRAM: because keys and values are shared, the KV cache that must be stored during decoding shrinks by a factor of the number of heads. It significantly reduces VRAM usage, but at the cost of some quality degradation.
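Below is a minimal sketch of the idea in PyTorch, assuming a simplified self-attention layer without masking or dropout. The class and parameter names (MultiQueryAttention, q_proj, k_proj, v_proj) are illustrative, not from any particular library; the point is only the shape difference: queries are projected per head, while one key and one value projection is shared and broadcast across all heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Sketch of multi-query attention: per-head queries, one shared key/value."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        # Per-head query projections, packed into one matrix (same as MHA).
        self.q_proj = nn.Linear(d_model, d_model)
        # A single key and a single value projection of size head_dim (the MQA part).
        self.k_proj = nn.Linear(d_model, self.head_dim)
        self.v_proj = nn.Linear(d_model, self.head_dim)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape
        # Queries: (batch, num_heads, seq_len, head_dim)
        q = self.q_proj(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        # Shared key/value: (batch, 1, seq_len, head_dim), broadcast over the head dimension.
        k = self.k_proj(x).unsqueeze(1)
        v = self.v_proj(x).unsqueeze(1)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                       # (batch, num_heads, seq_len, head_dim)
        out = out.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.out_proj(out)

# Example usage: 8 heads, but only one key and one value projection is stored/cached.
x = torch.randn(2, 16, 512)
mqa = MultiQueryAttention(d_model=512, num_heads=8)
print(mqa(x).shape)  # torch.Size([2, 16, 512])
```

Note how k_proj and v_proj output only head_dim features instead of d_model, so during autoregressive decoding the cached keys and values are roughly num_heads times smaller than in standard multi-head attention.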
