Why do we use Projection in QKV?

#llm #interview

Projection in QKV creates separate embedding spaces for the different roles a token embedding plays in attention. It mainly separates "what we are looking for" (the query, matched against the keys) from "what we mean" (the value, the content that gets passed on).

If we used the embedding itself to compute the similarity score, training would be unstable and hard to converge, because a single embedding would then have to serve two different roles: finding (matching against other tokens) and carrying contextualized meaning.
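One concrete way to see the difference is a minimal NumPy sketch, not taken from any particular model. The sizes, the random weights, and the names `W_q`, `W_k`, `W_v` are illustrative assumptions. It compares scores computed from the raw embeddings with scores computed after separate query and key projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8          # illustrative sizes (assumption)
X = rng.normal(size=(n_tokens, d_model))     # stand-in token embeddings

# No projection: the same vector acts as query, key and value.
# X @ X.T is always symmetric, so token i scores token j exactly
# as token j scores token i.
raw_scores = X @ X.T / np.sqrt(d_model)

# With projections: random matrices stand in for learned weights (assumption).
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

proj_scores = Q @ K.T / np.sqrt(d_head)      # "finding": query vs. key
attn = softmax(proj_scores, axis=-1)         # each row sums to 1
output = attn @ V                            # "meaning": mix of values

raw_is_symmetric = np.allclose(raw_scores, raw_scores.T)
proj_is_symmetric = np.allclose(proj_scores, proj_scores.T)
print("raw scores symmetric:", raw_is_symmetric)
print("projected scores symmetric:", proj_is_symmetric)
```

With raw embeddings the score matrix is forced to be symmetric, which is one visible symptom of a single vector doing both jobs. Separate projections remove that constraint and let the matching space and the content space be learned independently.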

References

Related Notes

Why Trigonometric Function for Positional Encoding?
Adjusted R-squared Value
Beam Search
Plots Compared
Additive Attention