Why do we use Projection in QKV?
Projecting the token embeddings into Q, K, and V gives each token a separate learned space for each role it plays in attention. The main distinction is between "what we are looking for" (queries matched against keys) and "what we mean" (the values that are actually passed along).
If we used the raw embeddings directly to compute the similarity scores, training would be unstable and slow to converge, because a single embedding would then have to serve two different roles at once: finding relevant tokens and carrying contextualized meaning.
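
One way to see the difference is to compare the two score matrices side by side. The sketch below is only an illustration: the sizes, the random matrices W_q, W_k, W_v (standing in for learned weights), and all variable names are assumptions, not something taken from a specific model. Without projections the scores are X Xᵀ, which is always symmetric, so token i must rate token j exactly as j rates i. With separate query and key projections that constraint disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8      # illustrative sizes (assumed)
X = rng.normal(size=(n_tokens, d_model))  # stand-in token embeddings

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# No projection: the same vector acts as query, key, and value.
raw_scores = X @ X.T / np.sqrt(d_model)
raw_is_symmetric = np.allclose(raw_scores, raw_scores.T)

# With projections: random matrices stand in for learned weights.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

proj_scores = Q @ K.T / np.sqrt(d_head)
proj_is_symmetric = np.allclose(proj_scores, proj_scores.T)

# Attention output: scores come from Q and K, content comes from V.
weights = softmax(proj_scores)
output = weights @ V

print("raw scores symmetric:      ", raw_is_symmetric)
print("projected scores symmetric:", proj_is_symmetric)
```

In the unprojected case the diagonal of the score matrix is each token's squared norm, so how strongly a token attends to itself is tied to the same vector that has to carry its meaning. Separate projections let the model adjust "how I search", "how I am found", and "what I contribute" independently.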