Cross-Attention

#llm #interview

Cross-attention is typically used in encoder-decoder Transformers, where the decoder attends not only to the tokens it has already generated but also to the input sequence. The keys and values are computed from the encoder's output representations, and the queries are computed from the decoder's hidden states (during generation, the state of the most recently generated token).
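As a compact way to write this down, here is the standard scaled dot-product form for a single head. The symbols are notation assumed for this note, not taken from the reference: $H_{dec}$ for the decoder hidden states, $H_{enc}$ for the encoder outputs, $W_Q, W_K, W_V$ for learned projection matrices, and $d_k$ for the key dimension.

$$
Q = H_{dec} W_Q, \qquad K = H_{enc} W_K, \qquad V = H_{enc} W_V
$$

$$
\mathrm{CrossAttention}(H_{dec}, H_{enc}) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The only difference from self-attention is where $K$ and $V$ come from: in self-attention all three projections read from the same sequence.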

Intuitively, the decoder is asking: given the current token as the query, which positions in the encoder output should I attend to?
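The query/key/value flow described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not how any particular library implements it: the function names (`matmul`, `softmax`, `cross_attention`, `random_matrix`), the dimensions, and the random weights are all assumptions made for the example, and real models add multiple heads, masking, and an output projection.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative sketch).

    Queries come from the decoder; keys and values come from the encoder.
    """
    queries = matmul(decoder_states, w_q)   # (T_dec, d_k)
    keys = matmul(encoder_states, w_k)      # (T_enc, d_k)
    values = matmul(encoder_states, w_v)    # (T_enc, d_k)
    d_k = len(keys[0])
    keys_t = [list(col) for col in zip(*keys)]  # transpose to (d_k, T_enc)
    scores = matmul(queries, keys_t)            # (T_dec, T_enc)
    # One softmax per decoder position, over all encoder positions.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    context = matmul(weights, values)           # (T_dec, d_k)
    return context, weights


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a matrix of standard-normal samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_k = 8, 4                               # assumed toy sizes
encoder_states = random_matrix(5, d_model, rng)   # 5 input tokens
decoder_states = random_matrix(1, d_model, rng)   # the current decoder token
w_q = random_matrix(d_model, d_k, rng)
w_k = random_matrix(d_model, d_k, rng)
w_v = random_matrix(d_model, d_k, rng)

context, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(len(weights[0]), "attention weights, one per encoder token")
print(len(context[0]), "dimensional context vector for the current token")
```

Each row of `weights` is a distribution over the encoder tokens, which is the "which input positions should I attend to" answer for that decoder position.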

References

https://jalammar.github.io/illustrated-gpt2/

Related Notes

Gradient Descent
TF-IDF
Essential Visualizations
Co-occurrence based Word Embeddings
Skip Gram Model