Positional Encoding in the Transformer

While the attention mechanism lets the model attend to all the information at the same time, it loses the order of the text. Order matters: "The dog chased the cat" and "The cat chased the dog" do not mean the same thing.

To give the model a sense of order, the authors of the Transformer paper add a positional encoding to the word embeddings.

There are two kinds of positional encoding:

  1. Fixed encoding: In the Transformer paper, the authors use sine and cosine functions for the positional encoding (see the sketch below this list). Learn more from Why Trigonometric Function for Positional Encoding?
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
  2. Learned embedding: Like the word embedding, the model learns a representation for each position, so there is an extra embedding layer for positions (see the second sketch further below).
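
Below is a minimal NumPy sketch of the fixed sine/cosine encoding from the formula above. The function name `sinusoidal_positional_encoding` and the parameters `max_len` and `d_model` are illustrative choices for this example, not names from the paper.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Fixed sine/cosine positional encoding (assumes d_model is even)."""
    positions = np.arange(max_len)[:, np.newaxis]            # shape (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # even dims 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2i/d)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even indices get sin
    pe[:, 1::2] = np.cos(angles)  # odd indices get cos
    return pe

# Example with assumed values: 50 positions, embedding size 128
print(sinusoidal_positional_encoding(50, 128).shape)  # (50, 128)
```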

With the positional encoding added to the word embedding, the model can look at all tokens simultaneously while their order is baked into the representation.
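
As a rough illustration of the learned variant and of adding the positional signal to the word embedding, here is a small PyTorch sketch. The class name `LearnedPositionalEmbedding` and its parameters are assumptions for the example, not part of the original note.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Learned positions: an extra embedding table looked up by position index,
    summed with the word embedding (a sketch, not the paper's exact setup)."""

    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)  # word embedding
        self.pos = nn.Embedding(max_len, d_model)     # position embedding

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # word embedding + position embedding, broadcast over the batch
        return self.tok(token_ids) + self.pos(positions)

# Example with assumed sizes: batch of 2 sequences, 10 tokens each
emb = LearnedPositionalEmbedding(vocab_size=1000, max_len=512, d_model=64)
x = torch.randint(0, 1000, (2, 10))
print(emb(x).shape)  # torch.Size([2, 10, 64])
```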

