
Positional Encoding for Transformers

In the encoding layer of a transformer (that is, when tokens are converted to embeddings that are fed into the transformer network), the position of each token in the sequence is added to its embedding.
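
A minimal sketch of that addition step, with made-up sizes and random placeholder values (how the positional encoding values are actually computed is covered below):

```python
import numpy as np

# Toy sizes, chosen only for illustration.
seq_len, d_model = 4, 8

# Output of the embedding lookup: one d_model-dimensional vector per token.
token_embeddings = np.random.randn(seq_len, d_model)

# One positional-encoding vector per position in the sequence
# (placeholder values here; see below for how they are computed).
positional_encoding = np.random.randn(seq_len, d_model)

# Position information is injected by simple elementwise addition, so the
# result keeps the same shape and can be fed straight into the first layer.
encoder_input = token_embeddings + positional_encoding
print(encoder_input.shape)  # (4, 8)
```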

Why does a transformer need positional encoding?

This is done because the transformer doesn't itself track a sequence: it receives all of the tokens at once. But important information is lost if the order of the tokens is lost, because order matters in language.

How?

After creating the embedding, a positional encoding function is applied. Each dimension of the embedding uses a different function (a sinusoid of a different frequency), so the combination of values guarantees a unique positional encoding for each position in the sequence. The frequencies are also chosen so that most sequences seen in training fit within one cycle of the slowest sine/cosine wave, so the encoding doesn't repeat.
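
A short NumPy sketch of the sinusoidal scheme from the original Transformer paper, "Attention Is All You Need" (the sizes in the example call are arbitrary, and d_model is assumed to be even for simplicity):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding (assumes d_model is even).

    Even embedding dimensions use sine, odd dimensions use cosine, and each
    pair of dimensions gets its own frequency. The wavelengths grow
    geometrically from 2*pi up to 10000*2*pi, so the combination of values
    across all dimensions is unique for every position in that range.
    """
    positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]       # (1, d_model/2)
    angle_rates = 1.0 / np.power(10000, dims / d_model)  # one frequency per dimension pair

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * angle_rates)        # sine on even dimensions
    pe[:, 1::2] = np.cos(positions * angle_rates)        # cosine on odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```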

But why use sine/cosine functions and not just y = x, y = 2x, etc.?