Since the transformer processes its input in parallel rather than serially, the position of each token in the input sequence must be encoded in some way. The positional encoding in the transformer model uses sinusoidal functions to create a unique encoding for each position.
In working through the article on Transformers, which follows the original paper “Attention is All You Need” by Vaswani et al., the following formulas are used to compute the PE tensor values:
$$ PE_{(pos,2i)}=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$ $$ PE_{(pos,2i+1)}=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$
Where $pos$ is the position in the sequence, $i$ indexes the dimension pairs of the positional encoding tensor (so $2i$ and $2i+1$ are its even and odd columns), and $d_{model}$ is the dimension of the model, which is also the length of the embedding tensor assigned to each position in the sequence.
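To make the formulas concrete, here is a small worked example (my own illustration, using an arbitrary position and the same $d_{model}=6$ that appears later in this article) that evaluates one sine/cosine pair by hand:
import math

d_model = 6   # small model dimension, matching the example later in this article
pos = 3       # an arbitrary position chosen for illustration

# i = 1, so the even column is 2i = 2 and the odd column is 2i + 1 = 3
pe_even = math.sin(pos / 10000 ** (2 / d_model))   # PE(3, 2) ≈ 0.1388
pe_odd = math.cos(pos / 10000 ** (2 / d_model))    # PE(3, 3) ≈ 0.9903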
Note below where these Positional Encoding tensors are applied to the Input and Output Embedding tensors:
Figure 1. Attention is All You Need
At first glance, it’s not apparent to me how the encoding values in the PE tensors can provide positional “intelligence” to the transformer model. What is going on here?
The intuition for the PE tensor values created with these formulas has several elements:
- Using different frequencies ranging from short to long wavelengths in these sinusoidal functions allows the model to capture positional information at various scales from near to far (a short numerical sketch of these wavelengths follows the list).
- These smooth repeating functions can scale easily to different sequence lengths.
- Each position has a unique encoding, aided by the alternating use of $\sin$ and $\cos$ in the even and odd dimensions.
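Here is the numerical sketch promised above (my own illustration, reusing the $d_{model}=6$ configuration from later in the article). It prints the wavelength, in positions, of each sine/cosine column pair, showing how the scales grow from near to far:
import math

d_model = 6  # same small model dimension used in this article

# Each pair of PE columns (2i, 2i+1) shares one frequency.
# The argument of the sinusoid is pos / 10000^(2i / d_model),
# so its wavelength in positions is 2*pi * 10000^(2i / d_model).
for i in range(d_model // 2):
    wavelength = 2 * math.pi * 10000 ** (2 * i / d_model)
    print(f"columns {2 * i} and {2 * i + 1}: wavelength = {wavelength:.1f} positions")

# columns 0 and 1: wavelength = 6.3 positions
# columns 2 and 3: wavelength = 135.4 positions
# columns 4 and 5: wavelength = 2916.4 positions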
It is always easier for me to understand things with a visual. Using a scaled-down transformer might help build a better intuition of what this looks like. Let’s use the above formulas and PyTorch to create the PE tensor. We will use a maximum sequence length of twenty and a $d_{model}$ dimension of six:
import math
import torch

# model config
seq_len = 20  # maximum tokenized sequence length
d_model = 6   # dimension of the model and the embeddings (meanings)

# Positional encoding.
pe = torch.zeros(seq_len, d_model)
position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)

# Division term for the positional encoding formula: 1 / 10000^(2i / d_model)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

# Apply sine to even indices in PE
pe[:, 0::2] = torch.sin(position * div_term)

# Apply cosine to odd indices in PE
pe[:, 1::2] = torch.cos(position * div_term)
The shape of the PE tensor will be $[20, 6]$, or $[seq\_len, d_{model}]$; the embedding dimension is always the same as the model dimension. This PE tensor will be added to the input and output sequence embedding tensors during model processing (see Figure 1).

Figure 2. PE Tensor
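As a quick sanity check (my own addition, assuming the pe, seq_len, and d_model defined above), we can confirm that the vectorized construction with div_term matches the closed-form formula evaluated entry by entry:
import math
import torch

# Evaluate the formula directly for every (pos, i) pair and compare.
expected = torch.zeros(seq_len, d_model)
for pos in range(seq_len):
    for i in range(d_model // 2):
        angle = pos / 10000 ** (2 * i / d_model)
        expected[pos, 2 * i] = math.sin(angle)
        expected[pos, 2 * i + 1] = math.cos(angle)

print(torch.allclose(pe, expected, atol=1e-5))  # True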
Embeddings are created based on the dictionary of tokens. For each token in the training input data, an embedding tensor of length $d_{model}$ is learned during training. These embedding tensors encode the “meaning” of each token, as understood by the model from the text it has been trained on.
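To see where the PE tensor meets those embeddings, here is a minimal sketch of the addition step (the vocab_size, token values, and variable names are assumptions for illustration; the paper also scales the embeddings by $\sqrt{d_{model}}$ before the addition):
import math
import torch
import torch.nn as nn

vocab_size = 100                               # toy vocabulary size (assumption)
embedding = nn.Embedding(vocab_size, d_model)  # learned token embeddings

# A toy batch of token ids with shape [batch, seq_len]
tokens = torch.randint(0, vocab_size, (1, seq_len))

# Look up the learned "meaning" vectors, scale them as in the paper,
# then add the positional encoding for each position in the sequence.
x = embedding(tokens) * math.sqrt(d_model) + pe  # shape [1, seq_len, d_model]
print(x.shape)                                   # torch.Size([1, 20, 6])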
To see what these positional encoding values look like, for each column in the above PE tensor, I’ve plotted the values:

Figure 3. Graphing Positional Encoding
As you can see, the first two columns are the basic $\sin$ and $\cos$ functions. In later columns, the same functions are stretched to longer wavelengths, so we can see how adding these values to the embeddings imbues them with some sense of relative position, from near to far, between the tokens being evaluated.
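If you’d like to reproduce a plot like Figure 3 yourself, a minimal matplotlib sketch (assuming the pe tensor built above) might look like this:
import matplotlib.pyplot as plt

# Plot each column (embedding dimension) of the PE tensor against position.
for dim in range(d_model):
    plt.plot(range(seq_len), pe[:, dim].numpy(), label=f"dim {dim}")

plt.xlabel("position in sequence")
plt.ylabel("encoding value")
plt.legend()
plt.show()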
With some imagination, it’s possible to see how other encodings could be used to represent positional offsets. This would be a fun experiment to try. I’ve put the code from the article here in case you want to play with it.