A simple implementation of a sinusoidal positional encoding for transformer neural networks
The transformer has revolutionized the field of machine learning, particularly natural language processing. Unlike LSTMs, transformers such as GPT process all samples of a sequence concurrently. Since this parallel processing discards the order of the samples, each sample is usually combined with a positional encoding that preserves its position within the sequence.
In their famous paper “Attention Is All You Need”, Vaswani et al. (2017) introduced the following fixed, sinusoidal positional encoding:
\[E_{p,2i}=\sin\left(\frac{p}{\sqrt[d]{10000^{2i}}}\right)\] \[E_{p,2i+1}=\cos\left(\frac{p}{\sqrt[d]{10000^{2i}}}\right)\]
where \(p\) denotes the position of a sample within the sequence, \(i\) indexes the sine/cosine feature pairs, and \(d\) is the number of features per sample. The pattern generated by this encoding function resembles a continuous counting sequence.
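Since \(\sqrt[d]{10000^{2i}}\) is simply \(10000^{2i/d}\), the encoding of a single position can be evaluated directly. The following sketch (with an arbitrarily chosen feature dimension of 8) computes the formula for the first few positions; note how the high feature indices barely move away from 0 and 1:

import math

d = 8  # number of features per sample (arbitrary choice for illustration)

def encoding_value(p, j):
    """E_{p,j} for position p and feature index j, following the formula above."""
    i = j // 2                         # each i defines one sin/cos pair of features
    angle = p / 10000 ** (2 * i / d)   # d-th root of 10000^(2i) equals 10000^(2i/d)
    return math.sin(angle) if j % 2 == 0 else math.cos(angle)

# position 0 yields alternating 0 (sin) and 1 (cos); later positions slowly rotate
for p in range(3):
    print([round(encoding_value(p, j), 3) for j in range(d)])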
How to implement a positional encoding using PyTorch
I implemented this positional encoding using PyTorch. When forwarding an input, the module automatically adapts to the input’s sequence length, feature count, and device (CPU/GPU). It expects tensors of shape \((\mathrm{sample}, \mathrm{length}, \mathrm{features})\).
import torch


class SinusoidalPosEncoding(torch.nn.Module):
    """Adds the fixed sinusoidal positional encoding of Vaswani et al. (2017) to its input."""

    def __init__(self):
        super().__init__()
        self.l = None         # sequence length of the cached encoding
        self.k = None         # number of features of the cached encoding
        self.encoding = None  # cached encoding of shape (length, features)

    def forward(self, x):
        # x has shape (sample, length, features); the feature count is assumed to be even
        # rebuild the encoding whenever the sequence length, feature count, or device changes
        if self.encoding is None \
                or x.shape[1] != self.l \
                or x.shape[2] != self.k \
                or x.device != self.encoding.device:
            self.l = x.shape[1]
            self.k = x.shape[2]
            self.encoding = torch.zeros((self.l, self.k), device=x.device)
            # frequencies 1 / 10000^(2i/k) for i = 0, 1, ..., k/2 - 1
            t = 1 / 10000 ** (torch.arange(0, self.k, 2, device=x.device) / self.k)
            pos = torch.arange(self.l, device=x.device, dtype=torch.float32)
            v = torch.outer(pos, t)           # angles pos / 10000^(2i/k)
            self.encoding[:, 0::2] = v.sin()  # even feature indices
            self.encoding[:, 1::2] = v.cos()  # odd feature indices
        return x + self.encoding  # broadcasts over the sample dimension
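Continuing from the class above, a minimal usage sketch (with arbitrarily chosen batch, length, and feature sizes) shows that the output keeps the input’s shape and that the cached encoding is rebuilt when the sequence length changes:

pos_enc = SinusoidalPosEncoding()
x = torch.randn(4, 16, 32)        # (sample, length, features), values are arbitrary
y = pos_enc(x)

print(y.shape)                    # torch.Size([4, 16, 32]) -- same shape as the input
print(pos_enc.encoding.shape)     # torch.Size([16, 32])    -- cached encoding

# a different sequence length triggers recomputation of the cached encoding
_ = pos_enc(torch.randn(4, 24, 32))
print(pos_enc.encoding.shape)     # torch.Size([24, 32])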
Why is the positional encoding added and not concatenated to the input encoding?
Intuitively, adding the positional encoding to the input embedding, as opposed to concatenating the two, could lead to a loss of information: subsequent layers may not be able to separate the relevant embedding and positional features from their sum.
There are two reasons why addition works:
- The input embedding is learned, while the positional encoding is fixed. Thus, the input embedding can adjust (e.g. by increasing its values significantly) to force its influence on subsequent layers if this improves the model’s performance.
- For large feature indices \(i\), the frequency \(1/10000^{2i/d}\) becomes very small, so these features barely change with position. Thus, one may expect the model to use them primarily to carry the input embedding, while the rapidly oscillating low-index features mainly preserve positional information.
Overall, addition is quite elegant as it allows the model to adjust how much positional and input information to pass forward. Moreover, addition as opposed to concatenation also limits the size of the tensors and, thus, the computational power needed to process them.
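To make the size argument concrete, the following sketch (with arbitrary dimensions) compares the shapes produced by adding the encoding with those produced by concatenating it along the feature axis:

x = torch.randn(4, 16, 32)             # (sample, length, features)
enc = torch.zeros(16, 32)              # positional encoding for one sequence

added = x + enc                        # broadcasts over the sample dimension
concatenated = torch.cat((x, enc.expand(4, -1, -1)), dim=-1)

print(added.shape)                     # torch.Size([4, 16, 32])
print(concatenated.shape)              # torch.Size([4, 16, 64])

Every layer after a concatenation would have to process (and learn weights for) twice as many features, whereas addition leaves the model width untouched.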