A simple implementation of a sinusoidal positional encoding for transformer neural networks

The transformer has revolutionized the field of machine learning, particularly natural language processing. Unlike LSTMs, transformers such as GPT process all elements of a sequence in parallel. To preserve the order of the samples within the sequence, each sample is therefore usually combined with a positional encoding.

In their famous paper “Attention Is All You Need”, Vaswani et al. (2017) introduced the following fixed, sinusoidal positional encoding:

\[E_{p,2i}=\sin\left(\frac{p}{\sqrt[d]{10000^{2i}}}\right)\] \[E_{p,2i+1}=\cos\left(\frac{p}{\sqrt[d]{10000^{2i}}}\right)\]

where \(p\) denotes the position of the sample within the sequence, \(i\) indexes the feature, and \(d\) is the number of features per sample. The pattern generated by this encoding function resembles a continuous counting sequence.
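To make the indexing concrete, here is the encoding written out for a toy example (not from the paper) with \(d=4\) features, i.e. \(i\in\{0,1\}\), at position \(p=1\):

\[E_{1,0}=\sin\left(\frac{1}{\sqrt[4]{10000^{0}}}\right)=\sin(1)\qquad E_{1,1}=\cos(1)\] \[E_{1,2}=\sin\left(\frac{1}{\sqrt[4]{10000^{2}}}\right)=\sin\left(\frac{1}{100}\right)\qquad E_{1,3}=\cos\left(\frac{1}{100}\right)\]

The even features hold the sine terms, the odd features the cosine terms, and the oscillation becomes slower as \(i\) grows.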

Sinusoidal positional encoding
The sinusoidal positional encoding for a 128-dimensional sequence of length 100.

How to implement a positional encoding using PyTorch

I implemented this positional encoding as a PyTorch module. It expects tensors of shape \((\mathrm{sample}, \mathrm{length}, \mathrm{features})\) and, when forwarding an input, automatically adjusts the encoding to the input's sequence length, feature dimension, and device (CPU/GPU).

import torch


class SinusoidalPosEncoding(torch.nn.Module):

	def __init__(self):
		super().__init__()
		self.l = None        # sequence length
		self.k = None        # number of features
		self.encoding = None # cached (l, k) encoding table

	def forward(self, x):
		# x has shape (sample, length, features); an even number of features is assumed
		if self.encoding is None \
		or x.shape[1] != self.l \
		or x.shape[2] != self.k \
		or x.device != self.encoding.device:
			self.l = x.shape[1]
			self.k = x.shape[2]
			self.encoding = torch.zeros((self.l, self.k), device=x.device)

			# frequencies 1 / 10000^(2i / d) for the even feature indices 2i
			t = 1 / 10000**(torch.arange(0, self.k, 2, device=x.device) / self.k)
			# positions 0, 1, ..., l - 1
			p = torch.arange(self.l, device=x.device, dtype=torch.float32)
			# v[p, i] = p / 10000^(2i / d)
			v = torch.outer(p, t)

			self.encoding[:, 0::2] = v.sin()
			self.encoding[:, 1::2] = v.cos()

		return x + self.encoding
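
A minimal usage sketch (the batch size, sequence length, and feature count below are arbitrary):

	x = torch.randn(32, 100, 128)  # (sample, length, features)
	pos_enc = SinusoidalPosEncoding()
	y = pos_enc(x)                 # same shape, with the positional encoding added
	print(y.shape)                 # torch.Size([32, 100, 128])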

Why is the positional encoding added and not concatenated to the input encoding?

Intuitively, adding as opposed to concatenating the positional and input encodings could lead to a loss of information since subsequent layers may not be able to separate the relevant embedding and positional features from the sum.

There are two reasons why addition works:

  1. The input embedding is learned, while the positional encoding is fixed. Thus, the input embedding can adjust (e.g. by increasing its values significantly) to force its influence on subsequent layers if this improves the model’s performance.
  2. As can be seen in the above figure, the features on the right side of the positional encoding barely change with position. Thus, one may expect these features to be used primarily to carry the input embedding, while the features on the left mainly preserve positional information (this can be checked numerically; see the sketch after this list).
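
As a rough numerical check of the second point, the sketch below builds the encoding table with the module from above (the sequence length of 100 and the 128 features are arbitrary) and compares how strongly the first and the last feature vary across positions:

	enc = SinusoidalPosEncoding()
	enc(torch.zeros(1, 100, 128))              # builds the (100, 128) encoding table
	per_feature_std = enc.encoding.std(dim=0)  # variation of each feature across positions
	print(per_feature_std[0])                  # left-most feature: oscillates strongly
	print(per_feature_std[-1])                 # right-most feature: nearly constant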

Positional encoding plus input embedding
The learned input embedding adjusts to the fixed positional encoding to preserve the relevant information of both.

Overall, addition is quite elegant as it allows the model to adjust how much positional and how much input information to pass forward. Moreover, unlike concatenation, addition keeps the feature dimension unchanged, which limits the size of the tensors and of the downstream weight matrices and, thus, the computational power needed to process them.
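
To put a rough number on that, the following hypothetical comparison assumes a model dimension of 128 and a first linear projection back to 128 features; it only counts the parameters of that single layer:

	d = 128
	add_proj = torch.nn.Linear(d, d)      # addition: the input stays d-dimensional
	cat_proj = torch.nn.Linear(2 * d, d)  # concatenation: the input grows to 2d
	print(sum(p.numel() for p in add_proj.parameters()))  # 16512
	print(sum(p.numel() for p in cat_proj.parameters()))  # 32896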