Layer Normalization

Unlike Batch Normalization, Layer Normalization normalizes each instance rather than each batch: for a mini-batch of input features, the mean and variance are computed separately for each row (each example) and then used to normalize that row. Put simply, to get the mean you sum a row's hidden_dim elements and divide by hidden_dim.
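Concretely, for row i of the features (with H = hidden_dim elements), the standard per-row statistics and normalization are:

$$\mu_i = \frac{1}{H}\sum_{j=1}^{H} x_{ij}, \qquad \sigma_i^2 = \frac{1}{H}\sum_{j=1}^{H}\left(x_{ij} - \mu_i\right)^2, \qquad \hat{x}_{ij} = \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$$

where a small ε is added for numerical stability.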

Because the statistics depend only on the hidden dimension of each example, not on the batch as in Batch Normalization, no extra parameters (running mean and variance) need to be tracked for inference. At inference time, the mean and variance are simply computed from each test example's own features, exactly as during training.
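A minimal sketch of this (assuming torch.nn.LayerNorm and torch.nn.BatchNorm1d): LayerNorm registers no running-statistics buffers, so it behaves identically in train and eval mode, while BatchNorm has to carry running_mean and running_var around for inference.

import torch
import torch.nn as nn

ln = nn.LayerNorm(100)
bn = nn.BatchNorm1d(100)
x = torch.randn(2, 100)

ln.train()
out_train = ln(x)
ln.eval()
out_eval = ln(x)
print(torch.allclose(out_train, out_eval))       # True: no batch statistics are tracked

print([name for name, _ in ln.named_buffers()])  # [] -> LayerNorm keeps no running stats
print([name for name, _ in bn.named_buffers()])  # ['running_mean', 'running_var', 'num_batches_tracked']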

Code

import torch

BATCH_SIZE = 2
HIDDEN_DIM = 100
EPS = 1e-5  # small constant for numerical stability (matches nn.LayerNorm's default)

features = torch.randn((BATCH_SIZE, HIDDEN_DIM))         # [BATCH_SIZE, HIDDEN_DIM]
mean = features.mean(dim=1, keepdim=True)                # [BATCH_SIZE, 1] per-row mean
var = features.var(dim=1, keepdim=True, unbiased=False)  # [BATCH_SIZE, 1] per-row (biased) variance
normalized = (features - mean) / torch.sqrt(var + EPS)   # [BATCH_SIZE, HIDDEN_DIM]
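As a quick sanity check (a sketch continuing from the snippet above, assuming PyTorch's functional API with its default eps of 1e-5), the manual result should match the built-in layer norm with no affine transform:

import torch.nn.functional as F

# Should print True: F.layer_norm uses the same biased variance and eps=1e-5 by default
print(torch.allclose(normalized, F.layer_norm(features, (HIDDEN_DIM,)), atol=1e-6))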

Pros:

  1. As it does not depend on batch size, it works well for sequence models such as the Transformer, attention layers, RNNs, LSTMs, and GRUs (see the sketch after this list)
  2. Works well with small batch sizes
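A minimal sketch of point 1 (assuming a Transformer-style activation of shape [batch, seq_len, hidden]): LayerNorm normalizes every token vector over the hidden dimension independently, so neither batch size nor sequence length enters the statistics.

import torch
import torch.nn as nn

seq = torch.randn(2, 16, 100)   # [batch, seq_len, hidden]
ln = nn.LayerNorm(100)          # normalizes over the last (hidden) dimension only
out = ln(seq)                   # each of the 2*16 token vectors gets its own mean/variance
print(out.shape)                # torch.Size([2, 16, 100])

# Point 2: the same layer works even with batch size 1, where BatchNorm's
# batch statistics would be degenerate.
print(ln(torch.randn(1, 16, 100)).shape)  # torch.Size([1, 16, 100])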

Related Notes