Batch Normalization
- Proposed in the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (Ioffe & Szegedy, 2015)
In a batch normalization layer, normalization is performed along the batch dimension: for each feature, the mean and variance are computed across all instances in the batch. Put simply, to get the mean of one feature we sum its batch_size values and divide by batch_size, as in the small sketch below.
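A minimal sketch of the per-feature mean over the batch dimension (the tensor values and shapes here are illustrative assumptions):

import torch

features = torch.tensor([[1.0, 4.0],
                         [3.0, 8.0]])   # [batch_size=2, hidden_dim=2]
mean = features.mean(dim=0)             # per-feature mean over the batch: (1+3)/2 and (4+8)/2
print(mean)                             # tensor([2., 6.])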
Steps:
- For each batch, the per-feature mean and variance are calculated
- These mean and variance values are then used to standardize the features
- Batch normalization then applies two learnable parameters to scale (multiply by gamma) and shift (add beta) the standardized features
- It also maintains a moving average of the mean and variance, which is used to normalize at test (inference) time (see the sketch after this list)
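A minimal sketch of these steps for a training-time forward pass. The names gamma, beta, running_mean, running_var, momentum, and eps follow common convention and are assumptions here, not definitions from the paper:

import torch

def batch_norm_train(x, gamma, beta, running_mean, running_var,
                     momentum=0.1, eps=1e-5):
    # Step 1: per-feature mean and (biased) variance over the batch dimension
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    # Step 2: standardize the features
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # Step 3: scale by gamma and shift by beta (both learnable)
    out = gamma * x_hat + beta
    # Step 4: update the running averages used later at test time
    new_running_mean = (1 - momentum) * running_mean + momentum * mean
    new_running_var = (1 - momentum) * running_var + momentum * var
    return out, new_running_mean, new_running_var

hidden_dim = 4
x = torch.randn(8, hidden_dim)                        # [batch_size=8, hidden_dim=4]
gamma, beta = torch.ones(hidden_dim), torch.zeros(hidden_dim)
out, rm, rv = batch_norm_train(x, gamma, beta,
                               torch.zeros(hidden_dim), torch.ones(hidden_dim))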
Pros:
- Reduces the effect of Internal Covariate Shift
- Works well with large batch sizes
- Works well with CNN models
Cons:
- Works poorly when the batch size varies
- Works poorly with small batch sizes, since the batch statistics become noisy
- Works poorly with sequence models such as RNNs, LSTMs, GRUs, and attention-based Transformers
Code
import torch

BATCH_SIZE = 2
HIDDEN_DIM = 100
features = torch.randn((BATCH_SIZE, HIDDEN_DIM))          # [BATCH_SIZE, HIDDEN_DIM]
mean = features.mean(dim=0, keepdim=True)                 # [1, HIDDEN_DIM]
var = features.var(dim=0, keepdim=True, unbiased=False)   # [1, HIDDEN_DIM]
features = (features - mean) / torch.sqrt(var + 1e-5)     # [BATCH_SIZE, HIDDEN_DIM]
The moving averages and the learnable gamma/beta parameters are intentionally omitted to keep the example simple
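As a rough sanity check, the manual computation above should closely match PyTorch's built-in torch.nn.BatchNorm1d in training mode, whose weight (gamma) and bias (beta) default to 1 and 0; the tolerance below is an arbitrary choice:

x = torch.randn((BATCH_SIZE, HIDDEN_DIM))
manual = (x - x.mean(dim=0, keepdim=True)) / torch.sqrt(x.var(dim=0, keepdim=True, unbiased=False) + 1e-5)
bn = torch.nn.BatchNorm1d(HIDDEN_DIM, eps=1e-5)   # weight (gamma) starts at 1, bias (beta) at 0
bn.train()
print(torch.allclose(bn(x), manual, atol=1e-5))   # expected: True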
References
- https://medium.com/@prudhviraju.srivatsavaya/layer-normalisation-and-batch-normalisation-c315ffe9a84b
- https://towardsdatascience.com/batch-norm-explained-visually-how-it-works-and-why-neural-networks-need-it-b18919692739
- https://www.linkedin.com/pulse/understanding-batch-normalization-layer-group-implementing-pasha-s/