Batch normalisation works by first standardising a layer's inputs (zero mean, unit variance across the mini-batch), then applying a linear scale and shift whose coefficients are learned during training. This occurs between layers throughout the network.
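Concretely, the layer computes a per-feature mean and variance over the mini-batch, standardises with them, then applies the learned scale (gamma) and shift (beta). A minimal NumPy sketch of the forward pass (the names gamma, beta and eps are illustrative, not tied to any particular library):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: mini-batch of activations, shape (batch_size, n_features)
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # standardise
    return gamma * x_hat + beta               # learned linear scale and shift

x = np.random.randn(32, 4) * 5.0 + 3.0        # toy mini-batch, far from zero mean / unit variance
gamma, beta = np.ones(4), np.zeros(4)         # in practice these are learned by backprop
out = batchnorm_forward(x, gamma, beta)
print(out.mean(axis=0), out.std(axis=0))      # roughly 0 and 1 with gamma=1, beta=0
```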
Outcomes of this process:
each epoch takes longer because of the extra computation, but fewer epochs are required to reach a given accuracy.
Benefits:
Batch normalisation is applied at each layer, so there is no need for a separate normalisation step on the input data.
What about bias? We do not need a bias term in a layer followed by BN: subtracting the batch mean cancels any constant offset, and BN's learned shift (beta) plays the same role.
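A minimal PyTorch sketch of this, assuming a fully connected layer (layer sizes are illustrative):

```python
import torch.nn as nn

# The Linear layer's bias would be cancelled by BN's mean subtraction anyway,
# and BN's learned beta (its own shift parameter) supplies the offset instead.
block = nn.Sequential(
    nn.Linear(784, 256, bias=False),  # bias dropped: redundant before BN
    nn.BatchNorm1d(256),              # learns gamma (weight) and beta (bias)
    nn.ReLU(),
)
```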
Example:
Introducing BN into this model.
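The model itself is not reproduced here; as a stand-in, a small fully connected classifier with BN inserted after each hidden layer (PyTorch, layer sizes illustrative, not the exact model from the original example):

```python
import torch.nn as nn

# Stand-in model: BN inserted after each hidden Linear layer.
model = nn.Sequential(
    nn.Linear(784, 512, bias=False),
    nn.BatchNorm1d(512),
    nn.ReLU(),
    nn.Linear(512, 256, bias=False),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 10),  # output layer left without BN
)
```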
Do you put BN before or after an activation function? The authors of the original paper suggest before, i.e. BN is applied to the layer's pre-activation output.
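For concreteness, the two placements around a single hidden layer look like this (the first is the paper's suggestion; sizes illustrative):

```python
import torch.nn as nn

# BN before the activation (the paper's suggestion)
bn_before = nn.Sequential(nn.Linear(256, 128, bias=False), nn.BatchNorm1d(128), nn.ReLU())

# BN after the activation (a common alternative in practice)
bn_after = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.BatchNorm1d(128))
```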