What is batch normalization

Ordering of batch normalization and dropout?

The original question was specifically about TensorFlow implementations. However, the answers relate to implementations in general. This general answer is also the correct answer for TensorFlow.

When using batch normalization and dropout in TensorFlow (specifically with contrib.layers), do I need to worry about the ordering?

It seems possible that there could be trouble if I use dropout followed immediately by batch normalization. For example, if the shift in the batch normalization is trained on the larger-scale numbers of the training outputs, but that same shift is then applied to the smaller-scale numbers (smaller because of the compensation for having more active outputs) without dropout during testing, then the shift may be off. Does the TensorFlow batch normalization layer automatically compensate for this? Or does this not happen for some reason I'm missing?

Are there any other pitfalls to watch out for when using these two together? For example, assuming I am using them in the correct order with regard to the above (assuming there is a correct order), could there be trouble with using both batch normalization and dropout on several successive layers? I don't immediately see a problem with this, but I may be missing something.

Many Thanks!


An experimental test seems to suggest that the ordering does matter. I ran the same network twice with only the batch norm and dropout swapped. When the dropout comes before the batch norm, the validation loss appears to go up while the training loss goes down. In the other case they both go down. But in my case the movements are slow, so things may change after more training, and it's just a single test. A more definitive and informed answer would still be welcome.


In Ioffe and Szegedy 2015, the authors state that "we would like to ensure that for any parameter values, the network always produces activations with the desired distribution". The batch normalization layer is therefore inserted directly after a conv layer / fully connected layer, but before the output is fed into the ReLU (or any other kind of) activation. For more information, see this video at around 53 minutes.

As for dropout, I believe that dropout is applied after the activation layer. In Figure 3b of the dropout paper, the dropout factor / probability matrix r(l) for hidden layer l is applied to y(l), where y(l) is the result after applying the activation function f.

In summary, the order for using batch normalization and dropout is:

-> CONV / FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV / FC ->
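To make the ordering concrete, here is a minimal sketch in plain Python rather than TensorFlow. It is an illustration under stated assumptions, not the contrib.layers implementation: the function names are hypothetical, batch normalization uses training-mode batch statistics with no learned scale or shift, and dropout is the "inverted" variant that rescales surviving values during training.

```python
import random

def batch_norm(column, eps=1e-5):
    """Normalize one feature across the batch (batch statistics only,
    no learned scale/shift; an assumption made for brevity)."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    return [(v - mean) / (var + eps) ** 0.5 for v in column]

def relu(column):
    return [max(0.0, v) for v in column]

def dropout(column, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate` and
    rescale the survivors by 1 / (1 - rate). Identity at test time."""
    if not training:
        return list(column)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in column]

def block(column, rate, rng, training=True):
    # CONV / FC output -> BatchNorm -> ReLU -> Dropout
    return dropout(relu(batch_norm(column)), rate, rng, training)

rng = random.Random(0)
# Stand-in for the output of one FC unit over a batch of 8 examples.
fc_output = [rng.gauss(3.0, 2.0) for _ in range(8)]
normalized = batch_norm(fc_output)
out = block(fc_output, rate=0.5, rng=rng)
print(out)
```

Swapping the calls inside `block` is enough to try the other orderings discussed in this thread.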

As mentioned in the comments, an amazing resource to read up on the order of layers is here. I have gone through the comments, and it is the best resource on the subject that I have found on the internet.

My 2 cents:

Dropout is meant to completely block information from certain neurons so that the neurons do not co-adapt. Batch normalization therefore has to come after dropout; otherwise you are passing information through the normalization statistics.

If you think about it, this is the same reason why, in typical ML problems, we don't compute the mean and standard deviation over the entire data set and then split it into train, test, and validation sets. Instead we split first, compute the statistics on the train set, and use them to normalize and center the validation and test data sets.
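The train/test analogy can be sketched as follows. This is only an illustration with made-up numbers and hypothetical helper names: the statistics are fitted on the train split and then reused unchanged on the held-out split.

```python
def fit_standardizer(train):
    """Compute mean and (population) standard deviation on the train split only."""
    mean = sum(train) / len(train)
    var = sum((v - mean) ** 2 for v in train) / len(train)
    return mean, var ** 0.5

def standardize(values, mean, std):
    """Center and scale values with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]

# Toy data, split BEFORE any statistics are computed.
data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
train, test = data[:4], data[4:]

mean, std = fit_standardizer(train)       # sees the train split only
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)     # reuses the train statistics
print(mean, std, test_z)
```

Computing `mean` and `std` on `data` instead of `train` would leak information about the test examples into the preprocessing, which is the kind of leak this answer argues dropout is supposed to prevent.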

So I suggest Scheme 1 (this takes Pseudomarvin's comment on the accepted answer into account):

-> CONV / FC -> ReLU (or other activation) -> Dropout -> BatchNorm -> CONV / FC ->

in contrast to Scheme 2 in the accepted answer:

-> CONV / FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV / FC ->

Please note that this means the network under Scheme 2 should show overfitting compared to the network under Scheme 1, but the OP ran some tests, as mentioned in the question, and they support Scheme 2.

Usually, just drop the Dropout (when you have BN):

  • "BN eliminates the need for Dropout in some cases, because BN intuitively provides regularization benefits similar to Dropout."
  • "Architectures like ResNet, DenseNet, etc. do not use Dropout."

For more details, see the paper [Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift], as already mentioned in the comments by @Haramoz.

I found a paper that explains the disharmony between Dropout and Batch Norm (BN). The key idea is what the authors call "variance shift". It arises because dropout behaves differently in the training and testing phases, which shifts the input statistics that BN learns. The main idea can be found in this figure taken from the paper.
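The effect can be reproduced with a few lines of plain Python. This is a rough sketch, not the paper's experiment: it assumes zero-mean, unit-variance activations and "inverted" dropout (survivors rescaled by 1 / (1 - rate) during training, identity at test time), and the helper names are hypothetical.

```python
import random

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def dropout(values, rate, rng, training):
    """Inverted dropout: active only in training mode."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
# Assumed input: zero-mean, unit-variance activations.
activations = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

train_var = variance(dropout(activations, 0.5, rng, training=True))
test_var = variance(dropout(activations, 0.5, rng, training=False))
ratio = train_var / test_var

# A BN layer placed right after dropout would accumulate statistics close to
# train_var during training, but would receive inputs with test_var at test time.
print(f"train variance: {train_var:.3f}")
print(f"test variance:  {test_var:.3f}")
print(f"ratio:          {ratio:.3f}")
```

For zero-mean inputs the training-mode variance is larger by a factor of about 1 / (1 - rate), so with a rate of 0.5 the ratio comes out near 2.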

A small demo of this effect can be found in this notebook.

Based on the research paper, for better performance we should use BN before applying Dropout.

The correct order is: Conv > Normalization > Activation > Dropout > Pooling

Conv - Activation - DropOut - BatchNorm - Pool -> Test_loss: 0.04261355847120285

Conv - Activation - DropOut - Pool - BatchNorm -> Test_loss: 0.050065308809280396

Conv - Activation - BatchNorm - Pool - DropOut -> Test_loss: 0.04911309853196144

Conv - Activation - BatchNorm - DropOut - Pool -> Test_loss: 0.06809622049331665

Conv - BatchNorm - Activation - DropOut - Pool -> Test_loss: 0.038886815309524536

Conv - BatchNorm - Activation - Pool - DropOut -> Test_loss: 0.04126095026731491

Conv - BatchNorm - DropOut - Activation - Pool -> Test_loss: 0.05142546817660332

Conv - DropOut - Activation - BatchNorm - Pool -> Test_loss: 0.04827788099646568

Conv - DropOut - Activation - Pool - BatchNorm -> Test_loss: 0.04722036048769951

Conv - DropOut - BatchNorm - Activation - Pool -> Test_loss: 0.03238215297460556

Trained on the MNIST data set (20 epochs) with 2 convolution modules (see below), each followed by [...]

The convolution layers have a kernel size of [...], default padding, and the activation is [...]. The pooling is a MaxPooling with pool size [...]. The loss is [...] and the optimizer is [...].

The corresponding dropout probability is [...] or [...], respectively. The number of feature maps is [...] or [...], respectively.

Edit: When I dropped the Dropout, as recommended in some answers, the network converged faster but generalized worse than when I used both BatchNorm and Dropout.

Conv / FC - BN - Sigmoid / Tanh - Dropout. If the activation function is ReLU or some other function, the order of normalization and dropout depends on your task.

I've read the papers recommended in the answer and comments at https://stackoverflow.com/a/40295999/8625228

From the point of view of Ioffe and Szegedy (2015), only BN should be used in the network structure. Li et al. (2018) give statistical and experimental analyses showing that there is a variance shift when practitioners use Dropout before BN. Thus, Li et al. (2018) recommend applying Dropout after all BN layers.

From the point of view of Ioffe and Szegedy (2015), BN is located inside / before the activation function. However, Chen et al. (2019) use an IC layer that combines Dropout and BN, and they recommend using BN after ReLU.

To be on the safe side, I use only Dropout or only BN in a network.

Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. "Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks." CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928.

Ioffe, Sergey, and Christian Szegedy. 2015. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.

Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. "Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift." CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.
