Ordering of batch normalization and dropout?
The original question was specifically about TensorFlow implementations. However, the answers apply to implementations in general; this general answer is also the correct one for TensorFlow.
Do I have to worry about ordering when using batch normalization and dropout in TensorFlow (specifically when using contrib.layers)?
It seems possible that there could be trouble if I use dropout followed immediately by batch normalization. For example, if batch normalization trains its shift toward the larger-scale numbers of the training outputs, but then that same shift is applied to the smaller-scale numbers (smaller because of the compensation for having more active outputs) at test time, when dropout is off, then that shift may be wrong. Does the TensorFlow batch normalization layer automatically compensate for this? Or does this not happen for some reason I'm missing?
Also, are there other pitfalls to watch out for when using these two together? For example, assuming I use them in the correct order above (assuming there is a correct order), could there be trouble with using both batch normalization and dropout on multiple successive layers? I don't immediately see a problem with that, but I may be missing something.
An experimental test seems to suggest that ordering does matter. I ran the same network twice, with only the order of batch norm and dropout reversed. When dropout is before batch norm, validation loss seems to go up as training loss goes down. They both go down in the other case. But in my case the movements are slow, so things may change after more training, and it is only a single test. A more definitive and informed answer would still be appreciated.
In Ioffe and Szegedy (2015), the authors state that "we want to ensure that the network always generates activations with the desired distribution for all parameter values". The batch normalization layer is therefore inserted right after a conv/fully connected layer, but before feeding into the ReLU (or any other kind of) activation. See this video at around 53 minutes for more details.
As for dropout, I believe it is applied after the activation layer. In the dropout paper, Figure 3b, the dropout factor/probability matrix r(l) for hidden layer l is applied to y(l), where y(l) is the result after applying the activation function f.
In summary, the order for using batch normalization and dropout is:
-> CONV / FC -> BatchNorm -> ReLu (or other activation) -> Dropout -> CONV / FC ->
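The summary ordering above can be sketched as a minimal NumPy forward pass for one hidden block. This is an illustrative sketch, not a TensorFlow implementation: the FC layer is a plain matrix product, batch norm uses batch statistics and omits the learned gamma/beta parameters, and dropout is the standard "inverted" variant that scales at training time.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(x, eps=1e-5):
    # Training-mode batch norm: normalize each feature with the
    # statistics of the current batch (learned scale/shift omitted).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def dropout(x, rate, training):
    # Inverted dropout: scale kept units by 1/(1-rate) during training
    # so no rescaling is needed at test time.
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# One hidden block in the recommended order:
# FC -> BatchNorm -> ReLU -> Dropout
x = rng.normal(size=(64, 16))   # batch of 64 examples, 16 features
w = rng.normal(size=(16, 32))   # FC weights

h = x @ w                                 # FC
h = batch_norm(h)                         # BatchNorm (before the activation)
h = relu(h)                               # activation
h = dropout(h, rate=0.5, training=True)   # Dropout (after the activation)
```

At test time the same block is run with `training=False`, so dropout becomes the identity; a real implementation would also switch batch norm to its accumulated moving statistics.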
As mentioned in the comments, an amazing resource for reading up on the order of layers is here. I have gone through the comments, and it is the best resource on the topic I have found on the internet.
My 2 cents:
Dropout is meant to completely block information from certain neurons to make sure the neurons do not co-adapt. So batch normalization has to come after dropout; otherwise you are passing information through the normalization statistics.
If you think about it, in typical ML problems we do not compute the mean and standard deviation over the entire data and then split it into train, test, and validation sets. We split first, then compute the statistics on the train set and use them to normalize and center the validation and test sets.
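That split-then-normalize discipline can be sketched in a few lines of NumPy (illustrative only; the split sizes and data distribution are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))

# Split FIRST, then compute statistics on the training split only.
train, val, test = data[:600], data[600:800], data[800:]

mu = train.mean(axis=0)
sigma = train.std(axis=0)

# The train statistics are reused for the held-out sets --
# val and test never contribute to mu or sigma.
train_n = (train - mu) / sigma
val_n = (val - mu) / sigma
test_n = (test - mu) / sigma
```

The held-out sets end up approximately, but not exactly, zero-mean: they are centered with statistics estimated elsewhere, just as test-time inputs to a BN layer are normalized with statistics accumulated during training.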
So I suggest Scheme 1 (this takes pseudomarvin's comment on the accepted answer into consideration):
-> CONV / FC -> ReLu (or other activation) -> Dropout -> BatchNorm -> CONV / FC
in contrast to scheme 2
-> CONV / FC -> BatchNorm -> ReLu (or other activation) -> Dropout -> CONV / FC -> (from the accepted answer)
Please note that this means the network under Scheme 2 should show overfitting compared to the network under Scheme 1, but the OP ran some tests as mentioned in the question, and they support Scheme 2.
Usually, just drop the Dropout (when you have BN):
- "BN eliminates the need for Dropout in some cases, because intuitively BN provides regularization benefits similar to Dropout."
- "Architectures like ResNet, DenseNet, etc. do not use Dropout."
For more details, refer to the paper [Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift], as mentioned by @Haramoz in the comments.
I found a paper that explains the disharmony between Dropout and Batch Norm (BN). The key idea is what they call the "variance shift". This comes from the fact that dropout behaves differently between the training and testing phases, which shifts the input statistics that BN learns. The main idea can be found in this figure, taken from the paper.
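The variance shift can be demonstrated numerically. The sketch below (assuming standard inverted dropout with drop rate 0.5 and zero-mean inputs) compares the variance a downstream BN layer would see in training mode versus test mode: inverted dropout inflates the second moment by a factor of 1/keep, so statistics learned during training no longer match the test-time inputs.

```python
import numpy as np

rng = np.random.default_rng(2)
rate = 0.5          # drop probability p
keep = 1.0 - rate

x = rng.normal(size=1_000_000)   # activations entering dropout, variance ~1

# Training mode: inverted dropout scales the kept units by 1/keep.
mask = rng.random(x.size) >= rate
train_out = x * mask / keep

# Test mode: dropout is the identity.
test_out = x

# The statistics a downstream BN layer sees differ between the two modes:
var_train = train_out.var()   # approx. Var(x) / keep  = 2.0 here
var_test = test_out.var()     # approx. Var(x)         = 1.0 here
```

With rate 0.5 the variance roughly doubles in training mode, so BN statistics accumulated during training overestimate the spread of the test-time inputs by about a factor of two.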
A small demo of this effect can be found in this notebook.
Based on the research paper, for better performance we should apply BN before dropout.
The correct order is: Conv> Normalization> Activation> Dropout> Pooling
Conv - Activation - DropOut - BatchNorm - Pool -> Test_loss: 0.04261355847120285
Conv - Activation - DropOut - Pool - BatchNorm -> Test_loss: 0.050065308809280396
Conv - Activation - BatchNorm - Pool - DropOut -> Test_loss: 0.04911309853196144
Conv - Activation - BatchNorm - DropOut - Pool -> Test_loss: 0.06809622049331665
Conv - BatchNorm - Activation - DropOut - Pool -> Test_loss: 0.038886815309524536
Conv - BatchNorm - Activation - Pool - DropOut -> Test_loss: 0.04126095026731491
Conv - BatchNorm - DropOut - Activation - Pool -> Test_loss: 0.05142546817660332
Conv - DropOut - Activation - BatchNorm - Pool -> Test_loss: 0.04827788099646568
Conv - DropOut - Activation - Pool - BatchNorm -> Test_loss: 0.04722036048769951
Conv - DropOut - BatchNorm - Activation - Pool -> Test_loss: 0.03238215297460556
Trained on the MNIST data set (20 epochs) with 2 convolution modules (see below). The convolution layers use default padding, and the pooling is MaxPooling; the two modules use different dropout probabilities and feature-map counts.
Edit: When I dropped Dropout, as recommended in some answers, the network converged faster but had worse generalization ability than when using both BatchNorm and Dropout.
Conv / FC - BN - Sigmoid / Tanh - Dropout. If the activation function is ReLU or similar, the order of normalization and dropout depends on your task.
I have read the recommended papers in the answer and the comments from https://stackoverflow.com/a/40295999/8625228
From Ioffe and Szegedy's (2015) point of view, only use BN in the network structure. Li et al. (2018) give statistical and experimental analyses showing that a variance shift occurs when practitioners use dropout before BN; thus, Li et al. (2018) recommend applying dropout after all BN layers.
From Ioffe and Szegedy's (2015) point of view, BN belongs inside/before the activation function. However, Chen et al. (2019) use an IC layer that combines dropout and BN, and Chen et al. (2019) recommend placing BN after ReLU.
To be safe, I use either dropout or BN alone in a network.
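The IC block mentioned above (Chen et al. 2019) can be sketched roughly as batch normalization immediately followed by dropout, placed after the activation and before the next weight layer. This is a minimal NumPy sketch under that reading, not the authors' implementation; learned scale/shift parameters and running statistics are omitted, and the layer sizes and dropout rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def ic_layer(x, rate, training=True, eps=1e-5):
    # "IC"-style block: batch normalization immediately followed by
    # inverted dropout (gamma/beta and moving statistics omitted).
    x = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    if training:
        mask = rng.random(x.shape) >= rate
        x = x * mask / (1.0 - rate)
    return x

# ... -> Weight -> ReLU -> IC -> Weight -> ...
x = rng.normal(size=(128, 8))
h = np.maximum(x @ rng.normal(size=(8, 8)), 0.0)  # weight layer + ReLU
h = ic_layer(h, rate=0.3)                          # BN + dropout together
```

Placing normalization and dropout together in one block, after the activation, is what distinguishes this arrangement from the Conv -> BN -> ReLU -> Dropout ordering in the accepted answer.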
Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. "Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks." CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928.
Ioffe, Sergey, and Christian Szegedy. 2015. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.
Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. "Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift." CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.