On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
1. Summary
This paper studies the generalization gap observed with large-batch SGD: although training loss is comparable, models trained with large batches suffer a noticeable drop in test accuracy. The paper argues that large-batch training tends to converge to sharp minima of the training loss, whereas small-batch training, owing to the inherent noise of its gradient estimates, converges to flat minima that generalize better.
2. Generalization Gap for Large-batch
Empirically, large-batch SGD suffers a drop in test accuracy, and the experiments in the paper show that this is not caused by over-fitting. A candidate explanation is sharpness, viewed through the lens of minimum description length (MDL) theory, which states that statistical models requiring fewer bits to describe tend to generalize better. Since flat minima can be specified with lower precision than sharp minima, they tend to have better generalization performance. Alternative accounts of the link between sharpness and generalization are given through the Bayesian view of learning and through Gibbs free energy.
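One way to make the Bayesian version of this argument concrete (the notation below is my own sketch, not taken from the paper) is the Laplace approximation to the model evidence around a minimum $w^{*}$ of the negative log-posterior:
\[
p(\mathcal{D}) \;\approx\; p(\mathcal{D}\mid w^{*})\, p(w^{*})\, (2\pi)^{n/2}\, \det(H)^{-1/2},
\qquad
H = \nabla^{2}_{w}\bigl[-\log p(w\mid\mathcal{D})\bigr]\Big|_{w=w^{*}}.
\]
A flat minimum has small Hessian eigenvalues, hence a small $\det(H)$ and a larger evidence contribution, mirroring the MDL claim that flat minima need fewer bits to describe.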
3. Sharpness and Explanation
As defined in the paper, given $x \in \mathbb{R}^{n}$, $\epsilon > 0$, and $A \in \mathbb{R}^{n \times p}$, let
\[
\mathcal{C}_{\epsilon} = \bigl\{ z \in \mathbb{R}^{p} : -\epsilon\bigl(\lvert (A^{+}x)_i \rvert + 1\bigr) \le z_i \le \epsilon\bigl(\lvert (A^{+}x)_i \rvert + 1\bigr),\; i \in \{1,\dots,p\} \bigr\},
\]
where $A^{+}$ denotes the pseudo-inverse of $A$ and $\epsilon$ controls the size of the box. Therefore, we define the $\epsilon$-sharpness of $f$ at $x$ as:
\[
\phi_{x,f}(\epsilon, A) := \frac{\bigl(\max_{y \in \mathcal{C}_{\epsilon}} f(x + Ay)\bigr) - f(x)}{1 + f(x)} \times 100.
\]
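As a rough illustration (not the paper's implementation, which solves the inner maximization with L-BFGS-B), the sketch below approximates the metric in the full-space case $A = I$ by random search over the box; the loss function, `eps`, and trial count are placeholder choices of mine:

```python
# A minimal sketch of the epsilon-sharpness metric in the full-space case (A = I),
# approximating the inner maximization by random search; all parameters are illustrative.
import numpy as np

def epsilon_sharpness(loss_fn, x, eps=1e-3, n_trials=1000, rng=None):
    """Approximate phi_{x,f}(eps) = 100 * (max_{y in C_eps} f(x + y) - f(x)) / (1 + f(x))."""
    rng = np.random.default_rng() if rng is None else rng
    half_width = eps * (np.abs(x) + 1.0)          # box half-widths defining C_eps
    f_x = loss_fn(x)
    best = f_x
    for _ in range(n_trials):
        y = rng.uniform(-half_width, half_width)  # candidate perturbation inside the box
        best = max(best, loss_fn(x + y))
    return 100.0 * (best - f_x) / (1.0 + f_x)

# Toy check: a quadratic loss 0.5 * a * ||w||^2 is sharper for larger curvature a.
x_star = np.zeros(10)
for a in (1.0, 100.0):
    print(a, epsilon_sharpness(lambda w, a=a: 0.5 * a * float(w @ w), x_star))
```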
Using this definition of sharpness, the paper shows empirically that the local minima found by large-batch methods have larger sharpness than those found by small-batch methods.
Although both are unbiased gradient estimators, the large-batch gradient has smaller variance, which makes the iterates more likely to get stuck in a sharp minimum. Small-batch gradients, by contrast, carry much more noise; this noise acts as a form of exploration and helps the iterates escape the basins of sharp minima. Large-batch methods lack this explorative property and tend to zoom in on the minimum closest to the initial point.
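A toy sketch of this variance effect (the linear-regression data and batch sizes below are my own illustrative choices, not from the paper): the mini-batch gradient stays unbiased, but its spread around the full-batch gradient shrinks roughly like $1/B$.

```python
# Toy illustration: mini-batch gradients are unbiased, but their variance
# decreases roughly like 1/B, i.e. large batches carry less gradient noise.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
w = np.zeros(d)                                   # fixed point at which gradients are compared

def batch_grad(idx):
    # Gradient of the mean squared error over the examples indexed by `idx`.
    residual = X[idx] @ w - y[idx]
    return X[idx].T @ residual / len(idx)

full_grad = batch_grad(np.arange(n))
for B in (16, 256, 4096):
    samples = np.stack([batch_grad(rng.choice(n, size=B, replace=False)) for _ in range(200)])
    spread = np.mean(np.linalg.norm(samples - full_grad, axis=1) ** 2)
    print(f"batch size {B:5d}: mean squared deviation from full gradient = {spread:.4f}")
```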
However, this definition of sharpness is not entirely rigorous in the deep neural network setting, as later shown in "Sharp Minima Can Generalize for Deep Nets".