Sharp Minima Can Generalize For Deep Nets
1. Summary
This paper criticizes some previous definitions of sharpness (for local minima of the loss function) as unsuitable in the setting of deep neural networks. The underlying reason is that the loss function of a deep neural network typically cannot guarantee a positive definite Hessian around local minima, whereas those definitions of sharpness usually require assumptions, including a positive definite Hessian and smoothness, to guarantee the generalization property.
2. Proof using Re-parameterization Tricks
Deep neural networks suffer from over-parameterization. Consider a one-hidden-layer network with ReLU as activation function, $y = \mathrm{ReLU}(x \cdot \theta_1) \cdot \theta_2$; a transformation on the parameter space that leaves the network's output unchanged is
$$T_\alpha : (\theta_1, \theta_2) \mapsto (\alpha\,\theta_1, \alpha^{-1}\theta_2), \qquad \alpha > 0,$$
which follows from the non-negative homogeneity of ReLU, $\mathrm{ReLU}(\alpha z) = \alpha\,\mathrm{ReLU}(z)$.
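As a quick sanity check (a minimal sketch of my own, not code from the paper; the layer sizes, the value of `alpha`, and the helper names are arbitrary choices), the snippet below verifies numerically that $T_\alpha$ leaves the output of a one-hidden-layer ReLU network unchanged:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, theta1, theta2):
    # One-hidden-layer ReLU network: y = theta2 @ ReLU(theta1 @ x) (shapes are illustrative).
    return theta2 @ relu(theta1 @ x)

rng = np.random.default_rng(0)
theta1 = rng.normal(size=(16, 8))   # hidden-by-input weights
theta2 = rng.normal(size=(4, 16))   # output-by-hidden weights
x = rng.normal(size=8)

alpha = 100.0                       # any alpha > 0 works
y_original = forward(x, theta1, theta2)
y_reparam = forward(x, alpha * theta1, theta2 / alpha)

# ReLU(alpha * z) = alpha * ReLU(z) for alpha > 0, so the two outputs coincide.
print(np.allclose(y_original, y_reparam))  # expected: True
```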
This transformation can be extended to deeper networks, and convolutional layers can also be included. For the sharpness measures defined below,
Definition. Given $\epsilon > 0$, a minimum $\theta$, and a loss $L$, we define $C(L, \theta, \epsilon)$ as the largest (using inclusion as the partial order over the subsets of $\mathbb{R}^n$) connected set containing $\theta$ such that $\forall \theta' \in C(L, \theta, \epsilon),\ L(\theta') < L(\theta) + \epsilon$. The $\epsilon$-flatness will be defined as the volume of $C(L, \theta, \epsilon)$. We will call this measure the volume $\epsilon$-flatness.
Definition. Let $B_2(\epsilon, \theta)$ be an Euclidean ball centered on a minimum $\theta$ with radius $\epsilon$. Then, for a non-negative valued loss function $L$, the $\epsilon$-sharpness will be defined as proportional to
$$\frac{\max_{\theta' \in B_2(\epsilon, \theta)} \big(L(\theta') - L(\theta)\big)}{1 + L(\theta)},$$
we can use the re-parameterization trick to show that, for the neural networks defined above, $C(L, \theta, \epsilon)$ has infinite volume and the $\epsilon$-sharpness can be made arbitrarily large or small, without changing the function the network computes or its generalization power.
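To make this concrete, here is a rough numerical sketch (my own illustration, not code from the paper; the toy dataset, layer sizes, `eps`, the sample count, and names such as `sampled_sharpness` are all assumptions) that lower-bounds the $\epsilon$-sharpness by random sampling in the Euclidean ball, before and after applying $T_\alpha$ with a large $\alpha$. Both parameter settings implement exactly the same function, yet the estimated sharpness differs by orders of magnitude:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss(theta1, theta2, X, Y):
    # Mean squared error of the one-hidden-layer ReLU network on a toy dataset.
    preds = relu(X @ theta1.T) @ theta2.T
    return np.mean((preds - Y) ** 2)

def sampled_sharpness(theta1, theta2, X, Y, eps=1e-3, n_samples=2000, seed=0):
    # Monte-Carlo lower bound on the eps-sharpness:
    #   max_{theta' in B2(eps, theta)} (L(theta') - L(theta)) / (1 + L(theta)).
    rng = np.random.default_rng(seed)
    base = loss(theta1, theta2, X, Y)
    flat = np.concatenate([theta1.ravel(), theta2.ravel()])
    worst = 0.0
    for _ in range(n_samples):
        d = rng.normal(size=flat.shape)
        d *= eps / np.linalg.norm(d)          # sample on the sphere of radius eps
        t1 = (flat + d)[:theta1.size].reshape(theta1.shape)
        t2 = (flat + d)[theta1.size:].reshape(theta2.shape)
        worst = max(worst, loss(t1, t2, X, Y) - base)
    return worst / (1.0 + base)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
theta1 = rng.normal(size=(16, 8))
theta2 = rng.normal(size=(1, 16))
Y = relu(X @ theta1.T) @ theta2.T   # targets realized exactly, so theta is a global minimum

alpha = 1000.0
print(sampled_sharpness(theta1, theta2, X, Y))                  # small
print(sampled_sharpness(alpha * theta1, theta2 / alpha, X, Y))  # vastly larger, same function
```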
Moreover, any Hessian-based or Jacobian-based definition of sharpness is likewise not valid in the deep neural network setting, since the same re-parameterization can rescale the curvature at a minimum arbitrarily.
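A short calculation (my own sketch, not reproduced from the paper; $\nu$ denotes a generic point in parameter space) indicates why Hessian-based measures inherit the same problem. Since $T_\alpha$ is linear, the chain rule gives
$$\nabla^2 (L \circ T_\alpha)(\nu) = T_\alpha^{\top}\, \nabla^2 L\big(T_\alpha(\nu)\big)\, T_\alpha, \qquad T_\alpha = \begin{pmatrix} \alpha I & 0 \\ 0 & \alpha^{-1} I \end{pmatrix}.$$
Because $L \circ T_\alpha = L$ for this network, the Hessian at the observationally equivalent minimum $T_\alpha(\theta)$ equals $T_\alpha^{-1}\, \nabla^2 L(\theta)\, T_\alpha^{-1}$: the block associated with $\theta_1$ is scaled by $\alpha^{-2}$ and the block associated with $\theta_2$ by $\alpha^{2}$, so the Hessian's spectrum can be stretched or shrunk at will without changing the function.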