Sharp Minima Can Generalize For Deep Nets

1. Summary

This paper criticizes some previous definitions of sharpness (for local minima of the loss function) as unsuitable in the setting of deep neural networks. The underlying reason is that the loss function of a deep neural network typically cannot guarantee a positive definite Hessian around its local minima, whereas those definitions of sharpness usually require assumptions, including a positive definite Hessian and smoothness, in order to guarantee generalization.

2. Proof using Re-parameterization Tricks

Deep neural networks suffer from over-parameterization. Consider a one-hidden-layer network $y = \sigma(x\theta_1)\theta_2$ with ReLU as the activation function $\sigma(\cdot)$. A transformation $T_\alpha(\cdot)$ on the parameter space $(\theta_1, \theta_2)$ is

$$T_\alpha(\cdot): (\theta_1, \theta_2) \mapsto (\alpha\theta_1, \alpha^{-1}\theta_2), \qquad \alpha > 0$$
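To make the invariance concrete, here is a minimal sketch (hypothetical layer sizes, NumPy only) checking numerically that $T_\alpha$ leaves the network's outputs unchanged; the only property it relies on is the positive homogeneity of ReLU, $\sigma(\alpha z) = \alpha\,\sigma(z)$ for $\alpha > 0$.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, theta1, theta2):
    # One-hidden-layer ReLU network: y = relu(x @ theta1) @ theta2
    return relu(x @ theta1) @ theta2

# Hypothetical sizes: 3 inputs, 5 hidden units, 1 output, 10 samples
theta1 = rng.normal(size=(3, 5))
theta2 = rng.normal(size=(5, 1))
x = rng.normal(size=(10, 3))

alpha = 100.0
# T_alpha: (theta1, theta2) -> (alpha * theta1, alpha^{-1} * theta2)
y_before = forward(x, theta1, theta2)
y_after = forward(x, alpha * theta1, theta2 / alpha)

print(np.allclose(y_before, y_after))  # True: the represented function is unchanged
```

Because the represented function is unchanged, every training and test loss value is also unchanged, even though $\theta_1$ has been scaled by a factor of 100.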

This transformation can be extended to deeper networks, and convolutional layers can also be included. Now consider the two sharpness measures defined below.

Definition. Given $\varepsilon > 0$, a minimum $\theta$, and a loss $L$, we define $C(L, \theta, \varepsilon)$ as the largest (using inclusion as the partial order over the subsets of the parameter space $\Theta$) connected set containing $\theta$ such that $\forall \theta' \in C(L, \theta, \varepsilon),\ L(\theta') < L(\theta) + \varepsilon$. The $\varepsilon$-flatness will be defined as the volume of $C(L, \theta, \varepsilon)$. We will call this measure the volume $\varepsilon$-flatness.

Definition. Let $B_2(\varepsilon, \theta)$ be a Euclidean ball centered on a minimum $\theta$ with radius $\varepsilon$. Then, for a non-negative valued loss function $L$, the $\varepsilon$-sharpness will be defined as proportional to

$$\frac{\max_{\theta' \in B_2(\varepsilon, \theta)} \big(L(\theta') - L(\theta)\big)}{1 + L(\theta)} \;\approx\; \frac{\big\|(\nabla^2 L)(\theta)\big\|_2\,\varepsilon^2}{2\,\big(1 + L(\theta)\big)},$$

where the approximation on the right follows from a second-order Taylor expansion of $L$ around $\theta$ for small $\varepsilon$.
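As a sanity check on the relation between the definition and its Hessian approximation, the sketch below (a hypothetical 2-D quadratic loss, not taken from the paper) estimates the $\varepsilon$-sharpness by brute force over $B_2(\varepsilon, \theta)$ and compares it with the spectral-norm formula; for a quadratic loss the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D quadratic loss L(theta) = 0.5 * theta^T A theta,
# minimized at theta = 0 with L(0) = 0 and Hessian A.
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
loss = lambda t: 0.5 * t @ A @ t
theta = np.zeros(2)
eps = 0.1

# Brute-force epsilon-sharpness: maximize the loss increase over B_2(eps, theta).
# For a convex quadratic the maximum is attained on the boundary, so sampling
# the sphere of radius eps suffices.
dirs = rng.normal(size=(100_000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
pts = theta + eps * dirs
losses = 0.5 * np.einsum('ni,ij,nj->n', pts, A, pts)
sharp_bruteforce = (losses.max() - loss(theta)) / (1.0 + loss(theta))

# Hessian spectral-norm approximation from the formula above
sharp_hessian = np.linalg.eigvalsh(A).max() * eps**2 / (2.0 * (1.0 + loss(theta)))

print(sharp_bruteforce, sharp_hessian)  # both are approximately 0.022
```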

Using the re-parameterization trick, one can show that for the neural networks defined above, $C(L, \theta, \varepsilon)$ has infinite volume and the $\varepsilon$-sharpness can be made arbitrarily large or small, without changing the function the network represents and hence without changing its generalization power.
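The following sketch (a hypothetical one-hidden-unit regression problem, not an experiment from the paper) illustrates this claim: every point $T_\alpha(\theta)$ is an exact global minimum of the same loss, yet the largest Hessian eigenvalue, and therefore the $\varepsilon$-sharpness via the approximation above, grows without bound as $\alpha$ increases.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical toy data generated by a one-hidden-unit ReLU network
x = rng.uniform(0.5, 1.5, size=50)
a_true, b_true = 1.0, 2.0
y = relu(x * a_true) * b_true

def loss(theta):
    theta1, theta2 = theta
    return np.mean((relu(x * theta1) * theta2 - y) ** 2)

def hessian(f, theta, h=1e-4):
    # Central finite-difference estimate of the Hessian of f at theta
    n = len(theta)
    H = np.zeros((n, n))
    step = np.eye(n) * h
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(theta + step[i] + step[j]) - f(theta + step[i] - step[j])
                       - f(theta - step[i] + step[j]) + f(theta - step[i] - step[j])) / (4 * h * h)
    return H

for alpha in [1.0, 10.0, 100.0]:
    # T_alpha maps the global minimum (a_true, b_true) to another global minimum
    theta = np.array([alpha * a_true, b_true / alpha])
    curvature = np.linalg.eigvalsh(hessian(loss, theta)).max()
    print(f"alpha={alpha:6.1f}  loss={loss(theta):.1e}  "
          f"largest Hessian eigenvalue={curvature:.1e}")
```

In this toy problem the loss stays at (numerically) zero for every $\alpha$, while the largest Hessian eigenvalue at the minimum grows roughly like $\alpha^2$, which also previews why the Hessian-based measures discussed next are problematic.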

Moreover, any Hessian-based or Jacobian-based definition of sharpness suffers from the same re-parameterization issue and is therefore not a valid measure in the deep neural network setting.

results matching ""

    No results matching ""