Exponentially vanishing sub-optimal local minima in multilayer neural networks
Multilayer neural networks, trained with simple variants of stochastic gradient descent (SGD), have achieved state-of-the-art performance in many areas of machine learning. It has long been a mystery why SGD works so well, rather than converging to sub-optimal local minima with high training error (and therefore high test error).
We examine a neural network with a single hidden layer, quadratic loss, and piecewise linear units, trained on a binary classification task with standard normal inputs. We prove that the volume of differentiable regions of the empirical loss containing sub-optimal differentiable local minima is exponentially vanishing in comparison with the corresponding volume for global minima, given "mild" (polylogarithmic) over-parameterization. This suggests why SGD tends to converge to global minima in such networks.
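To make the setting concrete, the following is a minimal NumPy sketch of the kind of model described above: one hidden layer of piecewise linear units, a quadratic loss, and binary labels on standard normal inputs. It is an illustration under stated assumptions, not the construction used in the proofs. The leaky-ReLU slope, the random labels, the layer sizes, and all names (`W`, `z`, `d0`, `d1`, `activation_pattern`) are hypothetical choices made here.

```python
import numpy as np

def leaky_relu(u, slope=0.1):
    # Piecewise linear unit: identity for u > 0, slope * u otherwise.
    # The slope value is an assumption made for this sketch.
    return np.where(u > 0, u, slope * u)

def forward(W, z, X, slope=0.1):
    # X: (N, d0) inputs, W: (d1, d0) hidden weights, z: (d1,) output weights.
    return leaky_relu(X @ W.T, slope) @ z

def mse_loss(W, z, X, y, slope=0.1):
    # Quadratic (mean squared error) loss over the training set.
    return 0.5 * np.mean((forward(W, z, X, slope) - y) ** 2)

def activation_pattern(W, X):
    # Sign pattern of the hidden pre-activations, shape (N, d1).
    # While this pattern stays fixed, the network output is bilinear in
    # (W, z), so the loss is differentiable on that region of weight space.
    return X @ W.T > 0

rng = np.random.default_rng(0)
N, d0, d1 = 200, 20, 10                        # assumed sizes, for illustration only
X = rng.standard_normal((N, d0))               # standard normal inputs
y = rng.integers(0, 2, size=N).astype(float)   # binary labels (random here)
W = rng.standard_normal((d1, d0))
z = rng.standard_normal(d1)

loss = mse_loss(W, z, X, y)
pattern = activation_pattern(W, X)
```

Each distinct value of `pattern` corresponds to one differentiable region of the loss. The result stated above compares the volume of such regions that contain sub-optimal local minima with the volume of those that contain global minima.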