Departmental colloquium 17.11.22
Dr. Yakir Berchenko
Ben-Gurion University
Baroque Simplicity: Simplicity Bias in Overparameterized Machine Learning
The contemporary practice of deep learning has challenged conventional approaches to machine learning and statistics. Specifically, deep neural networks are highly overparameterized models relative to the number of training examples and are often trained without explicit regularization, yet they achieve state-of-the-art generalization performance. A thorough theoretical understanding of this unreasonable effectiveness of deep networks (and other overparameterized models) is still lacking. Previous work suggested that an implicit regularization occurs in neural networks via implicit norm minimization; in particular, minimization of a (generalized) norm was conjectured to be a by-product of the "optimizer", the method by which the network is trained (i.e., stochastic gradient descent (SGD)). However, subsequent theoretical and empirical work provided strong evidence to the contrary.

Here we propose an entirely new approach: instead of assuming that learning models are uniformly probable random objects (prior to training), we suggest that the probability space over models is already biased towards simple functions. We demonstrate that this simplicity bias is a major phenomenon to be reckoned with in overparameterized machine learning.

In addition to explaining the consequences of simplicity bias, we also study its source. Following concrete, rigorous examples, we argue that (i) simplicity bias readily explains generalization in overparameterized learning models such as neural networks; (ii) simplicity bias and excellent generalization are optimizer-independent, as our examples show: although the optimizer affects training, it is not the driving force behind simplicity bias; (iii) simplicity bias in pre-trained models, and in the subsequent posteriors, is universal and stems from the subtle fact that priors constructed uniformly at random are not sampled uniformly at random; and (iv) in neural network models, the biasing mechanism in wide (and shallow) networks differs from that in deep (and narrow) networks. Thus, while (ii) and (iv) take issue with current theories that focus on wide networks and SGD to explain generalization, (i) and (iii) suggest an alternative path.
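
To make the central claim concrete, the following minimal sketch (not taken from the talk; the architecture, widths, and complexity proxy are illustrative assumptions) samples the parameters of a small but overparameterized ReLU network uniformly at random and records which Boolean function each draw computes on all inputs. If the induced prior over functions were uniform, essentially every draw would yield a new function; instead, a handful of low-complexity functions tend to dominate, which is the kind of simplicity bias described above.

# Illustrative sketch only: random parameter draws of an overparameterized
# network induce a highly non-uniform, simplicity-biased prior over functions.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

n_inputs = 5        # Boolean inputs in {0,1}^5 -> 32 input points
width = 256         # hidden width (overparameterized for 32 points)
n_samples = 20000   # number of random parameter draws

# Enumerate all 2^5 = 32 input points once.
X = np.array([[(i >> b) & 1 for b in range(n_inputs)]
              for i in range(2 ** n_inputs)], dtype=float)

def random_network_function(X):
    """Sample one network with i.i.d. Gaussian weights and return the Boolean
    function it computes on all 2^n inputs, encoded as a bit string."""
    W1 = rng.normal(size=(n_inputs, width)) / np.sqrt(n_inputs)
    b1 = rng.normal(size=width)
    w2 = rng.normal(size=width) / np.sqrt(width)
    h = np.maximum(X @ W1 + b1, 0.0)      # ReLU hidden layer
    y = (h @ w2 > 0).astype(int)          # threshold output
    return "".join(map(str, y))

def complexity(bits):
    """Crude complexity proxy: number of 0/1 transitions in the truth table."""
    return sum(a != b for a, b in zip(bits, bits[1:]))

counts = Counter(random_network_function(X) for _ in range(n_samples))

# Under a uniform prior over the 2^32 possible Boolean functions, repeats would
# be essentially impossible in 20k draws; instead a few low-complexity
# functions (typically the constant ones) recur many times.
print("distinct functions seen:", len(counts), "out of", n_samples, "draws")
for f, c in counts.most_common(5):
    print(f"freq={c:5d}  complexity={complexity(f):2d}  f={f}")

The complexity proxy and the width/depth choices here are deliberately simple stand-ins; the talk's rigorous examples and its wide-versus-deep comparison in point (iv) are what make this observation precise.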