Gradient approximations in derivative-free optimization
Most of modern ML relies heavily on gradient descent and its variants. We have some loss function we want to minimize; if the loss function we actually care about is not differentiable, we modify it until we have something that is. Then we minimize it; if the loss function is non-convex, we don't worry too much about that. Typically, we do something like stochastic gradient descent (SGD) on a loss function that corresponds to empirical risk minimization (ERM): at each optimization step, we form a noisy approximation of the loss and its gradient from a random sample of our dataset, and we hope that this noise helps us overcome non-convexity issues. Actually, we typically don't do vanilla SGD; we use techniques like momentum and Adam that account for the curvature of the loss function by reusing gradient estimates from previous optimization steps. In the rarer cases where we don't have lots of samples, we skip the "stochastic" part of SGD, and we often use approximate Newton methods like L-BFGS, which account for curvature (the Hessian), again using gradients from previous steps.
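To make the ERM-plus-SGD loop above concrete, here is a minimal sketch on a toy least-squares loss. Everything here (the loss, the function name `sgd`, the parameters) is illustrative rather than taken from any particular library; real training code would lean on an autodiff framework and an optimizer like Adam rather than this hand-rolled update.

```python
import numpy as np

def sgd(A, b, x0, lr=0.01, batch_size=32, steps=1000, seed=0):
    """Minibatch SGD on the least-squares ERM objective
    phi(x) = (1/m) * sum_j (a_j . x - b_j)^2 over the dataset (A, b)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        # Noisy gradient estimate from a random sample of the dataset.
        idx = rng.choice(len(b), size=batch_size, replace=False)
        residual = A[idx] @ x - b[idx]
        grad = 2.0 * A[idx].T @ residual / batch_size
        x = x - lr * grad
    return x
```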
But there's a subfield of ML that basically defines itself by the fact that it doesn't use gradients: black-box optimization. Why not use gradients? First, maybe the inputs to your loss function are discrete objects that can't readily be represented as real-valued vectors. Second, maybe you don't have a training dataset to perform ERM over; after each optimization step, you actually run an experiment and acquire new data. Unlike ERM, your data isn't i.i.d., so you have to worry about collecting inputs that reasonably cover the space of possible inputs. This is the active learning / reinforcement learning scenario. Third, maybe you're worried that your loss function is so terribly non-convex that even the "stochastic" in SGD will leave you stuck in a sub-optimal local minimum. These examples basically cover the reasonable reasons to do black-box optimization: either (first case) you can't have a gradient, or (second and third cases) you need more than the gradient (and its history). So you need a model of what your loss function might look like over the whole input space.
But you might have unreasonable reasons. Maybe your loss function is not differentiable, and you don't want to bother modifying it to make it so. Or maybe your loss function is differentiable, but you don't want to do the algebra to figure out the gradient. This was actually a common, reasonable reason before the invention of auto-differentiation, and it's a big part of why researchers started studying derivative-free optimization in the first place. Now, though, there is autodiff. (But maybe you're just really unlucky and you have something that PyTorch autodiff can't differentiate, or is just really slow at computing. This is possible, but unlikely.) Note that if your reason is unreasonable, you probably could have a gradient if you tried harder, and it would be all you need ™. Also note that we approximate the non-smooth 0-1 loss not only because it is non-smooth, but also because it's not a very good classification loss; there's a statistical reason we use cross-entropy loss rather than 0-1 loss for early stopping in deep learning. So research into derivative-free optimization in such cases is also unreasonable: a relic of the bad old days before autodiff.
Or is it? In a recent paper, A Theoretical and Empirical Comparison of Gradient Approximations in Derivative-Free Optimization (2022), a bunch of very reasonable people (Berahas, Cao, Choromanski, & Scheinberg) studied this very unreasonable problem. Their analysis is groundbreaking, and IMO it will likely have huge implications for the aforementioned reasonable problems.
To explain why, I first need to justify my claim that derivative-free optimization on unreasonable problems is actually unreasonable. Let's say you're minimizing \(\phi(x)\), where the parameter \(x\) is \(n\)-dimensional. Vanilla gradient descent (under reasonable conditions) converges linearly. But for approximate gradient descent to converge, the approximation error must go to 0 as the iterate \(x\) converges to the minimizer \(x^{*}\). In particular, we require \[|\hat{\nabla}\phi(x) - \nabla \phi(x)| \le | x - x^{*}|,\] where the hat denotes our approximation of the gradient. If the gradient \(\nabla\phi(x)\) is \(L\)-Lipschitz continuous, then \(|\nabla \phi(x)| \le L\,|x - x^{*}|\) (since \(\nabla\phi(x^{*}) = 0\)), so, up to constant factors, this amounts to requiring \[|\hat{\nabla}\phi(x) - \nabla \phi(x)| \le |\hat{\nabla}\phi(x)|/L,\] i.e., the approximation error must be small relative to the size of the estimated gradient. This is not nice for derivative-free optimization. Typically, with derivative-free optimization, you would use some variant of finite-difference descent (FDD). Finite differences have an approximation error that scales like \(O(\sqrt{n})\), and a cost of \(O(n)\) function evaluations per iteration, so things get really bad in high dimensions. Furthermore, recall that FDD (like gradient descent), at least in its vanilla form, ignores information from previous iterates. But gradient estimates from previous steps are helpful: they give a more accurate estimate of the local gradient when the function is noisy, and they allow you to estimate local curvature for faster convergence. And function-value estimates from previous steps are also helpful when you want global optimization and are worried about getting stuck in local optima.
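For reference, here is a minimal sketch of forward finite differences and the resulting FDD loop; this is not the paper's exact estimator or step-size rule, and the names `fd_gradient`, `fdd`, `delta`, and `step` are mine. The point is just that every gradient estimate costs \(n+1\) evaluations of \(\phi\).

```python
import numpy as np

def fd_gradient(phi, x, delta=1e-6):
    """Forward finite-difference gradient estimate: n + 1 evaluations of phi."""
    n = x.shape[0]
    phi_x = phi(x)
    grad = np.empty(n)
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = delta            # perturb the i-th coordinate only
        grad[i] = (phi(x + e_i) - phi_x) / delta
    return grad

def fdd(phi, x0, step=0.1, iters=100, delta=1e-6):
    """Finite-difference descent: gradient descent using the estimate above."""
    x = x0.copy()
    for _ in range(iters):
        x = x - step * fd_gradient(phi, x, delta)
    return x
```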
So what does this paper accomplish? Recall that FDD costs \(O(n)\) function evaluations per iteration, because you perturb each coordinate \(i\) of \(x\) individually and evaluate \(\phi(x+\delta e_{i})\) to estimate a single element of the gradient. The authors propose performing linear interpolation instead: you evaluate the function at \(N\) points around \(x\), and use linear interpolation over those values to estimate the gradient. The only requirements are that the matrix of displacements from \(x\) to the \(N\) points is full rank, and that the points lie in a neighborhood of \(x\). Because these constraints are so minimal, you can actually reuse function evaluations from previous steps! And since the iterates tend to move more slowly as you converge, more and more of those previous evaluations stay within the current neighborhood, so you can reuse more of them exactly when you're converging.
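Here is a minimal sketch of the interpolation idea, written as a least-squares fit; this is my own formulation rather than the authors' exact estimator or sampling scheme, and `interpolation_gradient`, `points`, and `values` are hypothetical names. The key property is that the estimator doesn't care where the evaluated points came from, which is exactly what makes reusing cached evaluations from previous iterations possible, as long as they still lie near the current \(x\) and their displacements span the space.

```python
import numpy as np

def interpolation_gradient(x, phi_x, points, values):
    """Estimate the gradient of phi at x by linear interpolation.

    points: (N, n) array of inputs near x where phi has already been evaluated
            (possibly cached from previous iterations).
    values: (N,) array of the corresponding phi values.
    Solves the least-squares system (y_i - x)^T g ~= phi(y_i) - phi(x),
    which pins down g uniquely when the displacement matrix is full rank.
    """
    D = points - x        # (N, n) matrix of displacements
    r = values - phi_x    # (N,) function-value differences
    g, *_ = np.linalg.lstsq(D, r, rcond=None)
    return g
```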
The proposed approach has some nice benefits for an unreasonable problem, but I think the main long-term impact of this paper will be on reasonable problems. In particular, I think the proposed method will have major implications for reasonable problems where you do have the gradient, but need more than the gradient (and its history), because you care about the global input space. In these settings, black-box optimization methods basically don't use gradient information, because they need more than the gradient. This is incredibly suboptimal! But, black-box optimization methods do retain a history of function evaluations, and the proposed method allows you to estimate an approximate gradient using those evaluations. In the future, I expect black-box optimizers to use this approximate gradient to obtain major improvements in convergence.
Link to paper: https://link.springer.com/article/10.1007/s10208-021-09513-z