calvin mccarter

machine learning, computer science, and more

What follows is an excerpt from Albert Jay Nock’s 1918 essay, Advertising and Liberal Literature. Nock describes how the London Mercury, founded in 1682 as an anti-royalist newspaper, survived and thrived on advertising revenue. Then, as now, pharmaceutical marketing was especially lucrative for the media business.


In the eighteenth issue appears our true friend, our faithful stand-by, the sheet-anchor of newspaper advertising—the patent-medicine man. He makes his initial bow modestly, with a gentle panegyric on the virtues of Spruce Beer, a medicinal drink. A few issues later, however, namely, on September 19, 1682, he comes forth in all his war-paint and feathers in praise of the True Spirit of Scurvygrass.

Many imagine that the psychology of advertising is a modern discovery and that all the tricks of the trade have been worked out of whole cloth in the last quarter of a century or some such matter. To such I earnestly recommend a careful analysis of the Mercury's advertisement of the True Spirit of Scurvygrass. It will encourage them by showing that even if we are now no better than we ought to be, we are at all events no worse than them of old time.

First, the True Spirit of the Scurvygrass is offered to a suffering public because “all are troubled with the Scurvy more or less.” This is an interesting statement, and calculated to start the guileless prowling for symptoms. It has a good force of suggestion; we have all perused more modern advertisements similarly equipped—yea, and in our own flesh have felt each horrid exponent and token rise responsive to the roll-call! Next follows a trade-mark warning, and a plain hint of the prevalence of rebating, or giving dealers a rake-off for pushing one's goods:

Many for Lucre's sake make something which they call Spirit of Scurvygrass, etc., and to promote it both in Town and Country give threepence or a Groat in a Glass to such as will boast and cry it up and dispraise far better than what they sell.

Beware of imitations! Refuse substitutes! None other is genuine! There is nothing particularly new about this, either; we have heard of it before, even to the rebating.

Then follows a courteous and ingenious effort to break the news gently, for which everyone is properly grateful, of course, but yet in spite of it—in spite of the tender solicitude for the Meaner sort, in spite of the transparent purity of the designs upon the Rich in behalf of their Poor Neighbours— one can not help noticing that this remedy was sold at what appears, for those days, a rousing price:

In order that the Meaner sort may easily reach it and the Rich be induced to help their Poor Neighbours, it is ordered to be sold for Sixpence a Glass.

About 1706 the patent-medicine ads begin to crowd all others out of the newspapers—a sure indication that they could and did pay a higher rate. No wonder! No wonder, either, that they were the only ads to survive the imposition of the devastating tax on advertisements some six years later. The True Spirit of Scurvygrass at sixpence a throw in a country where all are troubled with the Scurvy more or less, must have been a moneymaker. Its extremely wide range of therapeutic virtue also no doubt helped its sale. It would cure anything—anything. When the advertiser gets really warmed up to his work he rises to the strain of Dr. Dulcamara in the Elisir d'Amore:

Upon trial you will perceive this Spirit to root out the Scurvy and all its Dependents; as also to help Pains in the Head, Stomach, Shortness of Breath, Dropsies, lost Appetite, Faintness, Vapours, Wind in any Part, Worms, Itching, Yellowness, Spots, etc. Loose Teeth and Decayed Gums are helped by rubbing them with a few drops, as also any Pain in the Limbs….

And so forth and so on. A dose of the True Spirit was a potshot at the whole category of ills that flesh is heir to. If it didn't get what it went after, it would bag something else. It never fired any blank cartridges.

The True Spirit of Scurvygrass was first advertised in the Mercury on September 19, 1682. In the next issue, September 22, under an ad for a lost gold watch, appears an ad of imposing length—a whole half column of it—proclaiming—

the Old and True Way of Practicing Physick, revived by Dr. Tho. Kirleus, His Majesty's Sworn Physician in Ordinary, presented by the Rt. Hon., the Earl of Shaftesbury, and approved by the most competent judges of the Art, the College of Physicians, under their Hands and Seal.

Thus it appears that, like his latter-day brethren who advertise, Dr. Tho. Kirleus was “a graduate physician in regular standing.” But whatever his professional status may have been, Dr. Tho. was a master of the art of advertising. Within the space of forty-two words—only forty-two words—this remarkable man manages to crowd nearly every trick of the modern medicine-monger:

he gives his Opinion for nothing to any that writes or comes to him, and safe Medicines for little, but to the Poor for Thanks; and in all Diseases where the Cure may be discerned, he expects nothing until it be cured.

Analyze this prospectus. Consultation gratis; consultation by mail; “harmless vegetable remedy”; free treatment for those unable to pay; no cure, no pay. Only one thing is missing; and it is supplied in the very next sentence by the swift and masterly hand of Dr. Tho.:

Of the Gout he cured himself ten years since, when crippled with Knots in his Hands and Feet, but now able to go with any Man of his age ten or twenty Miles.

There we have it! That last touch rounds out the advertisement, makes it perfect, and establishes an open channel of communication with the enterprise of our modern age! “One who has suffered from rheumatism for seventeen years, etc., etc., will send by mail, etc., etc.” How pleasant and restful and thoroughly at home it makes one feel to be rewarded with finds like this among the dust and ashes of the lamented past, before the era of commercialism had set in!

A fascinating exchange happened on the latest All In podcast: the besties started talking about personal servers as the future architecture of AI deployment.

https://www.youtube.com/watch?v=5cQXjboJwg0&t=2355s

On the surface, it would appear that personal servers have little to do with AI and nothing to do with the VCs who invest in AI. But the connection becomes more apparent when we ask, “Why did software move to the web in the first place?” The answer is twofold:

  1. A lot of what one does on the computer involves other people: responding to comments by your daddy on Facebook, buying a new Scrub Daddy sponge on Amazon, or watching the latest Call Her Daddy podcast on Youtube or Spotify. (I apologize for cracking cringey dad-jokes on this Father's Day.)

  2. A lot of what one does on the computer requires a level of intelligence that your computer does not have. When you used a desktop application which you installed via CD, you were limited by the fixed amount of intelligence embedded in the code of the software on that CD. When you use a SaaS product delivered via the web, you are utilizing the entire, dynamic intelligence of the corporate team that delivers the software, who are continuously fixing bugs and adding features as needed.

AI has the potential to fix the second problem. If AI achieves its prophesied power and if such powerful AIs can be run locally, it will unleash the power of your personal computer. More precisely, it will unleash the power of apps that run on your personal computer, either by acting as your personal assistant to call the APIs of those apps, or by acting as an “intelligence forklift” that those apps can call to make their own behavior more intelligent.

AI does not have the potential to fix the first problem. (Well, it will have the potential to fix the first problem, to the extent that you stop wanting to interact with other people. But let’s ignore the Snow Crash scenario where you spend all your time with your AI waifu, instead of interacting with your dad, washing your dishes, and watching celebrity gossip.) And therein lies the rub for all the VCs who want to invest in AI. As long as your interactions with other people are mediated by platform oligopolies, those platform oligopolies will have exclusive access to the resources (the data and the networks) needed to build and sell the AIs that solve the second problem.

So, while the dream of personal servers is as enchanting as it’s ever been, the reason the All In besties are talking about it right now is that they’re engaged in a necessary bit of wishful thinking. If personal servers do not become a thing, the value produced by the AI revolution will be captured primarily by platform incumbents, leaving only the scraps for startups and VCs — and even less for the average user. (Finbarr Timbers — previously @ DeepMind, now @ Midjourney — has an excellent new essay describing the technical basis for these dynamics.) So VCs in AI have to tell themselves that personal servers will win, in order to justify the belief that AI promises 100x or even 10x returns.

This raises the question of whether personal servers are the future, or whether they will always be the future — a digital Brazil, if you will. That in turn leads to the question of Urbit, which — for reasons that are not entirely clear — has existed in a market with strangely no competition, from its conception in 2002 as a one-man art project until today. The key question hovering over Urbit as a platform can be posed, analogously to the “AI takeoff” question, as a timeline question: will personal servers take off in a few years, several years from now, or never?

If personal servers take off in a few years, this means that Urbit will have won, and that it will have won quickly enough to empower AI personal servers. If personal servers do take off within a few years, there is little time for an Urbit competitor to catch up and take off, and thus Urbit (as the future victor) has only a few engineering scalability and PMF problems remaining which need to be solved. After maybe 1-2 more Kelvin updates of performance improvements, and 1-2 killer apps, Urbit will go viral, and then will be well-placed as a platform for AI-enabled startups and AI-enabled users to seize the means of computing.

If personal servers take off several years from now, this could mean that Urbit will not succeed within a few years. This suggests that there remain a few key problems requiring novel ideas, or that requiring developers to learn a new language was a bridge too far. This in turn suggests that the future champion of personal servers will be a different platform that finally fully solved the personal server “constraint satisfaction problem”, though likely building on the innovations of Urbit. In this possible future, Urbit would likely have a fate similar to that of Twitter and Friendster, known as an innovator that lost due to first-mover disadvantage. (Another apt comparison would be the Torch deep learning framework, whose key innovations inspired PyTorch, but whose popularity was stymied by the developers’ choice of the Lua language.) This scenario leaves the future very uncertain for user-centric AI. If the deployment phase of AI takes a surprisingly long time, this leaves plenty of time for user-centric AI to establish roots. Or if AI takeoff is fast, then user-centric AI will wind up as no more than VC wishcasting.

If personal servers never take off, then this means that personal servers were doomed from the start due to unsolvable incentive structure problems, or that their fate was sealed by a series of historical contingencies. In either case, this is good news for any investor who is sufficiently diversified among the large tech incumbents. Whether this is good news or bad news for a technology user is left as an exercise to the reader.

The educated man, as some still think, is one whose existence is not isolated in the present, whose intellectual and emotional life is consciously joined to the deep currents of evolution, moving from the far past to the invisible future. History is a large part of such an education, and the modern languages may claim their share. But the source and fountain of it all is that classical world in which lie the beginnings of our civilisation. He who can trace his intellectual pedigree back to those origins is among scholars what the aristocrat of ancient family is in society. His taste does not fluctuate with the passing whims of the hour, for his imagination is schooled to contemplate things in long duration. He loses his facile admirations and acquires judgment; his delight in beauty is still and deep. “Sir,” said Dr. Johnson once to Boswell, “as a man advances in life he gets what is better than admiration—judgment—to estimate things at their true value.” To be trained in the classics is to graft the faculty of age on the elasticity of youth. The flimsy arguments of fanatics and charlatans break on such a man without effect, for he knows the realities of human nature, knows what is permanent and what is ephemeral.

— Paul Elmer More

“The Value of Academic Degrees,” The Bookman, Vol. XXIII, 1906.

I'm pretty bullish on 3d printing. This bullishness seems to be spreading, with recent articles in the Wall Street Journal and The Guardian proclaiming that 3d printing is finally ready to transform the world. Before expanding on this, it’s important to clarify what “3d printing” means. Technically, “3d printing” specifically refers to additive manufacturing as an alternative to traditional subtractive manufacturing methods. Traditionally, parts have been made subtractively. A manufacturer will typically buy raw materials that have been prepared into larger, standardized forms; they will then cut these (with CNC milling machines or laser cutters) into the parts for their manufactured products. In contrast, “3d printing” usually involves a material extrusion method such as fused deposition modeling where the raw material hardens in its final form. However, I’ll also be using “3d printing” as a synecdoche for a variety of methods where you bring manufacturing of parts in-house by having an automated system which manufactures physical objects from CAD designs.

First, I don't think it's fully appreciated that manufacturers are not simply an intermediary between raw materials & components and finished goods: they are also an intermediary between suppliers and vendors who are difficult to work with, and customers who expect a nice and simple purchasing experience. Just as enterprise SaaS has closed the gap between pleasant consumer software and unpleasant business software, 3d printing is trying to close the experience gap. These days, companies are more serious about closing the experience gap, partly because the traditional experience has gotten worse, and partly because the alternative (3d printing and related methods) has gotten better. Setting aside the Covid-related supply chain issues, there are longer-term issues with the traditional system, due to rising wages in China and retiring Boomers in the US.

Second, 3d printing has improved in terms of the variety of parts which can be produced with high quality and reliability. This is especially due to “network effects” among the different methods and materials. In addition to plastics, one can now do 3d printing of certain metals (eg Desktop Metal) and carbon fiber (eg Markforged). Another example is how Rack Robotics has an electrical discharge machining (EDM) system for cutting through solid metal, and their system incorporates (and can be further customized using) 3d printed plastic parts. Manufacturers can also now fill in gaps in their in-house capabilities (eg low-volume custom parts where unit costs matter less) by buying from 3d printing “foundries” where you simply upload your CAD file. I've done this as a hobbyist for metal parts that my Creality Ender can't make, and it's mind-blowingly simple. Finally, the rise of the open-source hardware community (eg Printables and Thingiverse for designs, LaserWeb for generating G-code instructions from CAD designs, and numerous forums where one can ask for help) makes it so much easier to bring manufacturing in-house.

Regarding 3d printing vs machining, the bear case for 3d printing is that it's nearly reached its asymptotic limit of adoption. And here we get back to the additive versus subtractive manufacturing distinction. Strictly speaking, the PowerCore EDM system of Rack Robotics is not 3d printing, even though it uses 3d printers for its kinematics, because it is subtractive in nature. And there are some fundamental tradeoffs between additive and subtractive manufacturing. The recent article in The Guardian correctly notes that additive manufacturing has lower material costs due to the absence of wasted material. However, it has an energy cost roughly proportional to the volume of the produced part. Milling, laser cutting, and EDM approaches to machining have energy cost roughly proportional to the surface area of the produced part, while ECM has energy cost roughly proportional to the volume removed from the produced part. (The production time varies similarly to the energy cost.) If all these approaches equalize in terms of quality and automatability, what's left are material costs and energy costs. And one can see that additive and subtractive approaches will each have their respective niches.
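A toy calculation makes the scaling argument concrete. The constants below are made up for illustration; what matters is only that additive energy scales with volume while subtractive energy scales with surface area, so each approach wins at a different part size:

```python
# Toy illustration (made-up constants) of the scaling argument above:
# additive energy ~ k_add * volume, while subtractive (milling/laser/EDM)
# energy ~ k_sub * surface area. For a solid cube of side s, volume grows
# as s^3 but surface area only as s^2, so the additive approach is cheaper
# for small parts and the subtractive approach for large ones.

def additive_energy(side, k_add=1.0):
    """Energy ~ volume of the printed part (hypothetical constant k_add)."""
    return k_add * side ** 3

def subtractive_energy(side, k_sub=5.0):
    """Energy ~ surface area cut (hypothetical constant k_sub)."""
    return k_sub * 6 * side ** 2

for side in [1, 10, 100]:
    a, s = additive_energy(side), subtractive_energy(side)
    print(f"side={side:>3}: additive={a:.0f}, subtractive={s:.0f}, "
          f"additive cheaper: {a < s}")
```

With these (hypothetical) constants, additive wins below the crossover at side = 30 and subtractive wins above it, which is the “respective niches” point.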

Furthermore, 3d printing currently enjoys an automation advantage, because it has been developed more recently by innovative startups, whereas machining tool companies are behind the times. One would expect machining systems to eventually support automation equally well, and thus for this advantage to erode. There's also an interesting AI angle to automation. CNC machining requires humans to generate G-code from a CAD design, whereas 3d printers generally require only the CAD design. However, ChatGPT can already generate G-code, and I expect rapid improvements in conditionally generating G-code from CAD drawings.

Overall, I’m extremely bullish on automated, vertically-integrated manufacturing. But I think there’s nuance on the extent to which this will involve additive versus subtractive approaches.

Disclosure: I am long SHPW, DM, MKFG, and an angel investor in Rack Robotics.


April 15, 2023 update:

Further evidence of the asymptotic advantages of machining is Relativity Space’s shift from its 3d-printed Terran 1 rocket to its traditionally-manufactured (and much larger) Terran R rocket:

Since Ellis unveiled plans for Terran R two years ago, the rocket’s design has continued to evolve. But Relativity’s update on Wednesday features its most dramatic change yet, with the 3D-printing specialist incorporating an aluminum alloy into the rocket’s initial models through manufacturing “tank straight-section barrels” – a practice that is more traditionally common in aerospace.

Relativity made a name for itself with its 3D-printing approach to manufacturing rockets, building massive additive manufacturing machines. The company 3D-printed about 85% of the mass of its Terran 1 rocket, and previously planned to get that number above 90%. Ellis declined to specify what percent of Terran R will now be 3D-printed in the company’s new “hybrid manufacturing approach,” emphasizing instead that the shift is to prioritize its timeline to first launch.

“We’re using printing everywhere else strategically to really reduce the vehicle complexity,” Ellis said. “We can actually take the more simple, straight sections of the vehicle and build them traditionally and not have a huge decrement to the amount of difficulty that it is to build.”

Nathan Tankus recently wrote about the quandary faced by central banks and regulators. The essential problem is that when the government saves the financial system with a bailout, it sets a bad precedent. After the crisis, financial actors will be incentivized to engage in risky behavior, believing that a future bailout is likely. So, once the crisis is over, legislators often attempt to reduce this moral hazard by restricting the central bank from repeating its past actions in future crises. But these limitations can make it more difficult for central banks to contain future crises.

This is a really tricky problem, but we can find inspiration from outside the realm of finance. In particular, it's interesting to look at situations where people did bad things in the past that need to be deterred in the future, but punishing those people would have too much collateral damage, and it's not even clear where to draw the line about who and what to punish because “Moloch does it.” One example is how China addressed the Cultural Revolution after it was over. To enable economic growth, it needed to be clear that another Cultural Revolution would never be allowed to happen again. But sending this message by litigating responsibility would cause too much turmoil that would also detract from rebuilding the economy, and so it was necessary to basically forgive those responsible and memory-hole the whole episode. The solution in China's case was a pseudo regime change: bringing a Party victim of the Cultural Revolution, Deng Xiaoping, into power and allowing him to wholly restructure economic policy.

Analogously, one might argue that the Fed should do what is necessary to ensure financial stability now — even if it sets bad precedent. But then, in the aftermath, instead of adding new limitations to Fed powers that are both of doubtful credibility (and thus still pose moral hazard) and also possibly overly restrictive during an actual emergency, one should instead create a new institution NotTheFed. Because NotTheFed is not the Fed, the bad precedents of the Fed no longer pose moral hazard, and NotTheFed can credibly pinky-promise to not follow Fed-set precedents. Meanwhile, because NotTheFed is a clean-slate institution, it will not be hamstrung by pesky 13(3) rules during the next crisis. In other words, this pseudo regime change in the regulatory system reduces moral hazard in the boom part of the cycle, while giving NotTheFed more flexibility during the next financial bust.

Of course, the problem with this scheme is that once NotTheFed sets bad precedent, you'll need to dismantle it and repeat the cycle with NotNotTheFed, which (logicians tell us) equals The Fed.


The mysterious, haunting lyrics to Avalanche by Leonard Cohen have received a variety of interpretations. The speaker of the song has been said to be a supernatural figure (God or Satan) or a mere human (a jilted lover or a lonely hunchback). But the speaker can instead be understood as a personification of Planet Earth.

Well I stepped into an avalanche
It covered up my soul;
When I am not this hunchback that you see
I sleep beneath the golden hill
You who wish to conquer pain
You must learn, learn to serve me well.

At first, we think we hear the voice of a person buried under a tumult of mud and rocks. But Cohen will flip around the imagery: it is Earth that has been overwhelmed by the explosion of human civilization, and found itself buried under asphalt and garbage. The comparison between Earth and a hunchback holds in two ways. First, the spherical curvature of the planet is being compared to the arched back of the hunchback. Second, this could be an allusion to Victor Hugo’s Quasimodo. Just as Quasimodo likened himself to an ugly vase which holds beautiful flowers, Cohen alludes to the lifeless ground which supports golden fall foliage. Humanity, which wishes for a happy existence, is therefore urged to revere the planet.

You strike my side by accident
As you go down for your gold
The cripple here that you clothe and feed
Is neither starved nor cold;
He does not ask for your company
Not at the centre, the centre of the world.

This stanza is about the environmental degradation of mining, which Cohen also mourns in the climax of Steer Your Way. In that song, he also personifies the strip-mined mountains, alluding to Christ’s death: “They whisper still, the injured stones / The blunted mountains weep / As he died to make men holy / Let us die to make things cheap.” Here, he compares the piercing of the Earth with gold mines to the spear plunged into Jesus’ side during the Crucifixion. The latter half of the stanza is a likely reference to the littered food, clothing, and (rather morbidly) the dead miners left behind underground. Earth insists that such sacrificial offerings were never asked for!

When I am on a pedestal
You did not raise me there
Your laws do not compel me
To kneel grotesque and bare
I myself am the pedestal
For this ugly hump at which you stare.

The environmental movement is the focus of this stanza. Notably, Avalanche came out in 1971, during the emergence of the modern environmental movement — a year after the first Earth Day, and the same year as the founding of Greenpeace. Humans began to lift Earth up on a pedestal, but Earth does not want humans to feel self-satisfied about this. After all, it is Earth that is the pedestal upon which all humans stand. Cohen also makes reference to how deforestation has made Earth look hideous. This stanza potentially extends the Hunchback of Notre Dame metaphor, comparing Earth’s sufferings to the flogging of Quasimodo.

You who wish to conquer pain
You must learn what makes me kind;
The crumbs of love that you offer me
They're the crumbs I've left behind
Your pain is no credential here
It's just the shadow, shadow of my wound.

The little that humans do for the planet is but a fraction of the Earth’s bounty that it bestows on humans. That humans suffer does not expiate our guilt; nor does our desire to reduce our own suffering permit us to harm the planet. Human suffering cannot be balanced against the suffering of the biosphere, because humanity is itself a part of life on Earth — and a small part at that.

I have begun to long for you
I who have no greed
I have begun to ask for you
I who have no need
You say you've gone away from me
But I can feel you when you breathe.

This is a likely reference to space travel: the first Moon landing was in 1969. Like a jealous lover, the Earth is beginning to feel possessiveness towards a humanity that dreams of leaving the planet behind. Why can Earth feel you when you breathe? From the Earth’s giant perspective, humans in space are still relatively close.

Do not dress in those rags for me
I know you are not poor
You don't love me quite so fiercely now
When you know that you are not sure
It is your turn, beloved
It is your flesh that I wear.

Earth speaks again to environmentalists, who metaphorically wear sackcloth and ashes to repent for humanity’s misdeeds. Earth doubts their sincerity and conviction; they’re still enjoying their First World lifestyles.

The final two lines are of special interest, because Leonard Cohen modified them during some of his later live performances, such as in Zurich 2013, to “This is your world beloved / It is your flesh that I wear.” Earth tells us that it is our world; it loves us and it belongs to us. The song concludes with a powerful image: just as we humans cover our bodies with clothing, the planet has clothed itself with us — with human civilization.

I think Adam Mastroianni has done a fine job critiquing peer review as a barrier to entry and as a service to paper-readers, but that's not how it really functions in practice, for better or worse. Peer review is a system by which random, anonymous paper-writers are able to force other researchers to read their work. In other words, peer review is a service to paper-writers. It is especially a service to paper-writers lacking prestigious affiliations and notable previous work — authors whose papers would go unread if they just posted on arXiv. So in this sense, peer review is a path to entry, not a barrier to entry.

This is why being a reviewer is so painful: it imposes a uniform prior over paper quality, even though we know that some researchers do much better work than others, and we would prefer to focus on reading their work. This is also why even peer review is not a guarantee of quality: the process ignores said priors, and it reduces the incentive to develop a reputation as someone who only submits quality work.

At the same time, blinded paper reviewing is probably the only way to identify good work coming from new and unknown researchers. In other words, there is a precision-recall tradeoff to peer review. If reviewers were not forced to read papers they otherwise would not, precision would increase but recall would decrease. Yitang Zhang’s recent groundbreaking work on prime number gaps is a striking example of this. Because he hadn’t published anything since 2001, had no prior background in number theory, and held a lowly lecturer position, his paper received attention only because the reviewers at Annals of Mathematics were forced to read it.

Furthermore, Mastroianni is a prominent researcher with Twitter clout, so it is no surprise that he had no difficulty getting 110+ comments in reply to his Substack post about bias in human imagination. But if a grad student at Podunk State University wrote the exact same thing, it would very likely go unread. This means that graduate students will, rather than spending time fulfilling the demands of peer reviewers, instead spend time trying to obtain Twitter clout. The easiest way to obtain Twitter clout is to write hot takes and engage in herding behavior on politicized topics. Thus, removing peer review would hurt recall further, by making it harder for boring, non-hot-taking researchers to disseminate their findings.

Most of modern ML is heavily reliant on gradient descent and its variants. We have some loss function we want to minimize; if the loss function we actually care about is not differentiable, we modify it until we have something that is. Then we minimize it; if the loss function is non-convex, we don't worry too much about that. Typically, we do something like stochastic gradient descent (SGD) on a loss function that corresponds to empirical risk minimization (ERM). At each optimization step, we have a noisy approximation of the loss function and its gradient, computed from a random sample of our dataset. We hope that this noise helps us overcome non-convexity issues. Actually, we typically don't do vanilla SGD; we use techniques like momentum and Adam, which adapt to the curvature of the loss function by utilizing estimates of the gradient from previous optimization steps. On the rarer occasions when we don't have lots of samples, we skip the “stochastic” part of SGD, and we often use approximate Newton methods like L-BFGS, which account for curvature (the Hessian) using gradients from previous steps.
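As a concrete sketch of the workhorse described above, here is minibatch SGD with momentum on a least-squares ERM problem. The synthetic dataset and all constants are illustrative, not from any particular paper:

```python
import numpy as np

# Minibatch SGD with momentum on an ERM objective: least-squares loss
# over a synthetic dataset (all names and constants are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
velocity = np.zeros(5)
lr, beta, batch = 0.05, 0.9, 32

for step in range(500):
    idx = rng.integers(0, len(X), size=batch)        # random minibatch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch  # noisy gradient estimate
    velocity = beta * velocity + grad                # running average of past gradients
    w -= lr * velocity

print(np.round(w, 2))  # close to w_true = [1, 2, 3, 4, 5]
```

The `velocity` buffer is the “estimates of the gradient from previous optimization steps” in action: each update is a geometrically-decaying average of past minibatch gradients, which damps the noise of any single minibatch.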

But there's a subfield of ML that basically defines itself by the fact that it doesn't use gradients: black-box optimization. Why not use gradients? First, maybe the inputs to your loss function are discrete objects that can't sensibly be represented as real-valued vectors. Second, maybe you don't have a training dataset to perform ERM over; after each optimization step, you'll actually run an experiment and acquire data. Unlike in ERM, your data isn't i.i.d., so you have to worry about collecting inputs that reasonably cover the possible space of inputs. This is the active learning / reinforcement learning scenario. Third, maybe you're actually worried that your loss function is so terribly non-convex that even the “stochastic” in SGD will leave you stuck in a sub-optimal local minimum. These examples basically cover the reasonable reasons to do black-box optimization: either (first case) you can't have a gradient, or (second and third cases) you need more than the gradient (and its history). So you need a model of what your loss function might look like over the whole input space.
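A minimal example of the black-box setting is (1+1) random search: it only ever queries the loss as a black box, never a gradient, and is perfectly happy with a non-differentiable objective. This is illustrative code, not any particular published method:

```python
import numpy as np

# (1+1) random search: propose a random perturbation, keep it only if it
# improves the loss. The loss is treated purely as a black box.
def random_search(loss, x0, step=0.5, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    x, best = x0.copy(), loss(x0)
    for _ in range(iters):
        candidate = x + step * rng.normal(size=x.shape)  # random perturbation
        val = loss(candidate)                            # one black-box query
        if val < best:                                   # keep only improvements
            x, best = candidate, val
    return x, best

# Works even on a non-differentiable loss:
loss = lambda x: np.abs(x - 3.0).sum()
x, best = random_search(loss, np.zeros(2))
print(x, best)  # x ends up near [3, 3]
```

Note how wasteful this is compared to gradient descent: each query yields one scalar, and rejected candidates are thrown away entirely, which is why serious black-box methods build a model of the loss surface from all past queries.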

But you might have unreasonable reasons. Maybe your loss function is not differentiable, and you don't want to bother modifying it to make it so. Or maybe your loss function is differentiable, but you don't want to do the algebra to work out the gradient. Before the invention of automatic differentiation, this was actually a common, reasonable reason, and a non-trivial motivation for researchers to study derivative-free optimization. Now, though, we have autodiff. (Maybe you're just really unlucky, and you have something that PyTorch's autodiff can't differentiate, or is really slow at computing. This is possible, but unlikely.) Note that if your reason is unreasonable, you probably could have a gradient if you tried harder, and it would be all you need™. Note also that we approximate the non-smooth 0-1 loss not only because it is non-smooth, but also because it's not a very good classification loss; there's a statistical reason we use the cross-entropy loss rather than the 0-1 loss for early stopping in deep learning. So research into derivative-free optimization in such cases is also unreasonable: a relic of the bad old days before autodiff.

Or is it? In a recent paper, A Theoretical and Empirical Comparison of Gradient Approximations in Derivative-Free Optimization (2022), a bunch of very reasonable people (Berahas, Cao, Choromanski, & Scheinberg) studied this very unreasonable problem. Their analysis is groundbreaking, and IMO it will likely have huge implications for the aforementioned reasonable problems.

To explain why, I first need to justify my claim that derivative-free optimization on unreasonable problems is actually unreasonable. Let's say you're minimizing \(\phi(x)\), where the parameter \(x\) is \(n\)-dimensional. Vanilla gradient descent (under reasonable conditions) has linear convergence. But for approximate gradient descent to converge, the approximation error must go to 0 as the iterate \(x\) converges to \(x^{*}\). In particular, we require \[\|\hat{\nabla}\phi(x) - \nabla \phi(x)\| \le \| x - x^{*}\|,\] where the hat denotes our approximation of the gradient. If the gradient \(\nabla\phi(x)\) is \(L\)-Lipschitz continuous, then it suffices that \[\|\hat{\nabla}\phi(x) - \nabla \phi(x)\| \le \|\hat{\nabla}\phi(x)\|/L.\] This is not nice for derivative-free optimization. Typically, with derivative-free optimization, you would use some variant of finite-difference descent (FDD). Finite differences have error \(\sim O(\sqrt{n})\) and computation cost of \(O(n)\) function evaluations per iteration, so they get really bad in high dimensions. Furthermore, recall that FDD (like gradient descent), in the vanilla versions at least, ignores information from previous iterates. But gradient estimates from previous steps are helpful, both for more accurate estimation of the local gradient when the function is noisy, and for estimation of local curvature for faster convergence. And function-value estimates from previous steps are also helpful when you want global optimization and are worried about getting stuck in local optima.
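To see where the \(O(n)\) cost per iteration comes from, here is a sketch of a forward finite-difference gradient estimate on an assumed toy quadratic (the function `fd_gradient` and the choice of `phi` are my own illustrations): one base evaluation plus one evaluation per coordinate direction.

```python
import numpy as np

def fd_gradient(phi, x, delta=1e-6):
    """Forward finite differences: n + 1 function evaluations for an n-dim x."""
    n = x.shape[0]
    base = phi(x)
    grad = np.empty(n)
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0                                  # perturb coordinate i only
        grad[i] = (phi(x + delta * e_i) - base) / delta
    return grad

phi = lambda x: 0.5 * np.dot(x, x)                    # toy quadratic; true gradient is x
x = np.array([1.0, -2.0, 3.0])
g_hat = fd_gradient(phi, x)
```

Every element of the estimate requires a fresh evaluation at the current iterate, and nothing computed here can be reused at the next iterate.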

So what does this paper accomplish? Recall that FDD has cost \(O(n)\) per iteration, because you perturb each coordinate \(i\) of \(x\) individually and evaluate the function \(\phi(x+\delta e_{i})\) to estimate a single element of the gradient. The authors propose performing linear interpolation instead: you evaluate the function at \(N\) points around \(x\), and use linear interpolation to estimate the gradient. The only requirements are that the set of \(N\) displacement directions be full-rank, and that the points lie in a neighborhood of \(x\). Thanks to these minimal constraints, you can actually reuse function evaluations from previous steps! And since you tend to slow down as you converge, you'll be able to reuse more and more of those previous evaluations.
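Here is a minimal sketch of the interpolation idea on an assumed toy quadratic (the name `interp_gradient`, the displacement scale, and the number of points are my own illustrative choices, not details from the paper): stack the displacements into a matrix \(D\), collect the function-value differences, and solve the linear system \(Dg = f\) in the least-squares sense.

```python
import numpy as np

def interp_gradient(phi, x, displacements):
    """Estimate the gradient at x by fitting a local linear model to
    function values at the points x + d_k."""
    D = np.asarray(displacements)                     # shape (N, n), full column rank
    f = np.array([phi(x + d) for d in D]) - phi(x)    # value differences f_k
    g, *_ = np.linalg.lstsq(D, f, rcond=None)         # least-squares solve of D g = f
    return g

phi = lambda x: 0.5 * np.dot(x, x)                    # toy quadratic; true gradient is x
x = np.array([1.0, 2.0])
rng = np.random.default_rng(0)
D = 1e-5 * rng.normal(size=(6, 2))                    # 6 nearby points, e.g. reused evaluations
g_hat = interp_gradient(phi, x, D)
```

Because the displacements only need to be full-rank and nearby, the rows of `D` (and their function values) can come from earlier iterations rather than fresh evaluations.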

The proposed approach has some nice benefits for an unreasonable problem, but I think the main long-term impact of this paper will be on reasonable problems: those where you do have the gradient but need more than the gradient (and its history), because you care about the global input space. In these settings, black-box optimization methods basically don't use gradient information at all, precisely because they need more than the gradient. This is incredibly suboptimal! But black-box optimization methods do retain a history of function evaluations, and the proposed method allows you to estimate an approximate gradient from those evaluations. In the future, I expect black-box optimizers to use this approximate gradient to obtain major improvements in convergence.


Link to paper: https://link.springer.com/article/10.1007/s10208-021-09513-z

In a philosophical moment after Twitterers voted for his ouster, Elon Musk tweeted, “Those who want power are the ones who least deserve it.” Musk’s statement exemplifies modern society’s desire for a Tolkienic Hero, a figure who reluctantly holds the reins of power to change the course of history. As Tanner Greer pointed out in his essay introducing the idea, this heroic ideal is popular today, but was an aberration in the past. How did this vision of leadership imprint itself in our minds, and how should we evaluate it?

As Greer discusses, the slow collapse of the Ancien Régime over the 19th and 20th centuries prepared the way for the new Tolkienic ideal. But while his literary innovation may seem vaguely liberal or left-wing, Tolkien himself was more of a nostalgic conservative. And Tolkien’s perspective was likely inspired by the famous aphorism of his fellow Catholic Englishman Lord Acton, that “Power tends to corrupt and absolute power corrupts absolutely.” In fact, the origins of this idea go back further still, to the Whig statesman Edmund Burke. In Reflections on the Revolution in France, Burke argued that inherited property and hereditary titles of nobility bestow an inertness and benevolence upon their ungrasping recipients:

Nothing is a due and adequate representation of a state that does not represent its ability as well as its property. But as ability is a vigorous and active principle, and as property is sluggish, inert, and timid, it never can be safe from the invasion of ability unless it be, out of all proportion, predominant in the representation. It must be represented, too, in great masses of accumulation, or it is not rightly protected. The characteristic essence of property, formed out of the combined principles of its acquisition and conservation, is to be unequal…. The power of perpetuating our property in our families is one of the most valuable and interesting circumstances belonging to it, and that which tends the most to the perpetuation of society itself. It makes our weakness subservient to our virtue, it grafts benevolence even upon avarice. The possessors of family wealth, and of the distinction which attends hereditary possession (as most concerned in it), are the natural securities for this transmission. With us the House of Peers is formed upon this principle. It is wholly composed of hereditary property and hereditary distinction, and made, therefore, the third of the legislature and, in the last event, the sole judge of all property in all its subdivisions.

Thus, a Tory conservative could argue that pre-modern societies unintentionally placed power in the hands of those who did not ask for it, whereas modern societies idealize empowering those who do not seek power, while actually empowering those who do seek it. Rene Girard theorized that kingship evolved out of sacrificial victimhood. According to this theory, sacrificial victims were pre-selected to be killed in case of misfortune, and in the meantime were given kingly responsibilities. While such a practice was likely not universal, archeological evidence, most notably from the “bog bodies” of Ireland, suggests that the phenomenon was real. Another interesting historical example of “Tolkienic heroism in practice but not in theory” would be how the Early Church sometimes bestowed the bishop’s staff upon unwilling individuals like St. Augustine and St. John Chrysostom.

These are interesting examples, because hereditary parliaments, sacrificial kingships, and compulsory bishoprics are all things that are basically unthinkable today. It should surprise you that the spread of the Tolkienic Ideal has coincided with its becoming unthinkable in practice. Which is the cause, and which is the effect? And what should we make of this?

One possible explanation relies on the distinction between stated and revealed preferences. Societies have become more democratic over the last few centuries. People think they want a ruler who isn’t too keen on power. But when the populace actually encounters someone like “low energy” Jeb Bush, they don’t vote for him. And in the corporate world, high-energy employees don’t want to work for a company whose CEO feels uncomfortable with power. For example, Jack Dorsey’s hands-off approach at Twitter made it a less desirable employment destination for top engineers, who voted with their feet to work under Mark Zuckerberg’s more energetic leadership at Facebook.

Another possibility is that modern society has more competition and complexity. Complex societies need more competent leadership, and competence depends on being actually passionate for the job. The competition and warfare between nation-states favored countries with energetic, passionate leaders like Napoleon, Gladstone, Disraeli, and Bismarck. Hereditary monarchs tended to be bumbling and incompetent — lovable losers at best. In simple, zero-sum tribal societies, great leaders were those with the best intentions. In complex countries or companies, especially those facing internal discord or external competition, people care less about character and more about winning. Whereas modern democracy was the explanatory variable in the previous explanation, modern meritocracy is the causal factor for this hypothesis. Political competition and polarization in Britain attracted the impressive likes of Gladstone and Disraeli into politics. Internal competition for promotions and leadership roles at Twitter meant that, given Dorsey’s hands-off approach to moderation and everything else at the company, the vacuum was quickly filled by power-loving staff.

A third possibility is that in contemporary cultures, people who do not want to be burdened by power are able to more easily avoid it. This hypothesis has two sub-hypotheses. Firstly, many people who sought power in the past possibly did not actually want the power, but rather the access to more women, the fineries of life, and shiny metal objects. In the modern era, these perks of power can be attained in other ways, without the stresses of power. Jeff Bezos could have more power if he had taken over Twitter or taken an active interest in running The Washington Post. But he seems happier with a new yacht, a new wife, and a newly impressive physique. Secondly, for those who do have power but don’t want it, abdication is less shameful and safer. We lack a way to force or shame people into being leaders if they don’t want to be, leading to uniquely modern stories like those of King Edward VIII, Princess Mako, and Prince Harry. And when Victoria and Elizabeth II gradually reduced the British Crown’s political influence over their tenures, they were rewarded with admiration. It’s also safer to abdicate today. Pope Emeritus Benedict has lived for nearly a decade after retirement; at the time of Pope Alexander VI, such survival would have been rather improbable.

I believe all the above explanations are at least partially true, but there is another possibility, which flips the cause-and-effect relationship. What if, partly due to the spread of the ideal of the Tolkienic Hero, the nature of power has itself changed? According to this hypothesis, the Tolkienic Hero ideal has encouraged us to shift power towards bureaucratic procedures and memetic processes, in the hope of preventing the emergence of Tolkienic Villains. This shift precludes people as individuals from holding power, which prevents the emergence of heroes, Tolkienic or otherwise.

Where power is bureaucratic, it offers different pleasures than where power is personal. Personal power brings the joy of experimentation and adventure. Bureaucratic power offers only the pleasure of participating in power, of acting as an invisible cog in a machine. The bureaucratic positions which participate in power attract the type of people who want to participate in power but who also want to avoid the self-perception of power, because they adhere to the Tolkienic view of power. These bureaucratic organizations also attract empty suits who want the prestige of formerly-powerful offices, without the glories and burdens of power itself. In stark contrast to Gladstone and Disraeli, the short-lived PM Liz Truss reportedly left Downing Street with the words, “I'm relieved it's all over... at least I've been Prime Minister.”

Where power is memetic, it likewise offers a different experience to those it recruits to participate in it, catering to different tastes. Here, power does not seem like power at all, but more like influence. You “like” and “share,” but you don’t feel like you’re in charge. Your role in wielding power is very visible, but feels virtual, fake, and spontaneous, so you need not worry that you have become a Tolkienic Villain.

However, these two modern (perhaps post-modern) systems for accommodating the Tolkienic critique of acquisitive power are very different from the older pre-modern forms of bestowed power. Most importantly, these systems are fundamentally impersonal. Bad individuals, acting as bad individuals, are not in charge. Good individuals, acting as good individuals, are not in charge. People are not in charge.

What does all this imply about the Tolkienic heroic ideal? It means that Tolkien, while he recognized an underappreciated feature of pre-modern power structures, missed the more important question facing modern and post-modern societies. We should not be asking whether power should go to people who want it, or to people who don’t want it, because neither is exactly on the menu today. Rather, we should ask whether power should go to people in the first place. Having answered and acted upon this more existential question, perhaps in the future we will be able to resolve the lesser questions raised by Tolkien’s work.

Detroit Lions coach Dan Campbell was recently asked how he turned around a franchise that’s been losing for decades.

Look, I wish I could put my finger on exactly what it was, but I can’t do that. All I know is — we stayed the course of what we believe in….

Over the last year and a half, it’s been like — not that I’ve ever been in prison — you’re in prison, and you’re plotting your escape. And you know how it’s going to work; you got it all planned out. But you got to have patience; there’s a certain time for everything to get to where it needs to be for you to make your break.

And we made our break. And you gotta go through the sewer now. So we’re going through the sewer now. But you’re going to come out the other end, and you’re gonna be free, and it’s gonna be good, and you know the life you’re gonna live. And that’s kinda where we got to. We’re not all the way there yet, but we stayed true to what we are.

And the guys never wavered — that’s the most important thing. We got really good guys on this team. They believed, and they came back every Wednesday, and just kept going. They just went back to work, back to work. So listen, there’s no magic pill, secret — we just kept going.

Dan Campbell, Lion King