
Observant people often wonder why mirrors appear to flip left and right, but not up and down. If you search for scientific explanations of this, you’ll read that mirrors don’t flip left and right, but rather forward and backward. This is correct as far as it goes, but it’s not fully satisfactory. Why do we instinctively feel like mirrors flip left and right? Why do we not instinctively feel like mirrors flip up and down?

The answer is that left and right are relative directions, while up and down are not. If you turn yourself upside down, you still consider “up” to be the direction opposite gravity: you heed your inner ear, not your eyes. There is no non-visual reference for left or right, so as you turn in either direction, you redefine their meanings. If a mirror were to “flip up and down,” it would have to depict you upside down, with your feet floating above your head, defying the absolute direction of Earth’s gravitational field.

Not only do we define left and right in relative terms (unlike up and down), we also do something strange with mirrors: we anthropomorphize the image. What we mean when we say that a mirror flips left and right is that if we cloned ourselves and then walked around 180 degrees to face ourselves, like the image in the mirror does, we would appear left-right-flipped compared to the image. It is our imagination of the image as our spun-around-and-facing-us clone that causes the mirror to defy our expectations.

The full answer to the mirror-flipping question involves both physics (how light reflection works) and psychology (how we think about directions and mirrors).

The softmax function \(\sigma(\mathbf{z})\) maps vectors to the probability simplex, making it a useful and ubiquitous function in machine learning. Formally, for \(\mathbf{z} \in \mathbb{R}^{D}\), \[[\sigma(\mathbf{z})]_i = \frac{e^{\mathbf{z}_i}}{\sum^D_{d=1}e^{\mathbf{z}_d}}.\] To get an intuition for its properties, we'll depict the effect of the softmax function for \(D=2\). First, for inputs coming from an equally-spaced grid (depicted in red), we plot the softmax outputs (depicted in green).

The output points all lie on the line segment connecting \((x=0, y=1)\) and \((x=1, y=0)\). This is because, given 2-dimensional inputs, the softmax maps to the 1-dimensional simplex: the set of all points in \(\mathbb{R}^2\) with nonnegative coordinates that sum to 1. Thus, the softmax function is not one-to-one (injective): multiple inputs map to the same output. For example, all points along the line \(y=x\) are mapped to \((0.5, 0.5)\).
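Here is a minimal NumPy sketch of the softmax for concreteness (the function name and the max-subtraction trick for numerical stability are my own, not part of the definition above):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtracting the max avoids overflow; see below
    return e / e.sum()

# Non-injectivity: every point on the line y = x maps to (0.5, 0.5).
print(softmax([0.0, 0.0]))  # [0.5 0.5]
print(softmax([3.0, 3.0]))  # [0.5 0.5]
```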

More interestingly, for any output (green dot), the inputs (red dots) that map to it are all located along 45-degree lines, i.e. lines parallel to the line \(y=x\). This is not an accident: the softmax function is translation-invariant. Formally, for any \(\alpha \in \mathbb{R}\): \[\sigma(\mathbf{z}) = \sigma(\mathbf{z} + \alpha \mathbf{1}),\] where \(\mathbf{1}\) is the all-ones vector.
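As a quick numerical check, reusing the `softmax` sketch above (the particular values are arbitrary):

```python
z = np.array([0.3, -1.2])
alpha = 5.0
print(np.allclose(softmax(z), softmax(z + alpha)))  # True
```

Incidentally, this invariance is exactly why the max-subtraction trick in the sketch above leaves the output unchanged.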

Next, we plot the effect of the softmax function on points already lying within the probability simplex. For inputs (in red) equally spaced between \((0, 1)\) and \((1, 0)\), we plot their softmax outputs (in green). What this reveals is that the softmax shrinks inputs towards the point \((0.5, 0.5)\). Except at the point representing the uniform distribution over all \(D\) classes, the softmax function is not idempotent: applying it twice does not give the same result as applying it once.

We can illustrate this further by plotting equally-spaced softmax outputs (as colored points in \(\mathbb{R}^2\)), and then drawing curves with matching colors to depict the set of inputs which map to a particular output. Due to translation invariance, these curves will actually be lines, as depicted below.

Notice that it becomes increasingly difficult to produce high-confidence outputs. For a particular softmax output \((c, 1-c)\), the corresponding set of inputs is the set of all points of the form \((x,\, x + \log((1-c)/c))\). Even though the outputs are equally-spaced (0.1, 0.9), (0.2, 0.8), and so on, the inputs are increasingly distant when the probability mass is concentrated on either \(x\) or \(y\).
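To make this concrete, here is the required input gap \(\log((1-c)/c)\) for a few equally-spaced outputs (a small sketch, again assuming NumPy):

```python
for c in [0.5, 0.6, 0.7, 0.8, 0.9, 0.99]:
    print(f"c={c}: input gap = {np.log((1 - c) / c):+.3f}")
# The gap grows without bound as c approaches 0 or 1.
```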

Suppose that we wanted to make the softmax function a bit more idempotent-ish. One way to do this is by using the temperature-annealed softmax (Hinton et al., 2015): \[[\sigma(\mathbf{z})]_i = \frac{e^{\mathbf{z}_i/\tau}}{\sum^D_{d=1}e^{\mathbf{z}_d/\tau}},\] where \(\tau\) can be thought of as a temperature. Typically, one uses a cooled-down softmax with \(\tau < 1\), which leads to greater certainty in the output predictions. This is especially useful for inducing a loss function that focuses on the small number of hard examples, rather than on the much larger number of easy examples (Wang & Liu, 2021).
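A sketch of the temperature-annealed variant, extending the `softmax` sketch above:

```python
def softmax_temp(z, tau=1.0):
    return softmax(np.asarray(z, dtype=float) / tau)

print(softmax_temp([1.0, 0.0], tau=1.0))  # ~[0.731 0.269]
print(softmax_temp([1.0, 0.0], tau=0.5))  # ~[0.881 0.119] -- sharper
```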

To visualize this effect, we repeat the previous plot, but now with \(\tau=0.5\). We see that we can now “get by” with smaller changes in the inputs to the softmax. But the temperature-annealed softmax is still translation-invariant.

Code for these plots is here: https://gist.github.com/calvinmccarter/cae597d89722aae9d8864b39ca6b7ba5

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).

Wang, Feng, and Huaping Liu. “Understanding the behaviour of contrastive loss.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021).

Back in 2020-2021 I worked with Robin Hanson et al. on a “grabby aliens” model which offered an explanation for why humans are so early in the history of the universe. I recently had the chance to watch Robin’s 2021 presentation to the Foresight Institute on grabby civilizations (GCs). In the Q&A session, Adam Brown offered some questions and comments which suggest that the GC model actually works best as a model of false vacuum decay bubbles, which would make the universe increasingly uninhabitable. First, these bubbles, unlike GCs, are naturally devoid of observers, which avoids the self-indication assumption problem of “Why are we non-GC observers instead of GC observers?” Second, these bubbles naturally propagate at the speed of light, so they do not require the development of multi-galaxy civilizational expansion technology. Without further ado, what follows is a minimally-edited transcript of the relevant Q&A between Adam Brown and Robin Hanson:

I have a question, but before I get to the question, I want a clarification first. So in your model, we are us, we're not the grabby aliens, despite in your model the grabby aliens vastly outnumber us. So that seems like you're just hypothesizing that.

The key point is, here we are now. We could potentially become aliens for others. But we are not yet grabby; we have not yet reached the stage where we are expanding rapidly and nothing can stop us. But the key assumption is that we might become grabby; and if we do, that would happen within, say, 10 million years. And that means that our date now is close to a date of origin of grabby aliens, because if it happens it would happen soon. That's the key assumption.

But in your model, almost all of the sentient beings that exist are the much larger grabby aliens rather than the earlier civilizations such as ours. So you need to explain why we're not one of those.

We're trying to be agnostic about the ratio between grabby and non-grabby aliens. So there could potentially be all these quiet aliens out there: vast numbers unknown, density unknown. It's hard to say much about them because you can't see them. We're focused on the grabby ones because we can say things about them, but we're agnostic about the relevant ratio there.

But the grabby aliens are occupying large fractions of the universe, so unless they're not sentient, it sounds like they should be more numerous. So of the sentient beings who exist, we're extremely atypical in your model, which seems to be a point against it.

That wouldn't be a crazy conclusion to draw, but you would have to make further assumptions about observers. We're not in our analysis making assumptions about observers. We're not saying that grabby aliens are observers, or that they will produce a density of observers. We're not saying anything about them other than that they make a visible difference, and then you would see them. That's all we're saying.

Let me move on to my next question then, which is perhaps betraying my day job. Let me present an alternative theory for a resolution of the Fermi Paradox that sounds very different from yours but I think ultimately is sort of quite similar. In your grabby alien model, the reason we don't see them is that they excise a large fraction of their future light-cones. Another model is: when civilizations get sufficiently advanced, they run stupid science experiments, and those stupid science experiments cause vacuum decay in the Higgs sector, for example. In this case, there will be a vacuum bubble that will expand out at the speed of light, and really excise the future light-cone of those advanced civilizations. That theory actually has a lot in common with your theory, in the sense that both result in advanced civilizations excising their future light-cones. And all of the evidence that counts in favor of your theory also counts in favor of that theory: all the evidence in terms of the “N-steps in evolution”, the “why are we so early” questions. And it has the additional advantages that you don't need to explain why it expands at the speed of light (because that's just input from theoretical physics that it'll definitely expand at the speed of light if you make a new bubble of a vacuum) and also has the advantage that you don't need to explain why we don't live in one of those bubbles because there's nothing alive in those bubbles — you've destroyed the Higgs vacuum. So there seems to be some sort of commonality between the vacuum decay bubble literature and what you're saying. And it'll be interesting to look back at the bubble nucleation literature in the light of your comments, and see whether they bear on that.

#futurism

The file system is a fundamental part of the API provided by your operating system. Yet it’s also getting long in the tooth. Last year The Verge reported that college professors are struggling to explain the concept of files and file systems to students who grew up on smartphones and clouds:

It’s the idea that a modern computer doesn’t just save a file in an infinite expanse; it saves it in the “Downloads” folder, the “Desktop” folder, or the “Documents” folder, all of which live within “This PC,” and each of which might have folders nested within them, too. It’s an idea that’s likely intuitive to any computer user who remembers the floppy disk.

More broadly, directory structure connotes physical placement — the idea that a file stored on a computer is located somewhere on that computer, in a specific and discrete location. That’s a concept that’s always felt obvious to Garland but seems completely alien to her students.

The concept that data is a thing that is stored in a location is not just an artifact from the era of personal computers and slow 56k internet. It is the essence of computation as such. When Alan Turing proposed the Universal Turing Machine, he created an abstract mathematical model of computation, not a physical object. The API of a Turing Machine (its tape) is equivalent to the API of the Unix filesystem is equivalent to the API of the Python interpreter.

To be able to compute is to be able to arbitrarily read, write, and manipulate data. You cannot compute on your iPhone or on Google Drive. The college freshmen who know only the new locked-down APIs have been deprived of the expansive experience of general-purpose computing. They have known only the narrow experience of restricted consumption. To be deprived of this experience is to lack the intuitions that make it easier to learn programming, and the intuitions that make it possible to envision world-changing innovations applying the power of computation.

So why only two cheers for the file system?

First, while operating systems (thinking especially of Unix here) offer a decent API for reading and writing objects, their API for manipulating objects is painfully bad. Even though the Unix file system offers an amazing medium for storing and reusing intermediate results, shell scripts are terrible, so people write their data-processing programs in Python instead, and this creates a new set of problems. In data science (especially bioinformatics), storing and reusing intermediate results via the file system helps you save computing time. It also helps you inspect and debug complex pipelines, and recover (partially) when programs are killed by unhandled exceptions.
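As an illustration of that workflow, here is a hedged sketch of checkpointing an intermediate result to disk; the file name and `expensive_step` are hypothetical:

```python
import os
import pickle

def cached(path, compute):
    """Reuse a result saved by a previous run, if it exists on the file system."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()  # recompute only on a cache miss
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# intermediate = cached("step1.pkl", expensive_step)
```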

Second, I fudged in calling the filesystem “the essence of computation as such.” Alonzo Church invented the \(\lambda\)-calculus, which defines the same set of computable functions as Turing’s machines. Functional programming offers an alternate API for specifying programs, one which may be better suited to networks of interacting users. The Urbit operating system and ecosystem are currently being developed on this premise, and I’m extremely excited about their prospects for making the Internet personal and computable again.

Despite these drawbacks, I believe that the Turing Machine paradigm will endure. Scientific simulations, ML model training, and data analytics will continue to grow in importance. These tasks (which Turing himself helped pioneer) are well-suited to filesystem APIs. Packages like Snakemake are helping unite the benefits of Python and the filesystem. In contrast, customizable yet not Turing-complete systems are inappropriate as end-to-end solutions for non-trivial modeling scenarios. You inevitably bump into the pre-ordained limitations of GUIs, configuration files, and SQL queries, so you end up writing code. To paraphrase Greenspun’s 10th rule, any sufficiently complicated declarative language contains an ad hoc, informally-specified, bug-ridden, slow implementation of \(1/\infty\) of a Turing-complete language.

Due to the beginner-friendly benefits of imperative programming, and the importance of the filesystem for data-processing tasks, familiarity with the filesystem will remain essential.

#programming

An aligned artificial intelligence is safe, but that's not what intelligence is for.

If you are the dealer
I'm out of the game
If you are the healer
It means I'm broken and lame
If thine is the glory then
Mine must be the shame
    ~ You Want It Darker, Leonard Cohen

The most wonderful aspect of the universal scheme of things is the action of free beings under divine guidance.
    ~ Considerations on France, Joseph de Maistre

Give me the liberty to know, to utter, and to argue freely according to conscience, above all liberties.
    ~ Areopagitica, John Milton

#futurism

Suppose you have \(K\) multivariate Gaussian distributions, each of dimensionality \(N\). It turns out that the product of their densities, after normalization, is also a multivariate Gaussian density. What is the computational complexity of computing this product?

Let's first assume that each of the \(K\) Gaussian distributions, indexed by \(k\), is parameterized by its mean \(\mu_k\) and covariance \(\Sigma_k\). The product is proportional to a Gaussian with mean \(\mu\) and covariance \(\Sigma\), where

\[\begin{aligned} \mu =& \Big(\sum^K_{k=1}\Sigma_k^{-1}\Big)^{-1} \Big(\sum^K_{k=1} \Sigma_k^{-1} \mu_k\Big), \\ \Sigma =& \Big(\sum^K_{k=1}\Sigma_k^{-1}\Big)^{-1}. \end{aligned} \]

Assuming we have memory to store and reuse all intermediate results, we will need to perform \(K+1\) matrix inversions and to solve \(K+1\) linear systems. Thus, the runtime complexity is \(O(KN^3 + KN^2)=O(KN^3)\).
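A minimal NumPy sketch of the covariance-parameterized product (the function name is my own):

```python
import numpy as np

def gaussian_product_cov(mus, Sigmas):
    """Product of K Gaussians N(mu_k, Sigma_k), up to normalization."""
    precisions = [np.linalg.inv(S) for S in Sigmas]  # K inversions: O(K N^3)
    Sigma = np.linalg.inv(sum(precisions))           # 1 more inversion
    mu = Sigma @ sum(P @ m for P, m in zip(precisions, mus))
    return mu, Sigma

mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2 * np.eye(2)]
print(gaussian_product_cov(mus, Sigmas))  # mu = [1/3, 1/3], Sigma = (2/3) I
```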

Now, let's instead assume that each of the \(K\) Gaussian distributions is parameterized by its mean \(\mu_k\) and precision (i.e. inverse covariance) \(\Lambda_k\). The product is proportional to a Gaussian with mean \(\mu\) and precision \(\Lambda\), where

\[\begin{aligned} \mu =& \Big(\sum^K_{k=1}\Lambda_k\Big)^{-1} \Big(\sum^K_{k=1} \Lambda_k \mu_k\Big), \\ \Lambda =& \Big(\sum^K_{k=1}\Lambda_k\Big). \end{aligned} \]

Here, we only need to perform \(K\) matrix-vector products and solve \(1\) linear system, so the runtime complexity is \(O(KN^2 + N^3)\). But this analysis understates the likely speedup from using the precision matrices rather than covariance matrices. Precision matrices are often sparse but with dense inverses; in such cases this latter approach is faster and requires much less memory.
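And the precision-parameterized version, under the same assumptions:

```python
def gaussian_product_prec(mus, Lambdas):
    """Product of K Gaussians with precisions Lambda_k, up to normalization."""
    Lambda = sum(Lambdas)                           # K matrix additions: O(K N^2)
    eta = sum(L @ m for L, m in zip(Lambdas, mus))  # K matrix-vector products
    mu = np.linalg.solve(Lambda, eta)               # a single O(N^3) solve
    return mu, Lambda
```

For sparse precision matrices, one would swap in `scipy.sparse` matrices and a sparse solver, which is where the large memory and runtime savings show up.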

#ML

Last week’s “open thread question” on The Diff was a request for examples of unusually capital-efficient companies. Self-driving startup comma.ai was my selection:

Comma AI is an incredible example of capital discipline. With ~15 engineers and $8.1 million in raised capital, it has managed to win first place in Consumer Reports' rankings of autonomous driving systems. This is partly due to their superior end-to-end learned architecture, yet also due to their superior business model. And these two things are deeply interconnected. The traditional autonomous vehicle is a Rube Goldberg machine of interconnected hardware, software, and ML subsystems; it requires tons of engineers and scientists ($$$) to develop, and cannot generate revenue until everything finally works. By declining to raise vast amounts of money, George Hotz forced himself to focus on figuring out the “sine qua non” of autonomy: an intelligent agent that (like the human brain) doesn't require perfect maps and perfect vision to drive safely.

George Hotz laid it all out in two blog posts:

https://blog.comma.ai/a-100x-investment-part-1/

https://blog.comma.ai/a-100x-investment-part-2/

One thing not mentioned in the above posts, but explained elsewhere, is the capital and labor efficiency enabled by Comma AI's development of OpenPilot as open-source software. Interestingly, before he got distracted and became a Tweep this past week, Hotz was working on a new startup (The Tiny Corp) applying these lessons to AI hardware accelerators: https://tinygrad.org/ . The first generation of AI hardware startups was probably doomed by its failure to understand the interplay between economics and Amdahl's Law. You can't beat Nvidia by accelerating one type of computing operation 10x, unless you also invest as much as Nvidia in all the other operations so that you're roughly as fast as Nvidia on those operations too. Open-source might fix this problem, as Jim Keller recently explained at TSMC's 2022 Forum: https://youtu.be/o70yKYWgtVI?t=693
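To make the Amdahl's Law point concrete, here's a back-of-the-envelope sketch (the fractions are made up for illustration):

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work is accelerated s-fold."""
    return 1.0 / ((1.0 - p) + p / s)

print(amdahl_speedup(0.5, 10.0))  # ~1.8x: 10x on half the ops barely helps
print(amdahl_speedup(0.9, 10.0))  # ~5.3x: still well short of 10x overall
```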


Further discussion:

My sense is that they have built a level 2 or level 3 system.

Their first product (essentially complete) controlled the accelerator / brake, their second product (in progress) also controls the steering wheel, and their future plan is to handle navigation to a destination. Because of this product roadmap, it doesn't really make sense to think of their technology in terms of autonomy levels. And this was a fundamental insight: the framing of “level 2” as such allowed companies to waste time on non-end-to-end systems that could never scale to “level 5.”

They have also done this in a way in which there is no reporting as to the system's usage and successes or failures.

Capital and labor efficiency (keeping the target on your back as small as possible) is a form of regulatory arbitrage.

I also took Hotz leaving comma.ai as a sign that comma.ai is done.

No, he explained that he didn't feel like a good fit for the CEO role as the company continues to expand: https://geohot.github.io//blog/jekyll/update/2022/10/29/the-heroes-journey.html “It’s well within comma’s reach to become a 100M+ revenue consumer electronics company (without raising again!), but I don’t think I’m capable of running a company like that. I’ve always heard it takes different people at different company sizes.”


Bonus example of capital discipline:

I recently found out that SpotHero had 217 employees pre-pandemic, ~100 employees mid-pandemic, and 205 employees as of February 2022. This is quite impressive, compared to Uber with ~30k employees in 2021 and Doordash with ~8600 employees in 2021. A recent article described how it’s bouncing back from the pandemic:

SpotHero's bookings dropped 90% in April 2020 compared to February 2020, just before the pandemic started. … Its revenue in 2021 approximately doubled compared to 2020, with reservations increasing 72% year-over-year. The company hired almost 90 employees last year, bringing its headcount to 205, just a dozen employees short of its pre-pandemic total. … SpotHero, founded in 2011 with $120 million in total venture funding, says it's approaching $1 billion in parking sold this year. It's helped park more than 40 million cars to date.

#capitalism

Why are humans (to varying degrees) altruistic? This question has long fascinated biologists and economists. There are multiple evolutionary models explaining the development of altruism, but they generally boil down to group selection, in which groups rather than individuals are subject to selective pressure. Nature prefers genetically-related groups that cooperate with each other, so individual humans are disposed to behave in ways that reduce their own survival/reproductive chances, if doing so favors the survival/reproduction of their kin. In game-theoretic terms, because life resembles an Iterated Prisoners’ Dilemma, given enough time, the only prisoners left in the game are those who cooperate with each other.

Why are humans (to varying degrees) masochistic? This question primarily fascinates psychiatrists. But if group selection explains altruism, it could also explain masochism. From the perspective of a group, masochism at the individual level is a feature, not a bug. In a Prisoners’ Dilemma game, the players will reach the win-win outcome if they cooperate, but also if they are masochistic. Altruism and masochism are equally good at producing the optimal equilibria! And masochism is easier because it works in situations where cooperation cannot be coordinated.

As far as I can tell (as of 2022) there is no prior behavioral genetics literature explaining masochism through the lens of group selection. The potential connection between altruism and masochism is of interest only to the psychoanalysts, and even they are generally skeptical. For example, Seelig & Rosof (2001) have the following to say:

Even when an author wished to retain a place for normal altruism, as Simons (1987) did in a panel of the American Psychoanalytic Association on psychoanalytic contributions to psychiatric nosology, altruism is regarded as a subcategory of masochism. Simons defines altruism as a normal form of masochism. We believe, however, that it is clinically and heuristically useful to distinguish altruism from masochism.

It would be desirable to ground this hypothesis on the objective basis of behavioral genetics rather than on psychoanalytic speculation. First, we should find out whether altruism and masochism are correlated phenotypically and genetically. Next, we could look for evidence of selection (both purifying and positive selection). Finally, we could look for colocalization of altruism-linked variants and masochism-linked variants.


Seelig, Beth J., and Lisa S. Rosof. “Normal and pathological altruism.” Journal of the American Psychoanalytic Association 49.3 (2001): 933-959. https://doi.org/10.1177/00030651010490031901

Simons, Richard G. “Psychoanalytic contributions to psychiatric nosology: Forms of masochistic behavior.” Journal of the American Psychoanalytic Association 35.3 (1987): 583-608. https://doi.org/10.1177/000306518703500303

In Freda Utley’s fascinating memoir, Odyssey of a Liberal, she wrote about her youthful interest in Machiavelli:

“In my essay on Machiavelli, I argued that there was not really such a disparity as generally supposed between the Florentine’s advice to tyrants, as expressed in his “Prince,” and his eulogy of Republican Virtues in his “Commentaries on Livy” – the Roman classical historian. As I saw it, when fifteen years old, men are usually ready to condone, or even approve, actions taken by their state or country which they condemn when taken by an individual, so that what seemed admirable “virtue” in the Romans was regarded as wickedness in an individual Italian prince. I wish I still had this old essay of mine. All I can now remember is its main argument that Machiavelli’s precepts for Princes – his description of how tyrants maintain their power, which came to be called “Machiavellian,” – was not different in essence to the precepts and practices of the Roman Republic or modern nation states.”

That’s one incisive high school essay!

If certain unchanging principles apply to ancient Romans, Florentine princes, and modern nations, it isn’t a huge leap to believe that these apply to businesses too. On the other hand, corporate Machiavellianism is arguably even less popular than individual Machiavellianism. These two opposing factors help explain why corporations are only Machiavellian to a moderate degree. And your estimate of this degree is probably correlated with your level of disapproval of it.


This was originally published here: https://calvinmccarter.wordpress.com/2022/11/23/the-causes-and-impediments-of-corporate-machiavellianism/

#capitalism

On a weekly “open thread” for The Diff on the subject of layoffs, I wrote:

Although the founders weren't laid off themselves, they founded Health-Ade Kombucha in 2012 while at GSK as it was going through a series of layoffs. Daina Trout and Vanessa Dew were both in sales at GSK, which in the early 2010s was struggling to adjust to competition from generics. GSK notoriously mismanaged the process in 2010 by laying off sales reps just days after promising no layoffs at their sales convention: https://www.cbsnews.com/news/glaxosmithkline-layoffs-follow-promise-of-no-cuts-at-national-sales-meeting/

To improve morale and innovation, GSK started a rotational innovation / leadership program that Daina went through. However, after having this inspirational confidence-building experience, she was sent back to her old job in sales. The rotational program and repeated layoffs were both bad for employee morale by bringing “false dawns” followed by disappointment. Having said that, you might say that GSK’s choices did in fact build confidence and innovation, but only in ways that didn’t benefit GSK itself.

Daina Trout’s interview on the How I Built This podcast is great btw: https://www.npr.org/2020/09/25/916944612/health-ade-kombucha-daina-trout


This was originally published here: https://calvinmccarter.wordpress.com/2022/11/22/on-the-unintended-consequences-of-leadership-training-programs/

#capitalism