calvin mccarter

The file system is a fundamental part of the API provided by your operating system. Yet it’s also getting long in the tooth. Last year The Verge reported that college professors are struggling to explain the concept of files and file systems to students who grew up on smartphones and clouds:

It’s the idea that a modern computer doesn’t just save a file in an infinite expanse; it saves it in the “Downloads” folder, the “Desktop” folder, or the “Documents” folder, all of which live within “This PC,” and each of which might have folders nested within them, too. It’s an idea that’s likely intuitive to any computer user who remembers the floppy disk.

More broadly, directory structure connotes physical placement — the idea that a file stored on a computer is located somewhere on that computer, in a specific and discrete location. That’s a concept that’s always felt obvious to Garland but seems completely alien to her students.

The concept that data is a thing that is stored in a location is not just an artifact from the era of personal computers and slow 56k internet. It is the essence of computation as such. When Alan Turing proposed the Universal Turing Machine, he created an abstract mathematical model of computation, not a physical object. The API of a Turing Machine (its tape) is equivalent to the API of the Unix filesystem is equivalent to the API of the Python interpreter.
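
To make the comparison concrete: an ordinary Python file handle already behaves like a tape, with a head that can be moved, read from, and written to, one position at a time. The sketch below is only an illustration, and the file name is a throwaway.

# A file is a tape: seek() moves the head, read() inspects the symbol
# under it, write() overwrites it. ("tape.bin" is just a placeholder.)
with open("tape.bin", "w+b") as tape:
    tape.write(b"\x00" * 16)   # lay down a blank stretch of tape
    tape.seek(3)               # move the head to cell 3
    symbol = tape.read(1)      # read the symbol under the head
    tape.seek(3)               # move the head back to cell 3
    tape.write(b"\x01")        # overwrite it with a new symbol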

To be able to compute is to be able to arbitrarily read, write, and manipulate data. You cannot compute on your iPhone or on Google Drive. The college freshmen who know only the new locked-down APIs have been deprived of the expansive experience of general-purpose computing. They have known only the narrow experience of restricted consumption. To be deprived of this experience is to lack the intuitions that make it easier to learn programming, and the intuitions that make it possible to envision world-changing innovations applying the power of computation.

So why only two cheers for the file system?

First, while operating systems (thinking especially of Unix here) offer a decent API for reading and writing objects, their API for manipulating objects is painfully bad. Even though the Unix file system offers an amazing API for storing and reusing intermediate results, shell scripts are terrible. So people instead write data-processing programs in Python, but this creates a new set of problems: a monolithic Python program tends to keep its intermediate results in memory, giving up exactly what the file system is good at. In data science (especially bioinformatics), storing and reusing intermediate results via the file system saves computing time. It also helps you inspect and debug complex pipelines, and recover (at least partially) when programs are killed by unhandled exceptions.
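
A minimal sketch of what this looks like in practice (the file names and toy computations below are placeholders): each stage writes its result to disk, and a rerun picks up whatever already exists.

import os
import pickle

def cached(path, compute):
    # Reuse an intermediate result from disk if it exists;
    # otherwise compute it, save it, and return it.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Toy two-stage pipeline. If the second stage dies with an unhandled
# exception, stage1.pkl is still on disk: it can be inspected directly,
# and the rerun starts from it instead of from scratch.
stage1 = cached("stage1.pkl", lambda: [x * x for x in range(1000)])
stage2 = cached("stage2.pkl", lambda: sum(stage1))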

Second, I fudged when I called filesystems “the essence of computation as such.” Alonzo Church invented the \(\lambda\)-calculus, which defines the same set of computable functions as Turing’s machines. Functional programming offers an alternate API for specifying programs, one which may be better suited to networks of interacting users. The Urbit operating system and ecosystem are currently being developed on this premise, and I’m extremely excited about their prospects for making the Internet personal and computable again.

Despite these drawbacks, I believe that the Turing Machine paradigm will endure. Scientific simulations, ML model training, and data analytics will continue to grow in importance. These tasks (which Turing himself helped pioneer) are well-suited to filesystem APIs. Packages like Snakemake are helping unite the benefits of Python and the filesystem. In contrast, customizable yet not Turing-complete systems are inappropriate as end-to-end solutions for non-trivial modeling scenarios. You inevitably bump into the pre-ordained limitations of GUIs, configuration files, and SQL queries, so you end up writing code. To paraphrase Greenspun’s 10th rule, any sufficiently complicated declarative language contains an ad hoc, informally-specified, bug-ridden, slow implementation of \(1/\infty\) of a Turing-complete language.
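
To make the Snakemake point concrete: each rule declares its inputs and outputs as files, and a step re-runs only when its output is missing or older than its inputs. A minimal Snakefile might look something like this (the file names and rules are invented for illustration):

rule filter:
    input:
        "data/raw.tsv"
    output:
        "results/filtered.tsv"
    shell:
        "grep -v '^#' {input} > {output}"

rule summarize:
    input:
        "results/filtered.tsv"
    output:
        "results/summary.txt"
    shell:
        "wc -l {input} > {output}"

Running snakemake --cores 1 results/summary.txt rebuilds only the files that are missing or out of date, so intermediate files double as a cache and as something you can inspect when a later step fails.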

Due to the beginner-friendly benefits of imperative programming, and the importance of the filesystem for data-processing tasks, familiarity with the filesystem will remain essential.

#programming

A growing number of programmers have observed that the object-oriented programming (OOP) dogma DRY (“don’t repeat yourself”) is sometimes actually harmful. A number of alternatives have been proposed, all of which involve thinking clearly and carefully. Think about the right level of abstraction at which shared implementation makes sense. Think about avoiding shared logic and ideas, rather than shared code. Think about bounded contexts, and avoid straying across them.

These are all great ideas, but they all require careful thought. Unfortunately, in many situations developers lack the ability (especially in organizations whose code was massively duplicated pre-DRY) or the time (especially in research settings) to think through these delicate questions.

Evolution offers a useful analogy for thinking through this problem in such settings. Genetic evolution enables adaptability through duplication. Quantitative traits are encoded across hundreds or thousands of parallel genetic variants. These polygenic traits are able to evolve much more rapidly than monogenic traits, where a single genetic variant underlies all variation in a population. Sexual reproduction breaks up highly-correlated genetic variants (reducing “linkage disequilibrium”) so that they can evolve independently to improve the fitness of a species. And the code for these traits is not only duplicated within a single individual’s genome: it is also duplicated across all the individuals in a population.

Deep neural networks also evolve via a learning algorithm, such as stochastic gradient descent, and they too rely on duplication to adapt their weights. In recent years, it has been discovered that wide “overparameterized” networks, with more activations than training samples, have better learning dynamics. This extra capacity is not strictly necessary to represent complex functions, but the overcapacity makes it easier for training to move the weights toward such functions.

In software development, when we tolerate duplicated code rather than force shared implementation, each copy can evolve in its natural direction, unencumbered by other use cases. Of course, this approach has a cost: code bloat. Fortunately, evolution offers us a solution: pruning. In biological evolution, natural selection filters out harmful genetic variants. And increasingly, pruning is used in deep learning to speed up inference of over-parameterized networks. Notably, pruning in both evolution and deep learning works very well with even simple strategies. It is often hard to find an accurate neural net, but once you have one, it’s easy to find hidden units that can be discarded. The evolutionary alternative to DRY requires only discipline rather than forethought: do repeat yourself and deduplicate (DRYAD).

DRYAD is a useful approach in R&D as well, at both the individual and the team level. When getting started on a new research project, it’s useful to copy-and-paste old code, recklessly changing it to meet one’s needs. As the project matures, one can readily identify duplicated code and start merging it. Similarly, it is hard for a development team to figure out and agree in advance on which code logic can rely on a shared implementation. Yet, so long as all programmers have the discipline to “clean up after themselves”, looking out for old duplicated code and removing it, code quality can be maintained. Code deduplication (while it may not be fun) is intellectually easy, because it comes with the benefit of hindsight.
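
A toy sketch of this workflow in Python (the functions are invented for illustration): early on, two copy-pasted loaders are free to diverge; later, the piece they genuinely share is obvious in hindsight and easy to factor out.

# Early in the project: two copy-pasted loaders, each free to evolve.
def load_training_rows(path):
    rows = [line.rstrip("\n").split(",") for line in open(path)]
    return [r for r in rows if r[-1] != ""]   # training requires a label column

def load_scoring_rows(path):
    rows = [line.rstrip("\n").split(",") for line in open(path)]
    return rows                               # scoring keeps unlabeled rows too

# Later, with hindsight: only the parsing was truly shared, so only it gets merged.
def read_csv_rows(path):
    return [line.rstrip("\n").split(",") for line in open(path)]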


This was originally published here: https://calvinmccarter.wordpress.com/2022/02/19/mind-blindness-strategy-for-eliciting-latent-knowledge/

#programming

“Multidimensional” vectors in C++ do not behave like matrices (or higher-order equivalents). Something like:

vector< vector<double> > foo(100, vector<double>(20, 0.0));

will not lay out 100*20 doubles contiguously in memory. Only the bookkeeping info for the 100 inner vectors will be laid out contiguously; each inner vector stores its actual data in its own location on the heap. Thus, each inner vector can have its own size.

This can lead to hard-to-catch bugs:

foo.resize(300, vector<double>(30, 1.0));

will leave the first 100 inner vectors with size 20, filled with 0.0 values, while the 200 newly added vectors will have size 30, filled with 1.0 values.


This was originally published here: https://calvinmccarter.wordpress.com/2015/05/19/resizing-nested-c-stl-vectors/

#programming