Coding 2020-12-19

By Max Woerner Chase

I didn't touch pip today, and I don't want to force myself through it, so I think I'll finally start talking about the side project I was putting in some time on.

I found out about transducers a few months ago from Hillel Wayne's newsletter and twitter. I found it extremely hard to reconstruct some aspects of the Discourseā„¢ several years down the line (in particular, I don't think I've tracked down any specific examples of Rich Hickey calling something "not transducers"), but the basic concept seems pretty intuitive to me by now. Consider a section of code that operates on discrete pieces of data one at a time. We can reify the operation into a function that takes a value representing a state or "accumulator", and one item to operate on, and returns a new value for the accumulator, that can then be passed back in. Such a function is called a "reducing function". "Transducers" is the new name for "reducing function transformers", functions that take a reducing function, and return a new one that does some additional operation to the input before passing it to the wrapped function. The contracts of transducers in Clojure are somewhat more complex, to handle early termination and resource cleanup.

(People who know Haskell better than I do: apparently this is a special or "perverse" case of something called a "lens".)

I'm going to talk now about why anyone went to all of this trouble, and to do that, I'm going to switch programming languages, since my Lisp is extremely rusty. (Nothing to do with any of the three projects I just found trying to combine Rust and Lisp.)

When functions like map were added to Python, they took lists, and returned lists. Since Python is eagerly evaluated, compared to languages like Haskell, this meant that both the input and the output lists had to exist in memory simultaneously, but so did any intermediate lists. These operations would build up each intermediate list in memory, then discard it. As such the memory consumed by such an operation would be several times that of the only lists that the user specifically cared about: the input and the output.

Various additions and reforms were made to the standard library to combat this. The builtin functions were given the ability to accept iterable values, which could produce values one-at-a-time, but they could not change their output type until the Python 3 compatibility break. Until then, Python 2.3 and onward had the itertools library, which provided versions of builtin functions that return iterable values, and still provides additional functions for working with iterables, that did not make it into the builtins module.

Until Python 3, we see this kind of tabular relation, where one axis is what we're doing to the data, and the other axis is how we're passing data in and out. map corresponds to imap, filter corresponds to ifilter, and so on. With Python 3, the standard library took the general approach of using iterables where possible; if the old behavior was desired, the output could be wrapped in a call to list. So, the duplication of effort was resolved. Except. Several years later, the async and await keywords were added to Python 3.5. In particular, the async for statement was added, which operates over asynchronous iterables. There have been efforts to make async versions of itertools functions, and I can't say for sure that the builtin functions won't somehow adapt to support both kinds of iterable, but any such efforts are not guaranteed to extend cleanly to third-party libraries, and the more of these functions we care about, the more effort it is to get every version of them up to the same level of functionality, the more raw code there is, and thefore the greater risk of bugs. With functions written like map and filter, we're stuck with the tabular relation between what we're doing to the data and how we're passing data in and out.

What if we could abstract those concerns apart, so that, if we have m stream operations, and n stream sources, we have m + n functions, rather than m * n? That's the motivation for rewriting these functions as reducing functions! That way, we can have "a reducing function that maps" and "a function that calls a reducing function in a for loop" and recover Python 3's map. More interestingly, if the interfaces involved are solid enough, two different projects can define these functions independently, and another project could put them together, and have it just work.

Now, that motivation is for reducing functions. It doesn't touch on transducers. The idea of transducers is to make it easier to write a desired reducing function, by applying a sequence of transformations to the input data. That way, the function to drive the reducer only needs to be called once, and the data doesn't have to constantly round-trip through different formats. It's the difference between the_same_driver_every_time(reducer_1, the_same_driver_every_time(reducer_2, source)) and the_same_driver_every_time(transducer_1(transducer_2(base_reducer)), source). The difference can add up as you add more operations.

So, transducers seem pretty cool to me. On their own, they haven't yet convinced me to pick up Clojure, so I've been looking into transducers implementations for Python. There's an existing project, but it hasn't been updated in years, and the code style is... not what I would write.

("Oh boy, here we go again.")

There are many decisions I could disagree with, but the biggest issue for me is the lack of type annotations. I've had trouble in the past articulating why I think people should use type analysis tools for Python, because the gains in documentation quality and confidence were just so, so obvious to me once I tried it. Now, typed transducers (ignoring the Haskellers going "They're lenses!") is apparently a contentious topic. For my part, I find the argument of "the type of the accumulator can change between steps, so you can't assign it a static type" unpersuasive. Just use a union. ("Sum!" "Enum!") It's true that using unions obscures runtime relations between input and output types, but I don't know where the implicit requirement that the static types have such detailed knowledge of the runtime relations comes from. What would the type system even do with that information? Unless the length of the input is known statically (which seems like a dubious assumption when potentially infinite streams are involved), the final value has to collapse into a union anyway, and if it is known statically, then we're getting into dependent types.

To put it another way, this sounds to me like rejecting typing systems in general, because there exist typing systems that cannot handle function definitions like double: Int -> EvenInt and increment: EvenInt -> OddInt; increment: OddInt -> EvenInt.

All this is to say that I'm not going to let Rich Hickey's opinions stop me from trying to add type annotations to the python-transducers package. I will let the code itself stop me, though. In my efforts, I ran across code that I couldn't properly annotate because, considering what it actually does, I couldn't convince myself it was correct. As a result of this, I attempted a from-scratch rewrite with types from the beginning, rather than forced in afterwards. This effort failed, because I hadn't looked at the provided reducing functions, some of which do things that surprised me, and which defeated my first attempt at typing.

After I had some time to think, I began my second rewrite, in which I used my intuitions about types to guide changes to the interfaces. As such, even though the code I have fulfills the same goals as transducers, as I understand them, I actually wouldn't mind if anyone looked at them and said "those aren't transducers!" There are obvious differences, that exist for possibly-subtle reasons. This is entirely hypothetical, for now, because I'm not currently in the habit of shoving out half-finished projects that aren't even used anywhere. (Do I want to do anything about my current slate of half-finished, unused projects? Eeehhh, effort.)

Maybe later I'll go into more detail about the code I actually wrote, but first I'd like to confirm that it's usable, and that, um, has kind of a stumbling block, because my rewrites are in the genre of "writing a bunch of code because the structure of the code provides an intellectual challenge". I'm not writing a standard library, or any generally-useful extensions. I don't have any projects where I see myself explicitly being generic in how a reducing function is to be called. I've done a bunch of these kinds of solutions-in-search-of-problems over the years, and they're probably educational, but they eventually stop being satisfying, which is why I'm leaning towards putting this rewrite on the back burner.

Anyway, it's late, and I want to be done with this.

Good night.