Three Dollar Quill

Coding 2022-01-06

Thu 06 January 2022

By Max Woerner Chase

I'm going to sketch out some thoughts on implementing an Earley parser. The tutorial I'm reading is talking in terms of loops, but I'm trying to think about this in terms of reducing a stream of tokens. I'm pretty sure this will work, and it means I can punt on figuring out the outer loop for a bit.

Here's the basic idea: I have a series of rules, and that's bundled up in an object that defines a method like "given a state and a token, return a new state". Then I can just generate the initial state, functools.reduce, and bam, done. ... Probably.

So, because this is an Earley parser, the state is going to be something like "a list of state sets", where a state set is "a set of Earley items", where an Earley item is "a rule, the index of when it started in the state, and partial completion data". For a recogniser, the completion data is just how far you've read, but if you want to parse, I assume it needs to accumulate the previously-parsed items instead. I should... probably read further in the tutorial, I think.

All right, I read over it, and here's where my understanding is:

Loup is talking, at least at first, about extracting the parse data from the information gathered by the recognizer.
What I had in mind would have been very inefficient, because it would have essentially performed eager evaluations that wouldn't succeed.

So, basically, I want to say:

Given a state, can we extract a valid parse tree from it?
Given a parse tree, we can materialize an AST (which would presumably include information like text positions, or anything else helpful), but this means that we can't carry out the actions until we have a parse tree, which means my ideas about discarding the token stream as we process it do not work.
Like, it's possible to write the code to consume a stream, but that's only worth it if it makes the code easier to understand, because it doesn't do anything memory-wise. I mean, I guess it would allow early termination if the token stream is bad and it results in an empty state set.

Ooh, I've got an idea that could be really good or really weird. Suppose we have some kind of stream-recording data structure, so we can then make our functions operate on streams. If we organize the generation of items according to Jeffrey Kegler's open letter (scans, completions, predictions), then I think it should be possible to say "okay, we can view a set as consuming one iterable and producing two, by taking an initial iterator from the previous set and expanding on it, and also producing an iterator of scans for the next set". By interleaving the order of iteration, it "should" be possible to make sure that all possible parses are reached, even if some rules recurse indefinitely (which, why would you do that, but whatever). (Although, if there's an indefinitely recursive rule and a syntax error, then this idea would presumably keep on trying the same stream of nullable rules in the hope that that will somehow repair the parse downstream... So it's almost like, indefinitely recursive rules should trigger once-ish for the initial parse pass, and then they can kind of do whatever when it comes time to perform actions on a tree.)

In any case, I'm going to need to do a bunch of sketching of this stuff on paper, because these concepts don't all fit in my head.

And, I've got some other stuff I want to take care of right now, so I'm going to wrap this up here.

Good night.