All right, my conclusion from thinking about implementing Earley parsing differently is that I really should get a basic working implementation together first: prototype based on a design before messing with the design. So, let's take this slowly. First off, I've got some code that is very loosely based on Spark, so let's take stock of that.
- Token matcher
- Scanner to produce tokens
- First pass at a "Rule" object for the parser, which is really weird from a typing perspective, because it reflects on the annotations of the parser action. That means it has to be covariant in both the arguments and the return value of the function, which I assume actually supporting would involve writing a Mypy plugin. (Also, I think the way I'm setting this up may be a design misstep on my part, but I'd like to see it through and focus on the algorithm, which shouldn't be affected by the bit I'm not sure about.)
- Earley item class
- The beginning of a Parser class, which currently has the bare minimum of data required, with no optimizations, and just enough code for the parse method to be syntactically invalid because I'm not sure how to write the loops.
Anyway, let's take the tutorial slowly.
- "A grammar is a set of rules." Check.
- "non-terminal symbols [...] and terminal symbols" I should write a method to distinguish terminal and non-terminal symbols according to the... somewhat questionable way I'm representing symbols.
- Earley items... There are some obvious convenience methods to add to what I have.
- State sets... I'm going to need to think carefully about what actions to take on state sets, because I want to make sure I can work with them semi-efficiently. So, I've got "a container of Item objects", and it may be the case that a set would be better than a list, so I'll make that change. All right, initialization is done, I think. I have no idea how it relates to the corresponding-ish code in Spark, but whatever.
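One wrinkle with a plain `set`: completion and prediction add items to the state set we're currently iterating over, and Python won't let you mutate a set mid-iteration. A common compromise (sketched here, not necessarily what I'll end up with) is a list for ordered, index-based iteration plus a shadow set for O(1) duplicate checks; items just need to be hashable.

```python
class StateSet:
    # Sketch: the list preserves insertion order so the parser can walk
    # it with a growing index while predict/complete append new items;
    # the shadow set makes duplicate checks O(1).
    def __init__(self) -> None:
        self.items: list = []
        self._seen: set = set()

    def add(self, item) -> bool:
        # Returns True if the item was actually new.
        if item in self._seen:
            return False
        self._seen.add(item)
        self.items.append(item)
        return True
```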
Okay, I think I've written what I need for the outer loop, and some helpful stuff for the inner loop. Let's compare the effects of prediction, scanning, and completion.
- Scanning adds an advanced item to the next set (if a rule's next symbol is terminal)
- Prediction adds a new item to the current set (if a rule's next symbol is non-terminal)
- Completion adds an advanced item to the current set (if a rule is completed, and therefore has no next symbol)
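To make the comparison concrete, here's a minimal recognizer sketch tying the three operations together in one loop. Everything about it (the grammar representation, the tuple-of-symbols items, using plain lists for state sets) is my assumption for illustration, not the code from my actual prototype.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    lhs: str
    rhs: tuple[str, ...]
    dot: int
    start: int

# Toy grammar: Sum -> Sum + NUMBER | NUMBER
GRAMMAR = {"Sum": [("Sum", "+", "NUMBER"), ("NUMBER",)]}

def recognize(tokens: list[str], start: str = "Sum") -> bool:
    # One state set per gap between tokens, plus one at each end.
    sets: list[list[Item]] = [[] for _ in range(len(tokens) + 1)]

    def add(i: int, item: Item) -> None:
        if item not in sets[i]:
            sets[i].append(item)

    for rhs in GRAMMAR[start]:
        add(0, Item(start, rhs, 0, 0))

    for i in range(len(tokens) + 1):
        j = 0
        while j < len(sets[i]):          # sets[i] grows as we iterate
            item = sets[i][j]
            j += 1
            if item.dot == len(item.rhs):
                # Completion: advance every item waiting on item.lhs.
                for waiting in list(sets[item.start]):
                    if (waiting.dot < len(waiting.rhs)
                            and waiting.rhs[waiting.dot] == item.lhs):
                        add(i, Item(waiting.lhs, waiting.rhs,
                                    waiting.dot + 1, waiting.start))
            elif item.rhs[item.dot] in GRAMMAR:
                # Prediction: start each rule for the expected non-terminal.
                for rhs in GRAMMAR[item.rhs[item.dot]]:
                    add(i, Item(item.rhs[item.dot], rhs, 0, i))
            elif i < len(tokens) and tokens[i] == item.rhs[item.dot]:
                # Scanning: move the advanced item into the next set.
                add(i + 1, Item(item.lhs, item.rhs, item.dot + 1, item.start))

    return any(it.lhs == start and it.dot == len(it.rhs) and it.start == 0
               for it in sets[len(tokens)])
```

Note that this version interleaves the three operations as it walks each set, rather than strictly sequencing them, which is exactly the question I poke at below.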
Thinking about how these relate, prediction is the only process which adds a new starting index. I'm not sure if that is relevant to anything.
My big concern with these is that, according to the open letter I linked yesterday, the different types of operation can be strictly sequenced, but I don't understand how that's possible. Let's set aside scans, since those have much easier behavior: you can just iterate through one set and perform all scans into the next set.
So, doing completions and then predictions works if we know that prediction will never add a completed item. If we allow empty rules, then it currently appears to me that the easiest way to guarantee that prediction doesn't add completed items would be to essentially run any possible completions "in-line" as part of prediction.
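One way to realize that "in-line completion" idea (this is essentially Aycock and Horspool's fix for nullable non-terminals, as far as I understand it): when prediction expects a symbol that can derive the empty string, also immediately add the predicting item with its dot advanced, instead of relying on a later completion step. A hypothetical sketch, with items as named tuples and state sets as plain sets:

```python
from typing import NamedTuple

class Item(NamedTuple):
    lhs: str
    rhs: tuple
    dot: int
    start: int

def predict(item: Item, i: int, state_sets: list, grammar: dict,
            nullable: set) -> None:
    # `nullable` is assumed precomputed: the non-terminals that can
    # derive the empty string.
    sym = item.rhs[item.dot]
    for rhs in grammar[sym]:
        state_sets[i].add(Item(sym, rhs, 0, i))      # the usual prediction
    if sym in nullable:
        # In-line completion: sym can vanish, so advance past it now.
        state_sets[i].add(Item(item.lhs, item.rhs, item.dot + 1, item.start))
```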
It's getting late, so I'm going to take "I thought about this and wrote some helper functions" as a win, and get ready for bed.