Coding 2023-07-17
Now that I have some ideas for the types involved in the actual parsing part of the Earley parser, I've got a better handle on visualizing its operation, and I've concluded that I do, in fact, need nullability checking. Which is a shame. Loup Vaillant has an explanation of what that's supposed to solve, but I'm going to try phrasing it in terms of my "trying to teach this stuff to myself" code structure.
Basically... Start with the three operations: scanning, prediction, and completion. Now, ignore scanning, because it inherently operates on two distinct state sets, so it's unproblematic in this context. Going over prediction, it does operate on just one state set, but once a prediction has been done, there is guaranteed to be no more work. Now, completion is a little more subtle, because it's unproblematic if the item being completed was started in an earlier state set, which is guaranteed to be the case... unless the symbol is nullable.
There are a few ways to deal with this:
- Retry completions until nothing changes, making sure to bring in predictions as those are added.
- Determine the set of nullable symbols immediately before parsing.
- Determine the set of nullable symbols incrementally, as rules are added.
I kind of don't want to do things the first way, because I suspect that that would end in some weird redundant loops.
So, I need to either have nullability calculations right before parsing, or incrementally. I want doing it incrementally to be a slam dunk, but I suspect that it's not. Basically, if you add a single rule, the immediate question is whether it renders the left-hand-side symbol nullable, but if it does, then every non-nullable symbol has to be checked for whether making that first symbol nullable makes that other symbol nullable. Jeffrey Kegler's note on the nullable stuff presents a linear-time algorithm for detecting nullables. I'm going to try to walk through what doing it incrementally would look like.
So, there are going to be several auxiliary maps. These maps only need to concern themselves with rules that have no terminal symbols. One map from LHS to rules. One map from RHS symbols to sets of rules (including LHS I think). A set of nonterminals with empty rules. A mapping from every non-terminal in the LHSs to whether that is a nullable symbol, false by default.
At this point, things diverge sharply and I kind of have to wing it. I believe I'll have to re-derive the time bounds of what I'm about to propose, if I want to know what they are.
When adding a rule:
- If there are terminal symbols, ignore it.
- If the rule is empty, add it to the set of empty rules. Actually, this step and data might not be needed.
- If the LHS is already marked nullable, ignore it.
- If the RHS is made up only of nullable symbols, mark the LHS as nullable, and then, iterate over every rule with the LHS in the RHS, to see whether any of those rules now have only nullable symbols in the RHS. If so, mark the associated LHS as nullable and recurse.
I'm honestly missing what anything besides the RHS map and the nullability map is accomplishing in this, so I'm going to see how far I can get with just those.
One thing I note is that, once a symbol has been marked nullable, there's no reason to check any rules with that LHS, so it could be efficient to remove such rules from the RHS sets. Potentially there could be a different way to store the RHS data other than literally, like as sets of symbols not known to be nullable. Then, when a set gets to zero, mark the LHS as nullable and cut down on stuff. You know, stuff.
I think this is enough to get started with tweaking the representation. Looking over it, I'm highly skeptical that what I have in mind is worst-case linear, but it shouldn't lead to any huge degradations in grammars that don't "need" this stuff.
One more pass at trying to make sense of this:
To mark a rule nullable, check the RHS table. It should be safe for the RHS table to contain tables from the LHS to the sets of symbols not known to be nullable. When a symbol is marked nullable, then that means that it needs to be removed from every other entry. For each other RHS symbol, put together a set of LHS symbols to check. If any LHS symbol is associated with the set of just the nullable symbol, then that symbol has to be marked nullable on the next pass. For LHS symbols with other sets, those sets represent RHS symbols that need to be updated; doing so is guaranteed not to produce more nullable symbols.
I'm still not sure about the time bounds, but I think I can get this to make sense, so I'm going to take down some notes on how I need to change up stuff, and then get to bed.
Good night.