Three Dollar Quill

Coding 2022-01-14

Fri 14 January 2022

By Max Woerner Chase

The good news is, I managed to construct a successful parse. The not-as-good news is, I had to fix some bugs before it went through. The worse news is, while almost all of the bugs made sense in retrospect, I think I'm going to need to do a close reading of the entire tutorial to understand where I went wrong with the final bug.

Easy stuff:

"Oh, I forgot to include the token value when I do a scan." This is totally my fault, since it's a failure to fulfill a requirement that I came up with.
"I missed the final predict/complete step." This is a bit more interesting, because it reveals something that I glossed over about Earley parsers: in a successful parse, there will be one more state set than there are tokens, because the state sets border the tokens on either side.

And now that I say that, I understand what went wrong with the last bug. This is a doozy, so let's see whether I can explain it.

All right, so my parse generators rely on several recursive functions. The function that I broke takes an item and an "end index" to a state set, and, looking in the range between the end index, and the item's "start index", yields all possible sub-divisions of that range into the symbols required by the rule. This function calls itself recursively for one of two reasons:

If the item has a non-None value as its last node, then that value corresponds to a terminal symbol, so the node is skipped over: the node is discarded, and the end index is decremented so we check the previous state set.
Otherwise, the code checks all items at the end index, to see if it can find a completed item that matches the non-terminal symbol required by the item's rule at this point. If it does, then we can get the rest of the division by... discarding the node and choosing an appropriate value for the end index.

And that last bit is where I tripped up. I forgot that the state set indices go between tokens, so I tried to "skip over" an index, because the last token used by one symbol is the token before the first token used by the next symbol. This doesn't work, because "the ending of one symbol" and "the beginning of the next symbol" correspond to the same state set index. So I tried to "compensate" for something when I should have left well enough alone.

Next up, now that I sort of understand things, I see that I absolutely did not account for empty rules. (I first verified this experimentally.) So, before I can properly move on, I'll want to fix that.

Let's see what fixing that looks like.

Well, first off, I need a constructor helper for Parser objects.

...

I got distracted and did the fix. This basically consisted of implementing the algorithm that Jeffrey Kegler provided for detecting nullable symbols, undoing some premature optimizations I had done that interfered with actually using the data on nullability, fixing a code divergence because apparently that happens nearly immediately after I copy and paste some code less than ten lines down from its origin, and pondering whether I have it right, because the fixed code has more items in its state set than the code from the tutorial.

I think this is okay, because if I'm understanding things correctly, the tutorial is showing an intermediate stage of parsing. I could be totally wrong, though.

Hm, interesting question that just occurred to me: is my current parse generation code able to properly handle nullable sequences that can go arbitrarily deep? I'm skeptical, so maybe just don't do that.

I think it might actually act really pathological. But it's getting late, and I don't want to think about that too hard.

Since I've gotten a successful parse, maybe I should move on for now, and come back to this code later.

Good night.