Coding 2022-01-10
The linux side of things works a little better now. In this context, "a little better" somehow means "my laptop now has an input mode that has Cyrillic letters and fullwidth spaces".
But anyway. We handled some stuff today that took up part of the afternoon, and now I want to see if I can get a handle on taking the recognizer and getting it to produce parse trees.
Let's see how that looks.
My initial thought is to have it work like "given an index that defaults to -1, generate all valid trees that end at that index". (I'm using Python indexing, so that's how that means what I want it to.) First some context: instead of storing an index for the dot in the items, I'm storing a tuple of Optional[Token]s, which I'm calling nodes. The Forest objects that I'm writing this for store a reference to the type of the topmost node, so things start out as follows:
- In the last index, find all completed items that start at index 0 and have a rule with a left-hand-side of the root node type
- This is a degenerate case at the top level, but if the item has no nodes associated with it, then it's an empty rule that matched an empty token stream. Call the rule's function with no arguments, and yield the result. At the top level, there is no more work to do. (Assuming that rules can't be identical except for their associated function, which I'm not trying to enforce, but is... probably reasonable? This might suggest a tiered mapping structure for storing rules in the parser object.)
- If the node corresponding to the symbol before the dot is a token, then the item was generated by a scan, and the thing to look at is the previous index, for that rule, but with the last node discarded. Technically, we shouldn't really need to look, per se. There's only one possible value of the predecessor to a scanned item, and we know it exists in the previous set, so we don't actually have to find it. I'll think through the precise implications of this later.
- So, the meat of all of this is dealing with a non-terminal symbol before the dot. To look for a non-terminal symbol at a given index, you have to look for all completed items in that index, with a left-hand-side matching that symbol, and a start index greater than the parent's start index. Each such item will produce at least one valid internal parse (unless I'm extremely mistaken), and all such parses will relate identically to the broader parses, if any. Essentially, it's a product between "recursing the algorithm in smaller bounds" and "the rest", where "the rest" is "whatever item has the sub-parse fitting into it, discard that node and look for the resulting item at the start index of the sub tree".
I think this is about what the tutorial was saying, and I've tried to reword it for my own comprehension, but I'm not sure the result is helpful for anyone else. It's basically notes, a little less refined than a rough draft. If anyone is following along with this and trying to understand this for themselves, I would highly recommend taking your own notes along these lines, in whatever way and format makes the most sense to you. Not least because I haven't yet vetted this stuff, so there's a chance some aspect of it is simply wrong.
Anyway, I should get this published and get to bed ASAP.
Good night.