Today, I'd like to do the planning/design for the environment matching in here.
I'm thinking a helper class that combines a phoneme sequence and an index; it can perform one of several checks on demand.
Anyway, the trickiest bit is the automaton design. Here are a few points to keep in mind:
- The automaton corresponding to an empty sequence always matches
- The "earliest" that any state will refer back in the collection of states is to itself
- Python's builtin sequences allow for indexing from the end.
This means that the automaton can be built in a single pass by starting at the end of the match, and if it's done arena-style (I can't stomach the idea of repeatedly hashing the entire state graph), then the start state has index -1, and the state at index 0 is the success state.
States need to be capable of:
- perform a literal match and advance
- check for set membership and advance
- provide two states to check against the current position
- accept any input and provide two indices
- check the gemination status and advance
- check the word boundary status, and provide another state to check against the current position
- Being detectable as the success state before any further processing occurs
Specific cases to check for: optionals or wildcards on their own should succeed immediately; this means that my "single-pass" idea might not pan out, unless the matching process is broken up very carefully:
Given an initial set of states (from the starting conditions or the previous step), generate a set of states that advance the string. If the success state is encountered in this process, the match succeeds. The following three states must be processed as part of this stage:
- The branch state must give its two targets as candidates
- The wildcard state must give its successor as a candidate, and itself as a thing to perform a match
- The word boundary state must give its successor as a candidate, but only if the word boundary condition evaluates to true.
This can be viewed as a combination of "does this state have transitions that consume a phoneme?" and "what transitions does this state have that do not consume a phoneme?"
The stage can then be broken up into:
While there are candidates: process each candidate by adding its candidates to a new set, and possibly itself to a consumer set; if the success state is encountered, succeed; at some point filter out all candidates that have been seen from the new set; make the new set the candidate set; the output of this stage is the consumer set; if the consumer set is empty, then fail.
Although, if the consumer set is empty, then processing it will yield no candidates, so if the loop condition is the truthiness of the candidate set, then I could have an early return true inside the loops, and a fallback return false afterwards.
Wait, I'm confusing myself. This isn't a while loop, it's a for loop. Unless I make it a while loop and have the cursor handle advancing, which seems acceptable, and allows me to perform processing steps even when there aren't phonemes to process, which is necessary for the empty sequence to match the empty sequence.
I think I've got it all together. Now I just need to be focused on this stuff long enough to get it all into code.