I’m trying to write Modal.ed, a compiler from Modal to ed(1) scripts. Which is quite a story of its own, I guess? I’ll leave it for later.
One part of this project is reliably parsing Modal syntax. Here are some examples of Modal rules:
<> ((unwrap ?x)) (unwrap ?x) <> (q ?x) (q ?x) <> (`?: ?0 ?1) ((Int ?:)) <> (?: print ') (?:) <> (iterating ?x ?y ?n (1) ?a ?b) (?((?:) ?:) \s\s) <> (?0 ?1 `?:) ?: <> rule0 data
The general pattern (hopefully apparent from the snippets) is:
So all we need is to parse balanced parentheses in ed. Easy, right? This post is an exercise of desperation trying to parse parentheses with regex. A journey that has a happily ever after, unfortunately. But here’s a bit of boring stuff first:
The search for perfect regex led me to StackOverflow. To the “Regular expression to match balanced parentheses” question in particular. A place where the best minds of the industry were either:
.NET and Perl regex are suggested multiple times, with a recurring form of
\((?:[^)(]+|(?R))*+\)
Perl is nice, but There Is No Such Thing As The Regex, and I have a soft spot for ed regex, and POSIX BREs in particular. These don’t have recursive matching. So I was bored out of my mind reading all these StackOverflow exercises in conformity. And I decided to prove them wrong: One does not need .NET or Perl to parse parentheses properly. POSIX BREs and text substitution are more than enough to work with parentheses and Modal rules.
So the simplest way to match parentheses is to... match parentheses. And nothing/anything else in between:
(.*)
There are multiple cases where this regex breaks:
We need more structure to paren matching. And we’ll have it:
So we don’t want stray parens and we need balanced parentheses. We can arrange exactly that:
([^()]*)
So we match two parentheses and some number of non-parentheses inside. Which doesn’t give us false positives like in the first iteration. But doesn’t scale to more nested patterns.
There has to be a way that covers most of the practical cases. I’m ready to cut corners, like too deeply nested parentheses. Which means most cases can be hard-coded:
# Level one
([^()]*)
# Level two
(\([^() ]\{0,1\}\(([^()]*)\)\{0,1\}\)*)
# Level three
(\([^() ]\{0,1\}\((\([^() ]\{0,1\}\(([^()]*)\)\{0,1\}\)*)\)\{0,1\}\)*)
# Et cetera
The second and third ones look weird, but it’s a way of saying:
there must be a monolythic symbol (like
Elegant, but less portable than BREs.
I need to run it on BSD, UNIX, and GNU ed, so begone EREs!
Another way to ensure the balance is to restrict the inputs.
Say, we can require Modal rules to be Lisp-like in only having prefix operators.
With all the argument at the tail.
This would result in a dreaded “tiger tail” or parentheses at the end.
But it also makes the parsing problem irrelevant: just match the tail and you’re golden:
Most of Modal codebases are somewhat lispy already.
So this regex has a higher success/LoC ratio than, for example, hardcoded rules.
But I can see why one might be uneasy about it—it’s not really structural.
Unlike...
The structure of the Modal rules is:
So it dawned on me: we can isolate symbols,
then use these to highlight the simple lists and then propagate up the list ladder.
So the algorithm I came up with is:
And here’s (an arguably ugly) code that does this in my ed script:
You may say I’m cheating, because it’s not only regex, it’s also substitution rules.
But I warned you!
The results are relatively good:
So I have my parsing rules happily chewing on Modal code.
In Modal.ed,
check it out!
Thanks for lending me a ear and embracing the power of regex 🖤
\(([^()]+|\([^()]*\))*\)
Lispy: Looking for Tiger Tails
((*[^)]*)*)
Recursive, But Not That One
It’s a symbol or register.
Wrap all simple symbols in Russian quotes: «». (Because they are distinctive and unique enough for parsing, and because I’m ethnically Russian.)
# Wrap every symbol/nil into quotes
g/[^() ]\{1,\}/s//«&»/g
g/()/s//«&»/g
# Actual parsing: if the current layer of parens is all quoted, quote it and unquote contents
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6 \7 \8 \9)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6 \7 \8)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6 \7)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3)»/g
g/(«\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2)»/g
g/(«\([^»]*\)»)/s//«(\1)»/g
# Repeating second time for recursive case
# Ad infinitum
The parsed depth is at worst the number of copy-pasted ladders.