5 (Wrong) Regex To Parse Parentheses

Regex are powerful. To the point you may try to parse HTML or Lisp with it. A doomed enterprise, right? But it's possible, actually. assets/paren-regex.png A banner on light gray (smoke shade) background. On it, REGEX 4 PARENS is written in large parentheses-mimicking letters. 'PARENS' is interspersed with green parentheses. In the corner, it reads 'AARTAKA' in same parenthetical letters. IMAGE_ALT

I’m trying to write Modal.ed, a compiler from Modal to ed(1) scripts. Which is quite a story of its own, I guess? I’ll leave it for later.

One part of this project is reliably parsing Modal syntax. Here are some examples of Modal rules:

<> ((unwrap ?x)) (unwrap ?x)
<> (q ?x) (q ?x)
<> (`?: ?0 ?1) ((Int ?:))
<> (?: print ') (?:)
<> (iterating ?x ?y ?n    (1) ?a ?b) (?((?:) ?:) \s\s)
<> (?0 ?1 `?:) ?:
<> rule0 data

The general pattern (hopefully apparent from the snippets) is:

Rules have a left hand side (pattern) and a right hand side (replacement.)
Both sides are wrapped in parentheses, but parens are sometimes omitted for obvious single-symbol cases.
The pattern/replacement can get as complex as the programmer wishes. But (hopefully) the parentheses are reliably balanced.

So all we need is to parse balanced parentheses in ed. Easy, right? This post is an exercise of desperation trying to parse parentheses with regex. A journey that has a happily ever after, unfortunately. But here’s a bit of boring stuff first:

Pragmatic: Recursive Regex

The search for perfect regex led me to StackOverflow. To the “Regular expression to match balanced parentheses” question in particular. A place where the best minds of the industry were either:

Suggesting non-portable vendor-specific regex extensions.

.NET and Perl regex are suggested multiple times, with a recurring form of

\((?:[^)(]+|(?R))*+\)

Perl is nice, but There Is No Such Thing As The Regex, and I have a soft spot for ed regex, and POSIX BREs in particular. These don’t have recursive matching. So I was bored out of my mind reading all these StackOverflow exercises in conformity. And I decided to prove them wrong: One does not need .NET or Perl to parse parentheses properly. POSIX BREs and text substitution are more than enough to work with parentheses and Modal rules.

Naive: It Is Balanced, Right?

So the simplest way to match parentheses is to... match parentheses. And nothing/anything else in between:

(.*)

There are multiple cases where this regex breaks:

Is matched as one piece due to regex greediness.
())))))): Is matched too, because we actually don’t look for balance.
()((((((): A variation on the previous too, but ugly enough to include.

We need more structure to paren matching. And we’ll have it:

Restrictive: I Will Be Correct Whatever the Cost

So we don’t want stray parens and we need balanced parentheses. We can arrange exactly that:

([^()]*)

So we match two parentheses and some number of non-parentheses inside. Which doesn’t give us false positives like in the first iteration. But doesn’t scale to more nested patterns.

Pedantic: Hardcoding Patterns

There has to be a way that covers most of the practical cases. I’m ready to cut corners, like too deeply nested parentheses. Which means most cases can be hard-coded:

# Level one
([^()]*)
# Level two
(\([^() ]\{0,1\}\(([^()]*)\)\{0,1\}\)*)
# Level three
(\([^() ]\{0,1\}\((\([^() ]\{0,1\}\(([^()]*)\)\{0,1\}\)*)\)\{0,1\}\)*)
# Et cetera

The second and third ones look weird, but it’s a way of saying: there must be a monolythic symbol (like hello) or a parenthesised list of symbols (like (hello there)) inside these two parentheses. This is due to BREs having no alternation pattern, CD(|). These rules can be conveniently rewritten in POSIX EREs (that’s the last time I employ such a bloated regex format in this post!)

\(([^()]+|\([^()]*\))*\)

Elegant, but less portable than BREs. I need to run it on BSD, UNIX, and GNU ed, so begone EREs!

Lispy: Looking for Tiger Tails

Another way to ensure the balance is to restrict the inputs. Say, we can require Modal rules to be Lisp-like in only having prefix operators. With all the argument at the tail. This would result in a dreaded “tiger tail” or parentheses at the end. But it also makes the parsing problem irrelevant: just match the tail and you’re golden:

((*[^)]*)*)

Most of Modal codebases are somewhat lispy already. So this regex has a higher success/LoC ratio than, for example, hardcoded rules. But I can see why one might be uneasy about it—it’s not really structural. Unlike...

Recursive, But Not That One

The structure of the Modal rules is:

Or it’s a list of symbols/registers.
Or it’s a list of symbols/registers and other lists.
And so on.

So it dawned on me: we can isolate symbols, then use these to highlight the simple lists and then propagate up the list ladder. So the algorithm I came up with is:

If the list consists entirely of quoted forms or is empty, remove all quotes inside and wrap the list itself into quotes.
Repeat until there’s nothing to substitute and the quoted list is the toplevel one.

And here’s (an arguably ugly) code that does this in my ed script:

# Wrap every symbol/nil into quotes
g/[^() ]\{1,\}/s//«&»/g
g/()/s//«&»/g
# Actual parsing: if the current layer of parens is all quoted, quote it and unquote contents
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6 \7 \8 \9)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6 \7 \8)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6 \7)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3)»/g
g/(«\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2)»/g
g/(«\([^»]*\)»)/s//«(\1)»/g
# Repeating second time for recursive case
# Ad infinitum

You may say I’m cheating, because it’s not only regex, it’s also substitution rules. But I warned you! The results are relatively good:

And at best the number of inidividual substitutions.
The average lies somewhere around the number I don’t know how to compute, but it should be good enough.
The algorithm would be perfect if there were looping constructs in ed. But there aren’t. I might’ve used sed(1) for looping, but... I despise sed.

So I have my parsing rules happily chewing on Modal code. In Modal.ed, check it out! Thanks for lending me a ear and embracing the power of regex 🖤