Rose tree syntaxes and applications

Rose trees are ordered trees with one label per node; they are the native data model for many systems like Prolog and Mathematica. They are an extremely simple and flexible data model, one which might be suitable as an alternative to plain unstructured ASCII text in Unix pipelines.

Candidate syntaxes and graphical representations

I'm going to interleave these because I think it's important to start with examples, and the different perspectives illuminate one another.

Syntax option #0: Prolog

Here is an example rose tree, with the [possibly empty ordered list of] children of each node following its label (hereafter "tag") in parentheses, Prolog-style:

conf(network(eth0(ip(192.168.0.10()), 
                  mask(255.255.255.0()),
                  gw(192.168.0.1()))),
     hostname(crispin()))

We could drop the empty parentheses as redundant:

conf(network(eth0(ip(192.168.0.10), 
                  mask(255.255.255.0),
                  gw(192.168.0.1))),
     hostname(crispin))

This example is taken from OGDL, which is a rose-tree data language extended to support multiple references and cycles.

Syntax option #1: S-expressions

The above is clearly equivalent to a restricted class of S-expressions, because we can just move the tag inside the parentheses, and traditionally we also drop the commas.

(conf (network (eth0 (ip (192.168.0.10))
                     (mask (255.255.255.0))
                     (gw (192.168.0.10))))
      (hostname (crispin)))

This is a restricted class of S-expressions because the first item in each list is required to be an atom, and atoms can appear nowhere else. We could relax the second restriction by deeming that an atom appearing elsewhere represents a single-atom list, i.e., a node with no children:

(conf (network (eth0 (ip 192.168.0.10)
                     (mask 255.255.255.0)
                     (gw 192.168.0.10)))
      (hostname crispin))

However, this approach is aesthetically unappealing because it has restrictions (and optionally equivalencies) that are not natural in the S-expression syntax, which easily accommodates empty lists and lists whose first element is another list, both of which are forbidden.

Syntax option #2: indented outlines

You can represent the same structure more harmoniously (but less compactly) with indentation, like an outline; the children of each node are indented under it:

conf
    network
        eth0
            ip
                192.168.0.10
            mask
                255.255.255.0
            gw
                192.168.0.1
    hostname
        crispin

Graphical representations of rose trees

The tags in rose trees are where their content is. You can think of each node as having a tag, each child pointer in a node as having a tag, or each edge as having a tag. There is a subtle difference: a rose forest in the first interpretation corresponds to a rose tree in the other two, since in the other two the root node has no tag associated with it.

If we put the tags on the nodes, the above structure looks like this:

There would be a diagram here.

Note that the child pointers are ordered; they are not interchangeable. This corresponds to the OCaml declaration

type rosetree = Node of string * (rosetree list)

If we put the tags on the child pointers, the same rose tree instead looks like this:

There would be another diagram here.

And if we put the tags on the edges, it looks like this:

There would be a third diagram here.

These two correspond to

type rosetree = Node of (string * rosetree) list

Since this is a tree and not a DAG, we can also represent it by recursively subdividing a rectangular space, with the left part containing the tag and the right part containing the children, one on top of the other:

There would be a fourth diagram here.

Syntax option #3: an offside rule

This last visual representation suggests a more compact indentation-based teletype-ASCII syntax quite similar to OGDL, which would require some kind of quoting rule to include whitespace inside tag names (as the above syntax would to include newlines):

conf network eth0 ip 192.168.0.10
                  mask 255.255.255.0
                  gw 192.168.0.1
     hostname crispin

It isn't obvious to me how to concisely state the offside rule I'm using here; it's something like "to find the parent, go up to the most recent line that is less indented, then left to the rightmost tag that is less indented".

This same rule also handles the earlier outline-style indentation-based syntax, along with mixed formatting like the original OGDL or like this:

conf network
         eth0 ip 192.168.0.10
              mask 255.255.255.0
              gw 192.168.0.1
     hostname crispin

Unfortunately, it commits us to fixed-width fonts, and it's a pain to edit (imagine you wanted to rename conf to config in the above). That means it's not a viable option for general use.

Syntax option #4: elastic tabstops

Instead of teletype ASCII, we could use something like Nick Gravgaard's "elastic tabstops", in which each line is divided into fields by tabs, and all the non-last fields are constrained to the same width as the corresponding non-last field in the previous line, if there is one (and, since equality is symmetric, the same width as the corresponding non-last field in the next line, if there is one). The last field on each line is free to occupy any amount of space without imposing any constraints on other lines.

This gives us the same visual clarity as the offside rule without the editing hassles, at the expense of compatibility with teletype-oriented systems and maybe the possibility of blank lines.

Syntax option #5: Jevko-like parenthesized child nodes

An equivalent alternative based on Jevko (see file labeled-tree-programming-language.md) is syntax based on explicit delimiters:

conf (network (eth0 (ip (192.168.0.10)
                     mask (255.255.255.0)
                     gw (192.168.0.1)))
     hostname (crispin))

The syntax rules here are, informally:

  1. A node is represented by alternating tags and parenthesized child nodes.

  2. Unquoted tags can contain any characters other than quotes and parentheses, but leading and trailing whitespace in tags is ignored. In particular, tags can contain internal whitespace.

  3. Tags can instead be quoted with apostrophes in order to preserve leading and trailing whitespace and contain parentheses. Apostrophes are doubled SQL-style within quoted tags.

  4. Empty child nodes are inferred where a nonempty tag precedes another tag or a right parenthesis.

If used without editor support, this approach can lead to unappealing parenthesis-counting puzzles; the standard approach to solving this problem in C is to put the closing delimiters on a separate line, vertically aligned with either the opening delimiter:

conf (network (eth0 (ip (192.168.0.10)
                     mask (255.255.255.0)
                     gw (192.168.0.1)
                    )
              )
     hostname (crispin)
    )

or the text that led up to it:

conf (network (eth0 (ip (192.168.0.10)
                     mask (255.255.255.0)
                     gw (192.168.0.1)
               )
      )
     hostname (crispin)
)

which works better if the corresponding open delimiters end their lines:

conf (
    network (
        eth0 (
            ip (192.168.0.10)
            mask (255.255.255.0)
            gw (192.168.0.1)
        )
    )
    hostname (crispin)
)

These options still have the problem that it's too easy to accidentally close too many parentheses, or not enough, writing something like

conf (
    network (
        eth0 (
            ip (192.168.0.10)
            mask (255.255.255.0)
            gw (192.168.0.1)
        )
    hostname (crispin)
)
)

and there's no redundancy to help you find the error. The indentation-based syntaxes are much better in this sense.

A different sort of editor support for this would be to draw the parentheses vertically extended on the screen to embrace everything up to the matching parenthesis and, in the case of close parentheses, moved to the right. Here's my attempt at achieving this with CSS and HTML:

conf network eth0 ip 192.168.0.10
mask 255.255.255.0
gw 192.168.0.1

hostname crispin

This syntax provides a degree of layout flexibility absent not only from the offside-rule alternative, but also from the Jevko notation that inspired it, because you can use ignored whitespace to position the child nodes horizontally and vertically where you prefer.

This shares with options #2, #3, and #4 the desirable property that it is possible to concatenate two valid files and get another valid file, and with options #0, #1, #4, and #6 the desirable property that you can know when you've reached the end of a node's children.

Syntax option #6: XML

Please do not do anything like this:

<conf><network><eth0><ip>192.168.0.10</ip>
                     <mask>255.255.255.0</mask>
                     <gw>192.168.0.1</gw></eth0></network>
     <hostname>crispin</hostname>
</conf>

Although the semantic model is not far off (the children of an XML element are an ordered list, and each one has a tag if it's an element rather than a text node), and there is so much redundancy that errors are easy to report, this syntax is terrible. XML is designed for adding markup to text, not representing tree structures.

Mapping data semantics onto rose trees

(I'm going to mostly use the offside-rule syntax here because it's the calmest, even though, as I said, it's not that suitable for general use.)

Non-leaf tags as properties

In the above example, however it's represented syntactically, each of the nodes represents an entity, and most of the tags except the leaf tags represent the names of properties, fields, or attributes, drawn from a finite, predetermined set. Child nodes and leaf tags represent entity values. The network configuration being described above is presumably being navigated by something that knows that it's looking for network and ip. Moreover, as it happens, in the above example, tags are unique within each parent node; no node has two ip children, for example. This permits easy navigation by pathnames like conf.network.eth0.gw.

To the extent that this is true, we can add new child nodes to any non-leaf node with impunity, using new property names, without breaking any existing users:

conf network eth0 ip 192.168.0.10
                  mask 255.255.255.0
                  gw 192.168.0.1
                  mtu 1500
                  broadcast 192.168.0.255
                  kilroy was here
     hostname crispin
     kernel 5.17.2-24-generic
     here too

However, the tag eth0 is an exception; rather than a property name, it's the value of a primary key, the name of the Ethernet network interface; if we start adding siblings then their tags will presumably also be interpreted as network interface names. If we wanted to rigorously stick to the non-leaf-tags-are-properties rule, we'd end up with something like this:

conf network interface name eth0
                       ip 192.168.0.10
                       mask 255.255.255.0
                       gw 192.168.0.1
     hostname crispin

This means that if we have more than one network interface, our tags will no longer be unique within a parent:

conf network interface name eth0
                       ip 192.168.0.10
                       mask 255.255.255.0
                       gw 192.168.0.1
             interface name lo
                       ip 127.0.0.1
                       mask 255.0.0.0
     hostname crispin

This also requires a more elaborate query to access the information, maybe something like one of these:

~eth0 ~name gw

conf network interface [name=eth0] gw

conf network interface {name eth0} gw ?x

conf network interface name eth0
                       gw ?x

In this case, the order of the different interfaces and the ordering of properties within an interface isn't actually significant, but in cases where it is significant, your query language needs to handle that as well.

Top-level tags as N-ary relations

In Prolog, you might have predicates like interface/4 and route/4 to encode information like this positionally:

interface(eth0, "192.168.0.10", "255.255.255.0", "192.168.0.255").
interface(lo, "127.0.0.1", "255.0.0.0", "127.255.255.255").
route("0.0.0.0", "192.168.0.1", "0.0.0.0", eth0).

You can add new facts about the entities named by one or more of these values by inventing new predicates rather than trying to wedge new values into old trees (except the root).

These are of course rose trees and can be written with the offside rule with no extra punctuation:

interface eth0
          192.168.0.10
          255.255.255.0
          192.168.0.255

interface lo
          127.0.0.1
          255.0.0.0
          127.255.255.255

route 0.0.0.0
      192.168.0.1
      0.0.0.0
      eth0

But this is not very readable because of the low information density. The parenthesized-child-nodes syntax is more helpful here:

interface (eth0 '192.168.0.10' '255.255.255.0' '192.168.0.255')
interface (lo   '127.0.0.1'    '255.0.0.0'     '127.255.255.255')
route ('0.0.0.0' '192.168.0.1' '0.0.0.0' eth0)

This could have been written in that same syntax less readably as:

interface (eth0() 192.168.0.10() 255.255.255.0() 192.168.0.255)
interface (lo()   127.0.0.1()    255.0.0.0()     127.255.255.255)
route (0.0.0.0() 192.168.0.1() '0.0.0.0() eth0)

Nesting for more compact, tabular N-ary relations with the offside rule

Suppose that instead we chain or nest the various attributes of an interface as its first chld. Now our example looks like this:

conf network if eth0 192.168.0.10 255.255.255.0 192.168.0.255
             if lo   127.0.0.1    255.0.0.0     127.255.255.255
             route 0.0.0.0 192.168.0.1 0.0.0.0 eth0
     hostname crispin

Or we could put the interface chains as separate values under a single if parent node:

conf network if eth0 192.168.0.10 255.255.255.0 192.168.0.255
                lo   127.0.0.1    255.0.0.0     127.255.255.255
             route 0.0.0.0 192.168.0.1 0.0.0.0 eth0
     hostname crispin

With parenthesized child nodes this would look fairly appalling:

conf (network (if (eth0 (192.168.0.10 (255.255.255.0 (192.168.0.255))))
                  (lo   (127.0.0.1    (255.0.0.0     (127.255.255.255))))
              route (0.0.0.0 (192.168.0.1 (0.0.0.0 (eth0)))))
     (hostname (crispin)))

XXX that is rong

Properties with multiple values

Suppose that non-leaf tags are properties. A straightforward thing to do is to have more than one value for the same property on the same entity, departing from first normal form.

Books are a good example; MARC records have lots of multi-valued fields because Henriette Avram's MARC Pilot Project finished in 01968, but Codd didn't publish relational database theory until 01970.

Commonly a book has an author, a title, a publication year, a publisher, and an ISBN. But consider The Art of Electronics, or see the MARC record for its second edition.

book title 'The Art of Electronics'
     edition edition 3rd
             year 02015
             author 'Horowitz, Paul'
             author 'Hill, Winfield'
             publisher 'Cambridge University Press'
             oclc 904400036
             isbn 0521809266
             isbn 9780521809269
             pages 1192
             language en
             subject Electronics
             subject Electronic circuit design
     edition edition 2nd
             year 01995
             author 'Hill, Winfield'
             author 'Horowitz, Paul'
             publisher 'Cambridge University Press'
             subject Electronics
             subject Electronic circuit design
             pages 1125
             language en
             isbn 0521370957
             lccn 89000468
             loc TK7815 .H67 1989

Here's that same data in the parenthesized-child-nodes syntax, which permits more flexibility in layout:

book (title (The Art of Electronics)
      edition (edition (3rd)  year (02015)  pages (1192)  language (en)
               author (Horowitz, Paul)   subject (Electronic circuit design)
               author (Hill, Winfield)   subject (Electronics)
               publisher (Cambridge University Press)
               oclc (904400036)  isbn (0521809266)  isbn (9780521809269))
      edition (edition (2nd)  year (01995)  pages (1125)  language (en)
               author (Hill, Winfield)   subject (Electronics)
               author (Horowitz, Paul)   subject (Electronic circuit design)
               publisher (Cambridge University Press)
               isbn (0521370957)  lccn (89000468)  loc (TK7815 .H67 1989)))

One-to-many relations here include book-to-edition and edition-to-ISBN, and many-to-many relations include book-to-title, edition-to-author, and edition-to-subject. It's hard to have faith we've gotten them all. And data that may be missing is potentially everything: not every edition has a known publication year or author or ISBN, etc. So it's nice to not have to get this stuff right up front.

On the other hand, experience with MongoDB and JSON has shown that it's often a real pain to deal with a database where you don't even really know its structure.

Column-oriented storage

Instead of having nodes represent entities with tags within them representing attributes, you can of course have nodes represent attributes with tags within them representing entities. Perhaps more interesting is having a node represent an attribute and then have a long sequence of leaf children whose positional index tells you which entity their value pertains to. I include this for completeness but it probably is too much of a pain to hand-edit.

Applications for a rose-tree data format

So, you can represent anything with rose trees, just as you can represent anything with S-expressions, or for that matter XML or SQL dumps. What would you want to represent that way?

Free-form databases

Consider the book example above. I think it's fairly human-editable, especially with some editor support. It's easy to imagine typing in a database of items of interest (MP3 files, books, research papers, web bookmarks) and doing fielded search and more complex database queries over it.

Having a reasonably readable, flexible, machine-readable, and recursive syntax for these things means that it's possible to use the same syntax for data-entry screens, data-query screens, reports, and data storage. Specialized formats can do these things better, but a baseline format that is adequate for all four gives you a much simpler system to start with.

It should be possible to do somewhat-generic autocomplete for this sort of thing. If you're in a book node like the example above, your database editor might look at other book nodes and see that they commonly have title and edition properties, offering these as autocompletes; if you add a book.edition perhaps it could offer to paste another recent book.edition structure as a template; and in a book.edition.subject it ought to offer as autocompletes strings that commonly occur in book.edition.subject, as well as a key to add an additional subject, since book.edition often has more than one subject. Of course you could do better by explicitly writing code that configures these things, but it should be possible to provide reasonable defaults.

It would also be great to attach calculated fields and alternate presentations to this sort of thing. Probably the canonical example of a calculated field would be updating a amt field every time a qty or unitprice field is updated in an invoice.lineitem context, and adding a total to the invoice with a sum of all the amt fields. Alternate presentations might include things like matrixes of scatterplots of data frames, as well as of course pleasantly-laid-out screens.

Human- and machine-readable output from things that aren't databases

One of the nice things about the Unix shell environment is that the output from every command is machine-readable. But it's not very machine-readable; often it's machine-readable only unreliably, or after a lot of wrangling regular expressions, and the output formats are difficult to extend without breaking compatibility with previous consumers. Often commands resort to one hack or another to compromise between human-readability and machine-readability: the invisible filename separators of GNU find -0, the single-column output of ls when writing to a pipe, the conditional suppression of coloring escape sequences in GNU grep --color=auto.

You could imagine an environment where the output of all your commands was represented in this rose-tree format, enabling reliable and information-rich machine-readability with backward-compatibility, with the possibility of your terminal using the data-type tagging to look up both visual presentation options and possible user interactions, but having a concise and unambiguous text representation to gracefully degrade to in the case of unknown tags.

Ode to grep

Consider this example from Emacs:

grep --color -nH --null -e OGDL *
rose-tree-syntaxes.md:35:This example is taken from [OGDL](https://ogdl.org/), which is a
rose-tree-syntaxes.md:127:indentation-based teletype-ASCII syntax quite similar to OGDL, which

The first colon on these lines is actually an ASCII NUL, which Emacs displays as a colon in order to avoid problems with filenames containing colons. Emacs parses out the line number and highlights it, as well as making the line a clickable hyperlink to the relevant line in the file; but also --color here actually means --color=auto which doesn't generate any actual color. Wouldn't it be better to have grep's output semantically tagged with something like

ghit(fn(rose-tree-syntaxes.md) ln(35)
    <(This example is taken from [) &(OGDL) >('](https://ogdl.org/), which is a'))

with the filename, line number, pre-match, match, and post-match each tagged separately? Later versions of grep could add more information to this without breaking compatibility, it wouldn't have to choose between regular readable output and reliably parseable output, you wouldn't have to remember that -o is the option to only print out the matched string, and your terminal could be configured to highlight the ghit & in a different color and hide the ghit fn unless asked.

By feeding the output to a sort program you could sort the ghits by filename or by the text following the match or whatever.

In this case the same format would serve for TAGS files, whose records include byte positions, line numbers, excerpted text, and identifiers believed to be defined there, and are grouped into groups by file.

With a slightly different grep format that groups the ghits by filename, the terminal could also offer you the ability to collapse the results for a particular file; it is a common problem with grep that you have one humongous file that contains almost all your hits, all spurious.

This output format is a little more verbose but in this case the structure overhead is only 30%, even with the extra whitespace.

Command-line calling interfaces

Above I mentioned using rose trees for data entry forms, but it occurred to me that command lines are also sort of form-like, and you could reasonably have forms for common command-lines that you tab around in.

The commands on OS/400 were similar to this:

RNMOBJ OBJ(OLDNAME) OBJTYPE(*FILE) NEWOBJ(NEWNAME)

In GNU syntax this would be

rnmobj --obj oldname --objtype file --newobj newname

though in fact GNU mv takes positional parameters rather than named ones. In the parenthesized-child-node syntax it looks almost the same as OS/400:

rnmobj(obj(oldname) objtype(*file) newobj(newname))

but also if you hit F4 you would get an honest-to-God form to fill out with the command-line parameters in form fields, with help displayed for each field. However, the parameter names for the form fields would be hidden; you'd have to hit F9 after invoking the command to recall the one-line version of what you'd created in the form.

Many CGI scripts will provide you with an HTML form to invoke them with when invoked without arguments. Perhaps command-line programs should do the same thing, and the manual page should include many command forms to edit and invoke.

So maybe running grep without arguments would spit out

grep(regexp(foo.*bar) files(*.c) ⌄)

with your cursor positioned in the regexp field, and if you tabbed over to the it would reveal the other grep options.

Structured logging

Large nested structured records are useful in tracking down operational problems or plotting metrics over time.

Configuration files

OGDL originated as an effort to simplify YAML, which is commonly used for configuration files. Rose trees seem like a reasonable data model for configuration files. And if your configuration files are parseable, you can query them or reliably update them in a program.

Term-rewriting programming languages

For some things, rose trees are probably the best data format. For my simple term-rewriting language Qfitzah (see file term-rewriting-micro-interpreter.md in Dernocua) I wanted to input my S-expression rewrite rules like this, without any parentheses around the outermost list:

And x y: If x y x

This means "Rewrite a list whose first element is And and with two other elements to a list whose first element is If and contains the same two elements followed by the first one again." Then there were rules for If Yes a b and If No a b, etc. Lines without a colon represented S-expressions to try to rewrite with all the existing rules rather than new rewrite rules.

The problem I ran into with this is that it was unclear how to interpret a line saying something like

Not Yes: No

The right-hand side is a single atom. The idea is that if you have at some point a list saying something like If (Not Yes) foo bar then it should get rewritten to If No foo bar. But I was getting If (No) foo bar, because if the outermost parentheses are omitted, there's no syntactic way to distinguish the bare atom No from the single-atom list (No).

I gave up on the readable syntax above and went with

(Not Yes) No

but rose trees eliminate the problematic distinction: No() is the same as No. Also I think this might be a more explicit syntax, though it certainly isn't easier to read:

rewrite(And(?(x) ?(y))) to(If(?(x) ?(y) ?(x)))

For efficient dispatch in term-rewriting languages, it is probably valuable to require that the head of a clause being matched not be a variable. (This doesn't result in the same reduction of expressivity as in functional and imperative languages.) If your rewrite rules are expressed in terms of rose trees, the places you can put variables are the nodes of a tree, not its tags, so that restriction falls out naturally, as it does in Prolog.

Tuple spaces

For a similar reason, I suspect tuple spaces benefit from having a distinguished field for tags which must be filled when applying an in() or rd() operation; you don't want tuples you stick into a tuple space to get vacuumed up by some piece of code that is just matching all tuples. Almost invariably in the early tuple-space literature the first element of every tuple being matched is a literal string.