Rose trees are ordered trees with one label per node; they are the native data model for many systems like Prolog and Mathematica. They are an extremely simple and flexible data model, one which might be suitable as an alternative to plain unstructured ASCII text in Unix pipelines.
I'm going to interleave these because I think it's important to start with examples, and the different perspectives illuminate one another.
Here is an example rose tree, with the [possibly empty ordered list of] children of each node following its label (hereafter "tag") in parentheses, Prolog-style:
conf(network(eth0(ip(192.168.0.10()),
mask(255.255.255.0()),
gw(192.168.0.1()))),
hostname(crispin()))
We could drop the empty parentheses as redundant:
conf(network(eth0(ip(192.168.0.10),
mask(255.255.255.0),
gw(192.168.0.1))),
hostname(crispin))
This example is taken from OGDL, which is a rose-tree data language extended to support multiple references and cycles.
The above is clearly equivalent to a restricted class of S-expressions, because we can just move the tag inside the parentheses, and traditionally we also drop the commas.
(conf (network (eth0 (ip (192.168.0.10))
(mask (255.255.255.0))
(gw (192.168.0.10))))
(hostname (crispin)))
This is a restricted class of S-expressions because the first item in each list is required to be an atom, and atoms can appear nowhere else. We could relax the second restriction by deeming that an atom appearing elsewhere represents a single-atom list, i.e., a node with no children:
(conf (network (eth0 (ip 192.168.0.10)
(mask 255.255.255.0)
(gw 192.168.0.10)))
(hostname crispin))
However, this approach is aesthetically unappealing because it has restrictions (and optionally equivalencies) that are not natural in the S-expression syntax, which easily accommodates empty lists and lists whose first element is another list, both of which are forbidden.
You can represent the same structure more harmoniously (but less compactly) with indentation, like an outline; the children of each node are indented under it:
conf
network
eth0
ip
192.168.0.10
mask
255.255.255.0
gw
192.168.0.1
hostname
crispin
The tags in rose trees are where their content is. You can think of each node as having a tag, each child pointer in a node as having a tag, or each edge as having a tag. There is a subtle difference: a rose forest in the first interpretation corresponds to a rose tree in the other two, since in the other two the root node has no tag associated with it.
If we put the tags on the nodes, the above structure looks like this:
Note that the child pointers are ordered; they are not interchangeable. This corresponds to the OCaml declaration
type rosetree = Node of string * (rosetree list)
If we put the tags on the child pointers, the same rose tree instead looks like this:
And if we put the tags on the edges, it looks like this:
These two correspond to
type rosetree = Node of (string * rosetree) list
Since this is a tree and not a DAG, we can also represent it by recursively subdividing a rectangular space, with the left part containing the tag and the right part containing the children, one on top of the other:
This last visual representation suggests a more compact indentation-based teletype-ASCII syntax quite similar to OGDL, which would require some kind of quoting rule to include whitespace inside tag names (as the above syntax would to include newlines):
conf network eth0 ip 192.168.0.10
mask 255.255.255.0
gw 192.168.0.1
hostname crispin
It isn't obvious to me how to concisely state the offside rule I'm using here; it's something like "to find the parent, go up to the most recent line that is less indented, then left to the rightmost tag that is less indented".
This same rule also handles the earlier outline-style indentation-based syntax, along with mixed formatting like the original OGDL or like this:
conf network
eth0 ip 192.168.0.10
mask 255.255.255.0
gw 192.168.0.1
hostname crispin
Unfortunately, it commits us to fixed-width fonts, and it's a pain to
edit (imagine you wanted to rename conf
to config
in the above).
That means it's not a viable option for general use.
Instead of teletype ASCII, we could use something like Nick Gravgaard's "elastic tabstops", in which each line is divided into fields by tabs, and all the non-last fields are constrained to the same width as the corresponding non-last field in the previous line, if there is one (and, since equality is symmetric, the same width as the corresponding non-last field in the next line, if there is one). The last field on each line is free to occupy any amount of space without imposing any constraints on other lines.
This gives us the same visual clarity as the offside rule without the editing hassles, at the expense of compatibility with teletype-oriented systems and maybe the possibility of blank lines.
An equivalent alternative based on Jevko (see file
labeled-tree-programming-language.md
) is syntax based on explicit
delimiters:
conf (network (eth0 (ip (192.168.0.10)
mask (255.255.255.0)
gw (192.168.0.1)))
hostname (crispin))
The syntax rules here are, informally:
A node is represented by alternating tags and parenthesized child nodes.
Unquoted tags can contain any characters other than quotes and parentheses, but leading and trailing whitespace in tags is ignored. In particular, tags can contain internal whitespace.
Tags can instead be quoted with apostrophes in order to preserve leading and trailing whitespace and contain parentheses. Apostrophes are doubled SQL-style within quoted tags.
Empty child nodes are inferred where a nonempty tag precedes another tag or a right parenthesis.
If used without editor support, this approach can lead to unappealing parenthesis-counting puzzles; the standard approach to solving this problem in C is to put the closing delimiters on a separate line, vertically aligned with either the opening delimiter:
conf (network (eth0 (ip (192.168.0.10)
mask (255.255.255.0)
gw (192.168.0.1)
)
)
hostname (crispin)
)
or the text that led up to it:
conf (network (eth0 (ip (192.168.0.10)
mask (255.255.255.0)
gw (192.168.0.1)
)
)
hostname (crispin)
)
which works better if the corresponding open delimiters end their lines:
conf (
network (
eth0 (
ip (192.168.0.10)
mask (255.255.255.0)
gw (192.168.0.1)
)
)
hostname (crispin)
)
These options still have the problem that it's too easy to accidentally close too many parentheses, or not enough, writing something like
conf (
network (
eth0 (
ip (192.168.0.10)
mask (255.255.255.0)
gw (192.168.0.1)
)
hostname (crispin)
)
)
and there's no redundancy to help you find the error. The indentation-based syntaxes are much better in this sense.
A different sort of editor support for this would be to draw the parentheses vertically extended on the screen to embrace everything up to the matching parenthesis and, in the case of close parentheses, moved to the right. Here's my attempt at achieving this with CSS and HTML:
This syntax provides a degree of layout flexibility absent not only from the offside-rule alternative, but also from the Jevko notation that inspired it, because you can use ignored whitespace to position the child nodes horizontally and vertically where you prefer.
This shares with options #2, #3, and #4 the desirable property that it is possible to concatenate two valid files and get another valid file, and with options #0, #1, #4, and #6 the desirable property that you can know when you've reached the end of a node's children.
Please do not do anything like this:
<conf><network><eth0><ip>192.168.0.10</ip>
<mask>255.255.255.0</mask>
<gw>192.168.0.1</gw></eth0></network>
<hostname>crispin</hostname>
</conf>
Although the semantic model is not far off (the children of an XML element are an ordered list, and each one has a tag if it's an element rather than a text node), and there is so much redundancy that errors are easy to report, this syntax is terrible. XML is designed for adding markup to text, not representing tree structures.
(I'm going to mostly use the offside-rule syntax here because it's the calmest, even though, as I said, it's not that suitable for general use.)
In the above example, however it's represented syntactically, each of
the nodes represents an entity, and most of the tags except the leaf
tags represent the names of properties, fields, or attributes, drawn
from a finite, predetermined set. Child nodes and leaf tags represent
entity values. The network configuration being described above is
presumably being navigated by something that knows that it's looking
for network
and ip
. Moreover, as it happens, in the above
example, tags are unique within each parent node; no node has two ip
children, for example. This permits easy navigation by pathnames like
conf.network.eth0.gw
.
To the extent that this is true, we can add new child nodes to any non-leaf node with impunity, using new property names, without breaking any existing users:
conf network eth0 ip 192.168.0.10
mask 255.255.255.0
gw 192.168.0.1
mtu 1500
broadcast 192.168.0.255
kilroy was here
hostname crispin
kernel 5.17.2-24-generic
here too
However, the tag eth0
is an exception; rather than a property name,
it's the value of a primary key, the name of the Ethernet network
interface; if we start adding siblings then their tags will presumably
also be interpreted as network interface names. If we wanted to
rigorously stick to the non-leaf-tags-are-properties rule, we'd end up
with something like this:
conf network interface name eth0
ip 192.168.0.10
mask 255.255.255.0
gw 192.168.0.1
hostname crispin
This means that if we have more than one network interface, our tags will no longer be unique within a parent:
conf network interface name eth0
ip 192.168.0.10
mask 255.255.255.0
gw 192.168.0.1
interface name lo
ip 127.0.0.1
mask 255.0.0.0
hostname crispin
This also requires a more elaborate query to access the information, maybe something like one of these:
~eth0 ~name gw
conf network interface [name=eth0] gw
conf network interface {name eth0} gw ?x
conf network interface name eth0
gw ?x
In this case, the order of the different interfaces and the ordering of properties within an interface isn't actually significant, but in cases where it is significant, your query language needs to handle that as well.
In Prolog, you might have predicates like interface/4
and route/4
to
encode information like this positionally:
interface(eth0, "192.168.0.10", "255.255.255.0", "192.168.0.255").
interface(lo, "127.0.0.1", "255.0.0.0", "127.255.255.255").
route("0.0.0.0", "192.168.0.1", "0.0.0.0", eth0).
You can add new facts about the entities named by one or more of these values by inventing new predicates rather than trying to wedge new values into old trees (except the root).
These are of course rose trees and can be written with the offside rule with no extra punctuation:
interface eth0
192.168.0.10
255.255.255.0
192.168.0.255
interface lo
127.0.0.1
255.0.0.0
127.255.255.255
route 0.0.0.0
192.168.0.1
0.0.0.0
eth0
But this is not very readable because of the low information density. The parenthesized-child-nodes syntax is more helpful here:
interface (eth0 '192.168.0.10' '255.255.255.0' '192.168.0.255')
interface (lo '127.0.0.1' '255.0.0.0' '127.255.255.255')
route ('0.0.0.0' '192.168.0.1' '0.0.0.0' eth0)
This could have been written in that same syntax less readably as:
interface (eth0() 192.168.0.10() 255.255.255.0() 192.168.0.255)
interface (lo() 127.0.0.1() 255.0.0.0() 127.255.255.255)
route (0.0.0.0() 192.168.0.1() '0.0.0.0() eth0)
Suppose that instead we chain or nest the various attributes of an interface as its first chld. Now our example looks like this:
conf network if eth0 192.168.0.10 255.255.255.0 192.168.0.255
if lo 127.0.0.1 255.0.0.0 127.255.255.255
route 0.0.0.0 192.168.0.1 0.0.0.0 eth0
hostname crispin
Or we could put the interface chains as separate values under a single
if
parent node:
conf network if eth0 192.168.0.10 255.255.255.0 192.168.0.255
lo 127.0.0.1 255.0.0.0 127.255.255.255
route 0.0.0.0 192.168.0.1 0.0.0.0 eth0
hostname crispin
With parenthesized child nodes this would look fairly appalling:
conf (network (if (eth0 (192.168.0.10 (255.255.255.0 (192.168.0.255))))
(lo (127.0.0.1 (255.0.0.0 (127.255.255.255))))
route (0.0.0.0 (192.168.0.1 (0.0.0.0 (eth0)))))
(hostname (crispin)))
XXX that is rong
Suppose that non-leaf tags are properties. A straightforward thing to do is to have more than one value for the same property on the same entity, departing from first normal form.
Books are a good example; MARC records have lots of multi-valued fields because Henriette Avram's MARC Pilot Project finished in 01968, but Codd didn't publish relational database theory until 01970.
Commonly a book has an author, a title, a publication year, a publisher, and an ISBN. But consider The Art of Electronics, or see the MARC record for its second edition.
book title 'The Art of Electronics'
edition edition 3rd
year 02015
author 'Horowitz, Paul'
author 'Hill, Winfield'
publisher 'Cambridge University Press'
oclc 904400036
isbn 0521809266
isbn 9780521809269
pages 1192
language en
subject Electronics
subject Electronic circuit design
edition edition 2nd
year 01995
author 'Hill, Winfield'
author 'Horowitz, Paul'
publisher 'Cambridge University Press'
subject Electronics
subject Electronic circuit design
pages 1125
language en
isbn 0521370957
lccn 89000468
loc TK7815 .H67 1989
Here's that same data in the parenthesized-child-nodes syntax, which permits more flexibility in layout:
book (title (The Art of Electronics)
edition (edition (3rd) year (02015) pages (1192) language (en)
author (Horowitz, Paul) subject (Electronic circuit design)
author (Hill, Winfield) subject (Electronics)
publisher (Cambridge University Press)
oclc (904400036) isbn (0521809266) isbn (9780521809269))
edition (edition (2nd) year (01995) pages (1125) language (en)
author (Hill, Winfield) subject (Electronics)
author (Horowitz, Paul) subject (Electronic circuit design)
publisher (Cambridge University Press)
isbn (0521370957) lccn (89000468) loc (TK7815 .H67 1989)))
One-to-many relations here include book-to-edition and edition-to-ISBN, and many-to-many relations include book-to-title, edition-to-author, and edition-to-subject. It's hard to have faith we've gotten them all. And data that may be missing is potentially everything: not every edition has a known publication year or author or ISBN, etc. So it's nice to not have to get this stuff right up front.
On the other hand, experience with MongoDB and JSON has shown that it's often a real pain to deal with a database where you don't even really know its structure.
Instead of having nodes represent entities with tags within them representing attributes, you can of course have nodes represent attributes with tags within them representing entities. Perhaps more interesting is having a node represent an attribute and then have a long sequence of leaf children whose positional index tells you which entity their value pertains to. I include this for completeness but it probably is too much of a pain to hand-edit.
So, you can represent anything with rose trees, just as you can represent anything with S-expressions, or for that matter XML or SQL dumps. What would you want to represent that way?
Consider the book example above. I think it's fairly human-editable, especially with some editor support. It's easy to imagine typing in a database of items of interest (MP3 files, books, research papers, web bookmarks) and doing fielded search and more complex database queries over it.
Having a reasonably readable, flexible, machine-readable, and recursive syntax for these things means that it's possible to use the same syntax for data-entry screens, data-query screens, reports, and data storage. Specialized formats can do these things better, but a baseline format that is adequate for all four gives you a much simpler system to start with.
It should be possible to do somewhat-generic autocomplete for this
sort of thing. If you're in a book
node like the example above,
your database editor might look at other book
nodes and see that
they commonly have title
and edition
properties, offering these as
autocompletes; if you add a book.edition
perhaps it could offer to
paste another recent book.edition
structure as a template; and in a
book.edition.subject
it ought to offer as autocompletes strings that
commonly occur in book.edition.subject
, as well as a key to add an
additional subject
, since book.edition
often has more than one
subject
. Of course you could do better by explicitly writing code
that configures these things, but it should be possible to provide
reasonable defaults.
It would also be great to attach calculated fields and alternate
presentations to this sort of thing. Probably the canonical example
of a calculated field would be updating a amt
field every time a
qty
or unitprice
field is updated in an invoice.lineitem
context, and adding a total
to the invoice
with a sum of all the
amt
fields. Alternate presentations might include things like
matrixes of scatterplots of data frames, as well as of course
pleasantly-laid-out screens.
One of the nice things about the Unix shell environment is that the
output from every command is machine-readable. But it's not very
machine-readable; often it's machine-readable only unreliably, or
after a lot of wrangling regular expressions, and the output formats
are difficult to extend without breaking compatibility with previous
consumers. Often commands resort to one hack or another to compromise
between human-readability and machine-readability: the invisible
filename separators of GNU find -0
, the single-column output of ls
when writing to a pipe, the conditional suppression of coloring escape
sequences in GNU grep --color=auto
.
You could imagine an environment where the output of all your commands was represented in this rose-tree format, enabling reliable and information-rich machine-readability with backward-compatibility, with the possibility of your terminal using the data-type tagging to look up both visual presentation options and possible user interactions, but having a concise and unambiguous text representation to gracefully degrade to in the case of unknown tags.
Consider this example from Emacs:
grep --color -nH --null -e OGDL *
rose-tree-syntaxes.md:35:This example is taken from [OGDL](https://ogdl.org/), which is a
rose-tree-syntaxes.md:127:indentation-based teletype-ASCII syntax quite similar to OGDL, which
The first colon on these lines is actually an ASCII NUL, which Emacs
displays as a colon in order to avoid problems with filenames
containing colons. Emacs parses out the line number and highlights
it, as well as making the line a clickable hyperlink to the relevant
line in the file; but also --color
here actually means
--color=auto
which doesn't generate any actual color. Wouldn't it
be better to have grep's output semantically tagged with something
like
ghit(fn(rose-tree-syntaxes.md) ln(35)
<(This example is taken from [) &(OGDL) >('](https://ogdl.org/), which is a'))
with the filename, line number, pre-match, match, and post-match each
tagged separately? Later versions of grep could add more information
to this without breaking compatibility, it wouldn't have to choose
between regular readable output and reliably parseable output, you
wouldn't have to remember that -o
is the option to only print out
the matched string, and your terminal could be configured to highlight
the ghit &
in a different color and hide the ghit fn
unless asked.
By feeding the output to a sort program you could sort the ghits by filename or by the text following the match or whatever.
In this case the same format would serve for TAGS files, whose records include byte positions, line numbers, excerpted text, and identifiers believed to be defined there, and are grouped into groups by file.
With a slightly different grep format that groups the ghits by filename, the terminal could also offer you the ability to collapse the results for a particular file; it is a common problem with grep that you have one humongous file that contains almost all your hits, all spurious.
This output format is a little more verbose but in this case the structure overhead is only 30%, even with the extra whitespace.
Above I mentioned using rose trees for data entry forms, but it occurred to me that command lines are also sort of form-like, and you could reasonably have forms for common command-lines that you tab around in.
The commands on OS/400 were similar to this:
RNMOBJ OBJ(OLDNAME) OBJTYPE(*FILE) NEWOBJ(NEWNAME)
In GNU syntax this would be
rnmobj --obj oldname --objtype file --newobj newname
though in fact GNU mv
takes positional parameters rather than named
ones. In the parenthesized-child-node syntax it looks almost the same
as OS/400:
rnmobj(obj(oldname) objtype(*file) newobj(newname))
but also if you hit F4 you would get an honest-to-God form to fill out with the command-line parameters in form fields, with help displayed for each field. However, the parameter names for the form fields would be hidden; you'd have to hit F9 after invoking the command to recall the one-line version of what you'd created in the form.
Many CGI scripts will provide you with an HTML form to invoke them with when invoked without arguments. Perhaps command-line programs should do the same thing, and the manual page should include many command forms to edit and invoke.
So maybe running grep
without arguments would spit out
grep(regexp(foo.*bar) files(*.c) ⌄)
with your cursor positioned in the regexp field, and if you tabbed
over to the ⌄
it would reveal the other grep options.
Large nested structured records are useful in tracking down operational problems or plotting metrics over time.
OGDL originated as an effort to simplify YAML, which is commonly used for configuration files. Rose trees seem like a reasonable data model for configuration files. And if your configuration files are parseable, you can query them or reliably update them in a program.
For some things, rose trees are probably the best data format. For my
simple term-rewriting language Qfitzah (see file
term-rewriting-micro-interpreter.md
in Dernocua) I wanted to input
my S-expression rewrite rules like this, without any parentheses
around the outermost list:
And x y: If x y x
This means "Rewrite a list whose first element is And
and with two
other elements to a list whose first element is If
and contains the
same two elements followed by the first one again." Then there were
rules for If Yes a b
and If No a b
, etc. Lines without a colon
represented S-expressions to try to rewrite with all the existing
rules rather than new rewrite rules.
The problem I ran into with this is that it was unclear how to interpret a line saying something like
Not Yes: No
The right-hand side is a single atom. The idea is that if you have at
some point a list saying something like If (Not Yes) foo bar
then it
should get rewritten to If No foo bar
. But I was getting If (No)
foo bar
, because if the outermost parentheses are omitted, there's no
syntactic way to distinguish the bare atom No
from the single-atom
list (No)
.
I gave up on the readable syntax above and went with
(Not Yes) No
but rose trees eliminate the problematic distinction: No()
is the
same as No
. Also I think this might be a more explicit syntax,
though it certainly isn't easier to read:
rewrite(And(?(x) ?(y))) to(If(?(x) ?(y) ?(x)))
For efficient dispatch in term-rewriting languages, it is probably valuable to require that the head of a clause being matched not be a variable. (This doesn't result in the same reduction of expressivity as in functional and imperative languages.) If your rewrite rules are expressed in terms of rose trees, the places you can put variables are the nodes of a tree, not its tags, so that restriction falls out naturally, as it does in Prolog.
For a similar reason, I suspect tuple spaces benefit from having a distinguished field for tags which must be filled when applying an in() or rd() operation; you don't want tuples you stick into a tuple space to get vacuumed up by some piece of code that is just matching all tuples. Almost invariably in the early tuple-space literature the first element of every tuple being matched is a literal string.