Binate: A very compact database query language based on binary relations

The theory of n-ary relations being so agreeably complete and satisfying as a basis for well-designed database languages, why would anybody nowadays think of elevating the binary relation in particular to some special position?” — Hugh Darwen, 2005

Here’s my problem, expressed in Prolog. I have two basic relations to start with:

Here’s an unfactored (and untested) Prolog solution, which is valid Datalog:

visits("9", "b", "4", Route), 
square("16", "c", "4", X, Y), 
visits(Page, Column, Row, Route), 
square(Page, Column, Row, X2, Y2),
X - 1 =< X2, X + 1 >= X2,  
Y - 1 =< Y2, Y + 1 >= Y2.

A nicely-factored Prolog solution looks like this. First, some auxiliary predicates neighbor/2, visits/2, and route_between/3:

near(A, B) :- A - 1 =< B, B =< A + 1.
neighbor((Page, Column, Row), (Page2, Column2, Row2)) :-
  square(Page,  Column,  Row,  X,  Y),
  square(Page2, Column2, Row2, X2, Y2),
  near(X, X2), near(Y, Y2).

visits((Page, Column, Row), Route) :- visits(Page, Column, Row, Route).
route_between(A, B, Route) :- visits(A, Route), visits(B, Route).

Then here’s the actual query:

route_between(("9", "b", "4"), Place, Route), neighbor(("16", "c", "4"), Place).

The Prolog data model is basically the relational data model: relations like visits/4 contain tuples, which are distinguished by their values.

A few years back I wrote about “mvfs” or “multivalued functions”, which if I had known anything about logic, I would have known were called “binary relations”. It turns out that they offer a somewhat simpler solution to this problem, which looks like this:

succ = (i; 1) sum.
near = succ ~=<, ~succ =<.
neighbor = x near ~x, y near ~y.
connected_by = first route, second route.
connected_to_neighbor_by = (first; second neighbor) connected_by.

And then the query is:

(("9", "b", "4"), ("16", "c", "4")).connected_to_neighbor_by

This is considerably more compact than the Prolog version, although I am not yet sure that it is clearer. The rest of this document explains the query language in some detail.

The SQL solution is monstrous

This is the SQL solution I started from, before running to the Prolog above:

select
  distinct a.route
from
  visits a
  join visits b using (route)
  join square c using (page, "column", "row"),
  square d
where
  a.page = '9' and a."column" = 'b' and a."row" = '4' and
  c.x between d.x-1 and d.x+1 and
  c.y between d.y-1 and d.y+1 and
  d.page = '16' and d."column" = 'c' and d."row" = '4';

It is possible to factor out the neighbor and route_between relations into separate SQL views, but the cost in code volume is horrendous, and it doesn't even help the original query much:

create view neighbor as
select
  a.page as page, a."column" as "column", a."row" as "row",
  b.page as bpage, b."column" as bcolumn, b."row" as brow
from
  square a, square b
where
  a.x between b.x-1 and b.x+1
  and a.y between b.y-1 and b.y+1;

create view routes_between as
select
  a.page as page, a."column" as "column", a."row" as "row",
  b.page as bpage, b."column" as bcolumn, b."row" as brow,
  route
from
  visits a join visits b using (route);

select
  distinct a.route
from
  routes_between a
  join neighbor c using (page, "column", "row")
where
  a.bpage = '9' and a.bcolumn = 'b' and a.brow = '4' and
  c.bpage = '16' and c."bcolumn" = 'c' and c."brow" = '4';

So I went looking for the notation I needed to make this query simpler, which I have presented an example of in the previous section; it allows me to do everything in this huge glob of SQL in just a few short lines. Here’s a more complete explanation.

An algebra of binary relations that expresses this query more tersely

Here I explain in three parallel notations what I am talking about:

Common examples of binary relations include “is greater than”, “is a parent of”, “is the cosine of”, and “is the first name of”.

A binary relation has a “source” domain (sometimes called just its “domain”) and a “target” domain (sometimes called its “codomain”). In the case of "is the first name of", the source domain is names and the target domain is people. In the case that the binary relation is a function, the source domain is the function’s domain, and the target domain is the function’s range. The statement that the relation R holds between the source-domain element a and the target-domain element b, is normally written (e.g. in An Algebra of Programming) b R a. For any pair of elements (a, b) from the relevant domains, b R a is either true or false. In Prolog notation, I would normally write this r(a, b), but in order to make my Prolog definitions below make sense, I will write it rel(r, a, b).. Note that the order of a and b is reversed here from the mathematical notation.

You can think of the relation as a set of ordered pairs (a, b) where a is an element of the source domain and b is an element of the target domain.

You can also think of a relation as a multivalued attribute: b R a means that among the Rs of a is b; b is an R of a. For example, barackObama parent sashaObama means that barackObama is a parent of sashaObama. Under this interpretation, the relation represents a set-valued property of every object in its source domain.

In my ASCII syntax, I will write barackObama parent sashaObama as sashaObama.parent = barackObama. This may be misleading, because it is simultaneously true that sashaObama.parent = michelleObama. If you have a better notation, write me.

I’m going to use four fundamental operations to build up binary relations from simpler ones: composition, converse (which I used to call “inverse”), intersection (normally called “meet”), and relational product (which I used to call “join”).

Composition

The composition of the operations R and S is normally written R • S. b (R • S) c iff ∃a: b R a ∧ a S c. In Prolog, the arguments are reversed:

 rel(compose(S, R), C, B) :- rel(S, C, A), rel(R, A, B).

So, for example, if it is the case that "Barack" firstName barackObama and barackObama parent sashaObama, then it is the case that "Barack" (firstName • parent) sashaObama. In Prolog, because:

rel(firstName, barackObama, "Barack").
rel(parent, sashaObama, barackObama).

we can conclude:

rel(compose(parent, firstName), sashaObama, "Barack").

In my ASCII notation, this relation is written simply with concatenation, just as parent firstName, so sashaObama.(parent firstName) = "Barack". We can think of parent firstName as being a binary relation containing just the pair (sashaObama, "Barack"), given only the pairs listed above in the firstName and parent relations.

Converse

The converse of a relation R is written R°; a R° b iff b R a. For functions, this is the inverse of the function, but the operation is closed on binary relations — the converse of a binary relation is always a binary relation — while it’s not closed on functions, because the inverse of a function may be a binary relation that isn’t a function.

In Prolog:

rel(converse(R), B, A) :- rel(R, A, B).

In my ASCII notation, I write it with ~. Given the parent relation mentioned earlier, we can write the child relation as ~parent. If these are true:

sashaObama.parent = barackObama.
sashaObama.parent = michelleObama.

Then these are also true:

barackObama.~parent = sashaObama.
michelleObama.~parent = sashaObama.

If we use = to give names to relations as well, we can write:

sashaObama.parent = barackObama.
sashaObama.parent = michelleObama.
child = ~parent.

Then these are true:

barackObama.child = sashaObama.
michelleObama.child = sashaObama.

This also allows us to define the “sibling” relation:

sibling = parent ~parent.

This is syntactic sugar for sibling = parent° • parent or rel(sibling, A, B) :- rel(compose(parent, converse(parent)), A, B).

Intersection (meet)

The intersection of R and S, sometimes known as the “meet” of R and S (since binary relations form a lattice), is written R ∩ S, and contains just the pairs that are in both R and S: a (R ∩ S) b iff a R b ∧ a S b.

In Prolog, we can write:

rel(intersect(R, S), B, A) :- rel(R, B, A), rel(S, B, A).

In my ASCII notation, I write intersection with a comma, analogously to Prolog: R, S.

Suppose we have an older relation such that X.older = Y whenever Y is older than X. Given intersection, we can define an olderSibling relation:

olderSibling = sibling, older.

(Or, equivalently, older, sibling.)

This definition means olderSibling = sibling ∩ older or rel(olderSibling, B, A) :- rel(sibling, B, A), rel(older, B, A).

So if we know

sashaObama.older = maliaAnnObama.
maliaAnnObama.parent = barackObama.
sashaObama.parent = barackObama.

we can conclude

sashaObama.olderSibling = maliaAnnObama.

which is an abbreviation for

sashaObama.(parent ~parent, older) = maliaAnnObama.

(I’m considering the possibility that I should write sibling, older as sibling == older or sibling & older instead.)

Relational product

The fourth major operation on binary relations is the “relational product”, traditionally written with angle brackets and commas. This is an N-ary operation on any number of binary relations, and it constructs a relation whose target domain consists of tuples.

It is defined (for the two-element case) as (rb, sb) <R, S> b iff rb R b ∧ sb S b. In Prolog, for the general case:

rel(product([]), _, []).
rel(product([R|Rs]), B, [RB|RsB]) :- 
    rel(R, B, RB), rel(product(Rs), B, RsB).

In my ASCII notation I will write it with ;, as R; S, with enclosing parentheses to disambiguate when necessary. As with any N-ary operator, this is potentially confusing: (R; S; T) defines a relation whose target domain is 3-tuples, while ((R; S); T), which looks like it ought to be the same thing, defines a relation whose target domain is 2-tuples. The same problem is confronted by function-call syntax in Algol-family programming languages, in which f((x1, y1), (x2, y2)) is (if meaningful) different from f(x1, y1, x2, y2).

The sibling operator defined earlier is fairly non-specific, also including half-siblings. For example, given the facts:

barackObama.parent = annDunham.
barackObama.parent = barackObamaSr.
mayaSoetoroNg.parent = annDunham.
mayaSoetoroNg.parent = loloSoetoro.

our previous sibling relation, defined as parent ~parent, will report that

barackObama.sibling = mayaSoetoroNg.
mayaSoetoroNg.sibling = barackObama.

If we want a stricter definition of sibling that only connects people who share both a father and a mother, we can do it as follows:

parents = father; mother.
sibling = parents ~parents.

which is syntactic sugar for the Prolog:

rel(parents, B, A) :- rel(product([father, mother], B, A).
rel(sibling, B, A) :- rel(compose(parents, converse(parents)), B, A).

or the traditional-notation parents = <father, mother> and sibling = parents° • parents.

If we then write:

barackObama.mother = annDunham.
barackObama.father = barackObamaSr.
mayaSoetoroNg.mother = annDunham.
mayaSoetoroNg.father = loloSoetoro.

then we will find

barackObama.parents = (barackObamaSr, annDunham).
mayaSoetoroNg.parents = (loloSoetoro, annDunham).

(I’m skipping over a couple of pitfalls here. If you want to define parent in terms of father and mother you need union, which I’m not describing in this document, and if you want to define father and mother in terms of parent and some gender relation, you need a way to introduce constants into expressions defining binary relations, which I haven’t gotten to yet.)

This relational product corresponds in some sense to an equijoin in the relational algebra, with the join key being the source domain of both relations.

Originally I was going to write the relational product as [R; S; T], but Stefan Ljungstrand told me that was confusingly similar to the notation for categorical coproduct [R, S, T], which is sort of the opposite.

Parsing

I’ve defined the ASCII syntax pretty informally so far, and I don’t intend to stop now. But I thought I should list a couple of notes about the grammar.

Parsing precedence is converse, composition, intersection, then relational product; so the expression a, ~b c; d is parsed as (a, ((~b) c); d.

= has two different meanings, one when defining relations and one when listing particular facts. The syntax for listing a particular fact is A.B = C., where A and C are values, and B is a binary relation. Commas have meanings inside of both values and binary relations, but the meanings are unrelated.

Primitive relations for arithmetic

To formulate my example query above, I also need a way to do some arithmetic. I’ll define relations called sum and =< as follows. Sum is defined as a + b sum (a, b), or

(a, b).sum = a + b.

or

rel(sum, [A, B], C) :- var(A), not(var(B)), not(var(C)), A is C - B.
rel(sum, [A, B], C) :- not(var(A)), var(B), not(var(C)), B is C - A.
rel(sum, [A, B], C) :- not(var(A)), not(var(B)), var(C), C is A + B.
rel(sum, [A, B], C) :- not(var(A)), not(var(B)), not(var(C)), C is A + B.

=< is defined as follows: b =< a iff a ≤ b (note that the direction there is confusing!), or

rel(=<, A, B) :- A =< B.

or a.=< = b whenever ab. (Note that this kind of hairifies the informally defined grammar of the query language.)

Finally, I’ll decree that a numeric constant such as 3 will be treated as a relation that maps anything to that constant: 3 3 x or x.3 = 3 for any numeric constant 3 and any value x:

rel(N, _, N) :- number(N).

and define a primitive universal identity, or equality, relation i:

x.i = x.

or

rel(i, X, X).

(maybe this should be called self?) and define first and second predicates for destructuring ordered pairs:

(a, b).first = a.
(a, b).second = b.

or

rel(first, [A, _], A).
rel(second, [_, B], B).

The Whole Prolog Definition

Here is the definition of the basic operators and primitive relations, aggregated from the above sections:

:- multifile rel/3.  % because there are relational facts to add later
rel(compose(S, R), C, B) :- rel(S, C, A), rel(R, A, B).
rel(converse(R), B, A) :- rel(R, A, B).
rel(intersect(R, S), B, A) :- rel(R, B, A), rel(S, B, A).
rel(product([]), _, []).
rel(product([R|Rs]), B, [RB|RsB]) :- 
    rel(R, B, RB), rel(product(Rs), B, RsB).

rel(sum, [A, B], C) :- var(A), not(var(B)), not(var(C)), A is C - B.
rel(sum, [A, B], C) :- not(var(A)), var(B), not(var(C)), B is C - A.
rel(sum, [A, B], C) :- not(var(A)), not(var(B)), var(C), C is A + B.
rel(sum, [A, B], C) :- not(var(A)), not(var(B)), not(var(C)), C is A + B.
rel(=<, A, B) :- A =< B.
rel(N, _, N) :- number(N).
rel(i, X, X).
rel(first, [A, _], A).
rel(second, [_, B], B).

And here is a simple application, aggregated from the examples in the above sections:

% Note that in Prolog we get union for free!
rel(parent, A, B) :- rel(father, A, B).
rel(parent, A, B) :- rel(mother, A, B).
rel(parents, B, A) :- rel(product([father, mother]), B, A).
rel(sibling, B, A) :- rel(compose(parents, converse(parents)), B, A).
rel(child, B, A) :- rel(converse(parent), B, A).
rel(atLeastHalfSibling, B, A) :- rel(compose(parent, child), B, A).
rel(olderSibling, B, A) :- rel(intersect(sibling, older), B, A).
rel(firstName, barackObama, "Barack").
rel(father, sashaObama, barackObama).
rel(mother, sashaObama, michelleObama).
rel(older, sashaObama, maliaAnnObama).
rel(father, maliaAnnObama, barackObama).
rel(mother, maliaAnnObama, michelleObama).
rel(father, barackObama, barackObamaSr).
rel(mother, barackObama, annDunham).
rel(father, mayaSoetoroNg, loloSoetoro).
rel(mother, mayaSoetoroNg, annDunham).

Here are some queries:

rel(parent, maliaAnnObama, X).
rel(parents, maliaAnnObama, X).
rel(sibling, maliaAnnObama, X).  % note: by our definition she's her own sib
rel(atLeastHalfSibling, maliaAnnObama, X).
rel(olderSibling, sashaObama, X).
rel(olderSibling, maliaAnnObama, X).
rel(atLeastHalfSibling, barackObama, X).
rel(sibling, barackObama, X).
rel(compose(parent, firstName), sashaObama, X).
% note that i; parent; parent parent includes untoward pairings.
rel(product([i, parent, compose(parent, parent)]), sashaObama, X).
% to solve the problem, (i; parent) (first; second; second parent)
rel(compose(product([i, parent]), 
            product([first, second, compose(second, parent)])),
    sashaObama, X).

As I said earlier, though, this only handles finite relations correctly; if you start messing around with sum and =< you’re likely to come up with queries that are perfectly legitimate but that this naïve implementation cannot handle.

What’s missing

I’ve left out union, transitive closure, and any form of negation, whether infinite negation or constructive negation by means of subtraction. Also, it seems likely that aggregation operations (e.g. average, max, min) would add a lot of power, and they would be simple to add.

Division is missing but I’m not sure what it’s good for.

Solving the original problem in terms of these operations

Suppose we decompose the visits and square relations into nine binary relations: visits_page, visits_column, visits_row, visits_route, square_page, square_column, square_row, square_x, and square_y. These are functions from some abstract set of records to concrete values like "16", "c", and "41".

However, we don’t need to restrict ourselves to functions; we can go beyond SQL (and, in a way, outside of first normal form) and use many-to-many relations directly. So instead of dealing with the distinct visits and square tables, we will write our solution in terms of x, y, and route relations from (page, column, row) triples to coordinates and route numbers. The route relation is many-to-many — that is, it may map the same location to any number of route numbers. The others are still functions.

The first thing to do is to define these three relations in terms of the original ones. This step would be unnecessary with a database design constructed with binary relations in mind, but it’s useful to see that this conversion step can be accomplished very easily inside this algebra:

squarekey = ~(square_page; square_column; square_row).
x = squarekey square_x.
y = squarekey square_y.
route = ~(visits_page; visits_column; visits_row) visits_route.

So squarekey is a relation from (page, column, row) triples to “rows” of the square relation; composing it with square_x and square_y gives us relations from those triples to X and Y coordinates.
route is analogous for routes.

Now let’s define a “numerically-near” relation near:

succ = (i; 1) sum.
near = succ ~=<, ~succ =<.

If you expand that last one out into Prolog:

rel(near, A, B) :- rel(succ, A, Z), B =< Z, rel(succ, X, A), X =< B.

In other words, B near A iff B ≤ A+1 ∧ A-1 ≤ B.

That’s enough to write neighbor:

neighbor = x near ~x, y near ~y.

Now we would like to write route_between, but I will call it connected_by:

connected_by = first route, second route.

Now we can abstract the general pattern of the query into a connected_to_neighbor_by:

connected_to_neighbor_by = (first; second neighbor) connected_by.

So the whole program is:

squarekey = ~(square_page; square_column; square_row).
x = squarekey square_x.
y = squarekey square_y.
route = ~(visits_page; visits_column; visits_row) visits_route.
succ = (i; 1) sum.
near = succ ~=<, ~succ =<.
neighbor = x near ~x, y near ~y.
connected_by = first route, second route.
connected_to_neighbor_by = (first; second neighbor) connected_by.

It’s slightly shorter than the Prolog version. If we assume that the original data is in a suitable form of binary relations, instead of requiring us to restructure it into binary relations, we get:

succ = (i; 1) sum.
near = succ ~=<, ~succ =<.
neighbor = x near ~x, y near ~y.
connected_by = first route, second route.
connected_to_neighbor_by = (first; second neighbor) connected_by.

which is much shorter than the Prolog version.

I think it is probably better not to use positional arguments; not only do they complicate the base language with first, second (third, etc.) relations, those relations are infinite, and queries using them tend to be a little opaque. The cost is that the relational product operation needs names to attach to the ranges of its operand relations. With this change, the program looks like this:

squarekey = ~{page: square_page; column: square_column; row: square_row}.
x = squarekey square_x.
y = squarekey square_y.
route = ~{page: visits_page; column: visits_column; row: visits_row}
        visits_route.
succ = {addend1: i; addend2: 1} sum.

near = succ ~=<, ~succ =<.
neighbor = x near ~x, y near ~y.
connected_by = origin route, destination route.
connected_to_neighbor_by = {origin: origin; 
                            destination: destination neighbor} connected_by.

This algebra is equivalent to the relational algebra

My original query expressed in the relational algebra looks like this, more or less:

route_between ← σ_(start.route = end.route)(ρ_start(visits) × ρ_end(visits))
squarepairs   ← ρ_a(square) × ρ_b(square)
neighbor      ← σ_(a.x-1 ≤ b.x ≤ a.x+1 ∧ a.y-1 ≤ b.y ≤ a.y+1)(squarepairs)
rtn ← σ_(end.page = a.page ∧ end.row = a.row ∧ end.column = a.column)( \
         route_between × neighbor)
out ← σ_(start.page = '9' ∧ start.column = 'b' ∧ start.row = '4' ∧
         b.page = '16' ∧ b.column = 'c' ∧ b.row = '4')(rtn)
Π_route(out)

I think that in this form, this algebra (if that’s what it is) on binary relations is equivalently powerful to the select/project/cartesian-product SPC relational algebra, and substantially briefer. (There’s no way to do unions, so to equal the SPCU algebra you need an additional operation.)

“Project” is carried out by combining the columns you want using the relational product operation. For example, if you have some row objects that have many-to-one relations a, b, c, and d defined on them, and you want a corresponding set of tuples with just the a, b, and d values, you can do this:

a; b; d

“Select” is a little hairy because it takes arbitrary Boolean expressions; but generally they take the form of equality tests between columns, or between columns and constants. This is done with the intersection operator combined with the inversion of some columns. To select rows where a = 2 and b = 3, you apply this relation to any input, or something:

2 ~a, 3 ~b

This gives you a relation from a source domain of everything to a target domain consisting of just the rows you want. If you then want a relation from those rows to, say, column c, you can filter down the source domain to just the rows themselves using the i relation, and then compose that with your existing c:

(2 ~a, 3 ~b, i) c

To select rows where a == b, you also need to use the identity relation. b ~a gives you all the pairs of rows where one row’s a matches the other row’s b; intersecting that with i gives you just the cases where the two rows happen to be the same row:

b ~a, i

These can be intersected to form more complex join conditions:

2 ~a, 3 ~b, b ~c, i

A Cartesian product can be constructed by inverting some constant and composing it with some relation that returns the right set of rows, then joining that with the same constant mapped to some other table:

~37 ~square_page; ~37 ~visits_page

This produces a mapping from the constant 37 to all pairs of rows from square and visits.

With the exception of tuple handling, the composition, intersection, and inversion operations are trivial to translate into the relational algebra on a bunch of two-column relations. Here I use s for the source column and t for the target column.

a b → Π_(a.s, b.t) σ_(a.t = b.s) a × b
a, b → Π_(a.s, a.t) σ_(a.s = b.s, a.t = b.t) a × b
~a → ρ(s→t, t→s) a

In the absence of binary relations containing tuples with varying numbers of members, you can deal with the tuple handling by making the relations have more than two columns. For example, you can handle a two-relation product this way:

a; b → σ_(a.s = b.s) a × b

Then you have to keep track of which columns represent the "source" and "target" of your new, wider relation, in order to apply the trivial translations above for composition, intersection, and inversion.

Adding set subtraction to either the relational algebra or this algebra of binary relations means that the other one cannot model it until the corresponding operation is added; and similarly for union, although if you allow unions of unlike data, such as tuples of different lengths, you will have trouble with the translation to the relational algebra.

Differences from my “mvfs” idea of 2003

Other similar things

XPath. XQL. UnQL. Tree regular expressions. Django’s notation for traversing foreign-key relationships. My “object-oriented equational rewrite rules” thing. Backus’ FP.

The book An Algebra Of Programming has a whole bunch of stuff about binary relations in it. I should read it!

Claudia Mónica Necco (Argentine D.N.I. Nº 16.051.159) is working on reasoning in terms of binary relations to prove correctness properties of programs. She has published a paper on using them for extended static checking of programs that interact with databases.

“A relational data language with simplified binary relation handling capability”, Kambayashi, Tanaka, and Yajima, VLDB 1977? I don’t have a copy, and it looks like nobody has ever cited it, but it sounds very similar indeed!

RDF, SPARQL, and cwm are based on binary relations. But SPARQL (“emerging as the de-facto RDF query language,” says Wikipedia) seems to be dramatically more verbose than relational query langauges rather than less. cwm is hardly better; in N3 we have:

@forAll :child, :parent. {:child gc:parent :parent. :parent :gender :M} log:implies {:child gc:father :parent} .

rather than:

gc:father = gc:parent, "M" ~gender.

There’s a lovely comparison of RDF query languages from 2004, covering RQL, SeRQL, TRIPLE, RDQL, N3, and Versa (but not SPARQL!). It uses as its example query a query that this calculus of binary relations cannot express directly, because it asks if the RDF graph contains an edge of any type whatsoever to another node.

Binary Relations Approach to building Object Database Model”, Andrew Girow, in Object Currents Vol. 1, Issue 11, SIGS Publications, NY, 1996? He doesn’t seem to have taken the idea very far.

Implementation of a Graph Oriented Query Language: IUGQL” is Indiana University Tech Report 376, by Sarathy and Van Gucht, 1993. They took a somewhat similar approach, but their query language appears to suck (the first example is on p.27).

SnikiSniki is a simple end-user-oriented database built out of RDF-like triples.

TODO

Another Example

From XQueryX, one of the XQuery Use Cases.

Sample XQuery query:

<bib>
 {
  for $b in doc("http://bstore1.example.com/bib.xml")/bib/book
  where $b/publisher = "Addison-Wesley" and $b/@year > 1991
  return
    <book year="{ $b/@year }">
     { $b/title }
    </book>
 }
</bib>

(I’m omitting the 150-line monstrosity that is the XQueryX version.)

I think my version would look something like this:

books = "http://bstore1.example.com/bib.xml" GET <bib> <book>.
wanted = (books, "Addison-Wesley" ~$ ~<publisher>, 1991 < ~# ~@year).
{<bib>: {<book>: wanted {@year: @year; $: <title>}}}

This makes the following assumptions about the already-defined names:

There is one crucial difference from the XQuery results: the results are set-oriented, rather than a sequential thing like an XML document. The results are here represented using the same relations as in the input, and so could be serialized as XML, but they are more general than any particular XML serialization of their contents.

And there are more use cases where that came from.

Kombayashi’s 1977 work did not significantly anticipate this

In 1977, Yahiko KOMBAYASHI, Katsumi TANAKA, and Shuzo YAJIMA published a paper in VLDB entitled, “A relational data language with simplified binary relation handling capability”. They describe a query language they have been developing that has special features for handling binary relations, including composition (“concatenation”), transitive closure, and intersection (“boolean operators”), but not including converse and relational product.

Unfortunately, the query language they present owes a lot to SQL (called, at the time, SEQUEL); it is not an expression language, and its data model is very much the standard N-ary relational data model.

Here is the first sample RLB query from their paper:

     E1  Find all titles(TI) of  documents  written  by
author(AU)’A’.

    GET DOCUMENT.TI
    WHERE DOCUMENT.DID# = AUTHOR.DID#
      AND AUTHOR.AU = ’A’.

This is essentially the same level of verbosity as SQL. The AUTHOR relation is somewhat misleadingly named; it is essentially a join table created to support the many-to-many relationship between documents and authors, which is why it has a DID# column. (However, in Kombayashi’s schema, there is no actual table of authors to join with.)

In the language presented in the present note, a literal translation of the above produces:

'A' ~au did# ~did# ti.

However, in the world of binary relations, there’s no need for a separate join table with au and did# attributes; the “author” attribute of a document can simply be a many-to-many relation directly. This simplifies the query to:

'A' ~author title.

(I have taken the liberty of using longer attribute names.)

Here's a definition of a new binary relation from the same paper:

     E3 Define a binary relation between document  IDs,
        each  of  which  cites  at  least  one document
        commonly.

     RANGE OF X,Y ARE REFERENCE
     DEFINE BINARY SIBLING
          DID1# = X.DID#, DID2# = Y.DID#
     WHERE X.DIDF# = Y.DIDF#
     CONSTRAINT SEMIEQ

This defines a new binary relation called SIBLING. In the binary relation language defined in the present note, this can be written more briefly:

sibling = ~did# didf# ~didf# did#.

One procedural reading of this is as follows: given a document ID number, first find the document it is the did# of; then find the didf#s of that document (which I suppose are the IDs of the documents it cites); then find all the documents that have one of those among their didf#s; then find the did#s of those documents.

This lacks the explicit information about the types of X and Y in Kombayashi’s query; in a statically typed implementation of this language, you could infer this from the types of the did# and didf# relations (which would have to be compatible), and in fact would have to do so in the type-checking process; but without some kind of type annotation, you wouldn’t have a way to verify that the inferred type was what you expected.

It also lacks the constraint given in that the new relation be a “semi-equivalence” relation, that is to say that it is reflexive and symmetric, but not necessarily transitive. Proving that it is symmetric is pretty simple:

sibling = ~did# didf# ~didf# did#.           (definition of sibling)
sibling = ~~(~did# didf# ~didf# did#).       (x = ~~x)
sibling = ~(~did# ~~didf# ~didf# ~~did#).    (~(x y) = ~y ~x)
sibling = ~(~did# didf# ~didf# did#).        (x = ~~x)
sibling = ~sibling.                          (definition of sibling)

Proving that it is reflexive is somewhat more difficult, since it’s only reflexive over the range of document IDs that actually correspond to documents that reference at least one other document, and our notation doesn’t include a way to express even that one binary relation is a subset of another.

It isn't clear to me whether the constraint and range declarations in Kombayashi are intended for optimization or for error-checking — that is, whether the query processor was expected to assume that they were correct or to verify that they were correct. Perhaps I will ask Dr. Tanaka.

In any case, this relation, between documents rather than document-IDs, is probably more useful:

sibling2 = didf# ~didf#.

Credits

I developed a lot of these ideas in discussion with Darius Bacon.

Many thanks to Stefan Ljungstrand (ski on Freenode) for his help and explanations of many matters related to logic, binary relations, category theory, and notation.

Written by Kragen Javier Sitaker.