#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Nesting consistent-overhead byte stuffing.

The original Cheshire and Baker COBS algorithm encodes a byte string
that can include all 256 byte values into a byte string that is
guaranteed not to include byte 0, with a worst-case overhead of 1/254
(0.394% inflation), which is only about five or six times worse than
the information-theoretic bound of log(255)/log(254) - 1 = 0.071%.  In
the average case (with random data), its overhead is half that.
Moreover, COBS is very fast, and if the original data uses NUL bytes
as record delimiters, it permits relatively rapid skipping from one
variable-length record to the next, without iterating over all the
bytes in the record.  Neither the Cheshire and Baker paper nor
Cheshire’s dissertation gives quantitative measurements of
performance.

Both Unix and the once-popular Pick OS/database make extensive use of
variable-length records encoded using what Cheshire calls
“non-transparent in-band framing”, i.e., special byte values reserved
for the framing.  As one consequence, in Unix you cannot have a colon,
comma, or newline in your “full name” or “room number”, and you cannot
have a colon or newline in your home directory, username, or shell.

Cheshire’s dissertation mentions that “transparent non-stuffing
in-band” framing schemes, such as length-prefixing, impede
resynchronization after errors.  To this we can add that they also
impede resynchronization when seeking to the middle of a file (for
example, for a binary search as traditionally performed by look(1)) or
when beginning to receive a transmission after it has started (such as
a broadcast MPEG stream, a case mentioned later on in the
dissertation).

COBS, by contrast, is a transparent stuffing encoding.  You can take a
string, encode it in COBS to eliminate NUL bytes from it, and then use
a NUL byte for framing — for example, to terminate the string or to
indicate the start of a header that you prepend to the string.
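To make the transform concrete, here is a minimal sketch of vanilla
COBS (the function names and the greedy run-splitting are mine, not
from Cheshire and Baker): the encoder appends the virtual trailing
zero described below, and the decoder strips it again.

```python
def cobs_encode(data):
    """Encode data so the output contains no zero bytes (sketch)."""
    data = bytes(data) + b"\x00"     # virtual trailing zero
    out, i = bytearray(), 0
    while i < len(data):
        j = data.find(0, i)          # always found: the virtual zero
        while j - i >= 254:          # oversize run: full 0xff chunks
            out.append(0xFF)
            out += data[i:i + 254]
            i += 254
        out.append(j - i + 1)        # chunk encoding an implied zero
        out += data[i:j]
        i = j + 1
    return bytes(out)

def cobs_decode(enc):
    """Invert cobs_encode (sketch)."""
    out, i = bytearray(), 0
    while i < len(enc):
        n = enc[i]
        if n == 0:
            raise ValueError("zero byte inside COBS data")
        out += enc[i + 1:i + n]
        i += n
        if n != 0xFF and i < len(enc):
            out.append(0)            # implied zero, except the virtual one
    return bytes(out)

assert cobs_decode(cobs_encode(b"ab\x00c")) == b"ab\x00c"
assert 0 not in cobs_encode(bytes(range(256)))
```

Note that 254 nonzero input bytes encode to 256 bytes (a 0xff chunk
plus a one-byte chunk for the virtual zero), which is the worst case.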
Then you can concatenate multiple such strings, and if you desire, you
can run the result through COBS again and repeat the process
losslessly, with multiple layers of framing, at the cost of a small
exponential inflation.  However, to reverse the process, you must know
a priori how many levels of COBS to unstuff in each case.  (I say
“small exponential”: COBS provokes its own worst case when repeatedly
encoding a large chunk of data, and roughly every 176 levels of
encoding, the data doubles in size.  It takes 24 levels of encoding to
increase in size by 10%.)

This code focuses on an application of iterated COBS to support nested
data structures that are efficiently encodable and efficiently
navigable.

Vanilla COBS works by appending a virtual zero byte to the end of the
data to encode, then encoding it in chunks with the following header
bytes:

| Header byte  | followed by    | encodes                    |
| n: 0x01–0xfe | n-1 data bytes | those n-1 data bytes       |
|              |                | followed by byte 0x00      |
| 0xff         | 254 data bytes | those 254 data bytes *not* |
|              |                | followed by byte 0x00      |
| 0x00         |                | illegal                    |

Nested COBS is not merely an encoding of byte streams; it is an
encoding performing a job similar to bencode, BER, or JSON.  It uses a
possibly iterated variant of the COBS encoding with an additional
header byte inside each NUL-terminated frame, at its beginning.
Specifically, nested COBS encodes a byte string, an integer, a dict, a
list, true, false, or null; this differs from JSON in that the strings
are byte strings and the numbers are integers.  These types are
indicated by the prefix bytes s, i, d, a, t, f, and n, which are
followed by their contents.
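The doubling figures in the parenthesis above follow from the
worst-case growth factor of 255/254 per encoding level, so the number
of levels needed for a given growth factor is a simple logarithm; a
quick check:

```python
import math

per_level = math.log(255 / 254)   # worst-case growth per COBS level
print(math.log(2) / per_level)    # ≈ 176.4 levels to double in size
print(math.log(1.1) / per_level)  # ≈ 24.3 levels to grow by 10%
```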
The contents of true, false, and null are empty; the contents of a
byte string are the bytes of the string; the contents of a list are a
sequence of nested COBS values representing the items in the list; the
contents of an integer are an arbitrary-length big-endian binary
two’s-complement representation of the integer; and the contents of a
dict are somewhat more complicated.

The contents of a dict are a sequence of COBS strings which are *not*
nested COBS values; instead, they are concatenations of a COBS string,
representing the dict key, and a nested COBS value.  Thus the key (as
in bencode and JSON, though not in Python’s data model) must be a
string; the value can be any kind of value that nested COBS can
represent.  This involves additional overhead (typically one byte per
name-value pair) over the simpler expedient taken by bencode of simply
alternating between names and values, but it permits reasonably
efficient binary search on the dict representation.

There’s an “extended COBS” that permits larger chunk sizes:

| Header byte  | followed by      | encodes                    |
| n: 0x01–0xfd | n-1 data bytes   | those n-1 data bytes       |
|              |                  | followed by byte 0x00      |
| 0xfe         | 253 data bytes   | those 253 data bytes *not* |
|              |                  | followed by byte 0x00      |
| 0xff         | 2 length bytes   | those n+254 data bytes     |
|              | encoding n, then | *not* followed by byte     |
|              | n+254 data bytes | 0x00                       |
| 0x00         |                  | illegal                    |

This increases the worst-case overhead from 0.394% to 0.395%, but
permits the use of chunks of up to 65790 bytes.  This permits
efficient skipping of large chunks of data that is known to be
uninteresting because it does not contain a framing NUL byte.  For
example, in the dict encoding above, if you have a key associated with
a value encoded in one mebibyte that is not the key you are looking
for, the encoded value is guaranteed not to contain a NUL byte, so you
can skip to the next key by examining 16 such three-byte headers, a
total of 48 bytes, rather than all 1048624 bytes.
TODO: does the above vitiate the property of seekability in the dict?
To seek in the dict you need to find the nearest framing byte.

Alternatively: rather than using single zero bytes as framing
delimiters, use (possibly unaligned) *pairs* of zero bytes.  Then you
can use a (possibly unaligned) two-byte header all the time, and you
get two bytes of overhead per 65535 bytes in the worst case (0.003%).
This has potentially crushing overhead if you are concatenating a lot
of short items (one-byte or two-byte integers, say, or trues and
falses in the above encoding), but maybe in that case you can use some
of the header-byte namespace for short fields.  For example:

| n: 0x01–0x7f | n-1 data bytes      | those n-1 bytes            |
|              |                     | followed by two 0x00 bytes |
| n: 0x80–0xff | a second count byte | the data bytes, followed   |
|              | m and then data of  | by two 0x00 bytes,         |
|              | ((n&0x7f) << 8) + m | except if the total count  |
|              | + 0x7d bytes        | field was 0xff 0xff        |

This allows you to logically use a pair of NUL bytes as your framing
delimiter, but for data of 126 bytes or less, the actual overhead in
storage will be only one byte.  The encoded chunks are limited to
0x7fff + 0x7d = 32892 bytes, but that still allows you to leap across
a mebibyte in only 32 jumps.  Your worst-case overhead is 0.0061%, and
in the usual case (where the input chunks are less than 32K) this
doesn’t cost you anything on decoding.  And it achieves roughly the
same compression as Cheshire’s COBS/ZPE (in the cases where COBS/ZPE
achieves compression) at the cost of requiring a longer framing
sequence.  (That is, it doesn’t eliminate zero bytes!)

This is similar to the COWS Cheshire proposes in his dissertation, but
he was considering the case where the atomic units of communication
are wider words, not bytes.
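The figures for this two-zero-byte variant can be checked directly,
assuming (as above) that the two count bytes are the only per-chunk
overhead in the worst case:

```python
import math

chunk_max = 0x7fff + 0x7d              # largest encoded chunk
print(chunk_max)                       # 32892
print(math.ceil(1048576 / chunk_max))  # 32 jumps across a mebibyte
print(100 * 2 / chunk_max)             # worst-case overhead, ≈ 0.0061%
```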
So his COWS doesn’t consider the question of zero bytes at the
beginning or end of the count field, which could happen to abut zero
bytes in the preceding or following data, producing false framing
sequences; for the same reason, he considers the count field to count
words rather than bytes.

I’m tempted to allocate another bit in the second count byte as a
“nonleaf” indicator and remove the 0x7d excess in that case, just so
that the determination of whether a nested COBS chunk contains just
plain bytes or chunks wouldn’t depend on some more fragile
higher-level convention.  This temptation is bad, for two reasons.
One is that the nonleaf frame may be (perhaps usually will be) spanned
across several COBS chunks, so this information belongs in the frame,
not in the chunk headers.  The other is that it would halve the length
of unterminated chunks again, doubling the cost of striding across
mebibytes.

Intuitively, we should expect two bytes of framing overhead per “line
of text” to be a reasonable overhead, because that’s what Microsoft
Windows, MS-DOS, CP/M, RT-11, and VAX/VMS used; they just used 0x0d
0x0a at the end of the line, rather than a length field at the
beginning.

Another design alternative is to use length *trailers* rather than
length headers.  This has the advantage that none of the data bytes
ever need to be buffered up in memory, and all the writing is
sequential.  It has the disadvantage that reading must start from the
end of the “frame” rather than the beginning.  This may actually be a
better fit for current Flash storage systems, which are reasonably
adept at random reads but cannot really handle random writes, because
they must erase in large units rather than individual sectors.

An interesting property of iterated COBS is that the “reliability” and
“locality” properties Cheshire was striving for do not survive in the
encoded data.
The top-level framing, with its zero bytes (or pairs of zero bytes in
the variation I suggested above), retains “reliability” (it is
guaranteed to resynchronize after an error) and “locality” (if a frame
and its immediately adjacent framing markers are received correctly,
then the frame will be received correctly).  But the data within it
has had its zero bytes transformed into arbitrary nonzero bytes, so it
does not have these properties, while “two-for-one” byte-escaping
systems like PPP’s do, though at the cost of rapid exponential growth
when iterated.  Moreover, the rapid navigability that attracts me to
the length-prefixed format is not present at this top level, because
the top level has no length fields.

"""