#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Nesting consistent-overhead byte stuffing.

The original Cheshire and Baker COBS algorithm encodes a byte string
that can include all 256 byte values into a byte string that is
guaranteed not to include byte 0, with a worst-case overhead of 1/254
(0.394% inflation), which is only about five or six times worse than
the information-theoretic bound of log(255)/log(254) - 1 = 0.071%.  In
the average case (with random data), its overhead is half that.
Moreover, COBS is very fast, and if the original data uses NUL bytes
as record delimiters, it permits relatively rapid skipping from one
variable-length record to the next, without iterating over all the
bytes in the record.  Neither the Cheshire and Baker paper nor
Cheshire’s dissertation gives quantitative measurements of
performance.

Both Unix and the once-popular Pick OS/database make extensive use of
variable-length records encoded using what Cheshire calls
“non-transparent in-band framing”, i.e., special byte values reserved
for the framing.  As one consequence, in Unix you cannot have a colon,
comma, or newline in your “full name” or “room number”, and you cannot
have a colon or newline in your home directory, username, or shell.

Cheshire’s dissertation mentions that “transparent non-stuffing
in-band” framing schemes, such as length-prefixing, impede
resynchronization after errors.  To this we can add that they also
impede resynchronization when seeking to the middle of a file (for
example, for a binary search as traditionally performed by look(1)) or
when beginning to receive a transmission after it has started (such as
a broadcast MPEG stream, a case mentioned later on in the
dissertation).

COBS, by contrast, is a transparent stuffing encoding.  You can take a
string, encode it in COBS to eliminate NUL bytes from it, and then use
a NUL byte for framing — for example, to terminate the string or to
indicate the start of a header that you prepend to the string.
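To make the transform concrete, here is a minimal sketch of vanilla
COBS (the function names and the greedy run-splitting are mine, not
from Cheshire and Baker): the encoder appends the virtual trailing
zero described below, and the decoder strips it again.

```python
def cobs_encode(data):
    """Encode data so the output contains no zero bytes (sketch)."""
    data = bytes(data) + b"\x00"     # virtual trailing zero
    out, i = bytearray(), 0
    while i < len(data):
        j = data.find(0, i)          # always found: the virtual zero
        while j - i >= 254:          # oversize run: full 0xff chunks
            out.append(0xFF)
            out += data[i:i + 254]
            i += 254
        out.append(j - i + 1)        # chunk encoding an implied zero
        out += data[i:j]
        i = j + 1
    return bytes(out)

def cobs_decode(enc):
    """Invert cobs_encode (sketch)."""
    out, i = bytearray(), 0
    while i < len(enc):
        n = enc[i]
        if n == 0:
            raise ValueError("zero byte inside COBS data")
        out += enc[i + 1:i + n]
        i += n
        if n != 0xFF and i < len(enc):
            out.append(0)            # implied zero, except the virtual one
    return bytes(out)

assert cobs_decode(cobs_encode(b"ab\x00c")) == b"ab\x00c"
assert 0 not in cobs_encode(bytes(range(256)))
```

Note that 254 nonzero input bytes encode to 256 bytes (a 0xff chunk
plus a one-byte chunk for the virtual zero), which is the worst case.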
Then you can concatenate multiple such strings, and if you desire, you
can run the result through COBS again and repeat the process
losslessly, with multiple layers of framing, at the cost of a small
exponential inflation.  However, to reverse the process, you must know
a priori how many levels of COBS to unstuff in each case.  (I say
“small exponential”: COBS provokes its own worst case when repeatedly
encoding a large chunk of data, and roughly every 176 levels of
encoding, the data doubles in size.  It takes 24 levels of encoding to
increase in size by 10%.)

This code focuses on an application of iterated COBS to support nested
data structures that are efficiently encodable and efficiently
navigable.

Vanilla COBS works by appending a virtual zero byte to the end of the
data to encode, then encoding it in chunks with the following header
bytes:

| Header byte  | followed by    | encodes                    |
| n: 0x01–0xfe | n-1 data bytes | those n-1 data bytes       |
|              |                | followed by byte 0x00      |
| 0xff         | 254 data bytes | those 254 data bytes *not* |
|              |                | followed by byte 0x00      |
| 0x00         |                | illegal                    |

Nested COBS is not merely an encoding of byte streams; it is an
encoding performing a job similar to bencode, BER, or JSON.  It uses a
possibly iterated variant of the COBS encoding with an additional
header byte inside each NUL-terminated frame, at its beginning.
Specifically, nested COBS encodes a byte string, an integer, a dict, a
list, true, false, or null; this differs from JSON in that the strings
are byte strings and the numbers are integers.  These types are
indicated by the prefix bytes s, i, d, a, t, f, and n, which are
followed by their contents.
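The doubling figures in the parenthesis above follow from the
worst-case growth factor of 255/254 per encoding level, so the number
of levels needed for a given growth factor is a simple logarithm; a
quick check:

```python
import math

per_level = math.log(255 / 254)   # worst-case growth per COBS level
print(math.log(2) / per_level)    # ≈ 176.4 levels to double in size
print(math.log(1.1) / per_level)  # ≈ 24.3 levels to grow by 10%
```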
The contents of true, false, and null are empty; the contents of a
byte string are the bytes of the string; the contents of a list are a
sequence of nested COBS values representing the items in the list; the
contents of an integer are an arbitrary-length big-endian binary
two’s-complement representation of the integer; and the contents of a
dict are somewhat more complicated.

The contents of a dict are a sequence of COBS strings which are *not*
nested COBS values; instead, they are concatenations of a COBS string,
representing the dict key, and a nested COBS value.  Thus the key (as
in bencode and JSON, though not in Python’s data model) must be a
string; the value can be any kind of value that nested COBS can
represent.  This involves additional overhead (typically one byte per
name-value pair) over the simpler expedient taken by bencode of simply
alternating between names and values, but it permits reasonably
efficient binary search on the dict representation.

There’s an “extended COBS” that permits larger chunk sizes:

| Header byte  | followed by      | encodes                    |
| n: 0x01–0xfd | n-1 data bytes   | those n-1 data bytes       |
|              |                  | followed by byte 0x00      |
| 0xfe         | 253 data bytes   | those 253 data bytes *not* |
|              |                  | followed by byte 0x00      |
| 0xff         | 2 length bytes   | those n+254 data bytes     |
|              | encoding n, then | *not* followed by byte     |
|              | n+254 data bytes | 0x00                       |
| 0x00         |                  | illegal                    |

This increases the worst-case overhead from 0.394% to 0.395%, but
permits the use of chunks of up to 65790 bytes.  This permits
efficient skipping of large chunks of data that is known to be
uninteresting because it does not contain a framing NUL byte.  For
example, in the dict encoding above, if you have a key associated with
a value encoded in one mebibyte that is not the key you are looking
for, the encoded value is guaranteed not to contain a NUL byte, so you
can skip to the next key by examining 16 such three-byte headers, a
total of 48 bytes, rather than all 1048624 bytes.
TODO: does the above vitiate the property of seekability in the dict?
To seek in the dict you need to find the nearest framing byte.

Alternatively: rather than using single zero bytes as framing
delimiters, use (possibly unaligned) *pairs* of zero bytes.  Then you
can use a (possibly unaligned) two-byte header all the time, and you
get two bytes of overhead per 65535 bytes in the worst case (0.003%).
This has potentially crushing overhead if you are concatenating a lot
of short items (one-byte or two-byte integers, say, or trues and
falses in the above encoding), but maybe in that case you can use some
of the header-byte namespace for short fields.  For example:

| n: 0x01–0x7f | n-1 data bytes      | those n-1 bytes            |
|              |                     | followed by two 0x00 bytes |
| n: 0x80–0xff | a second count byte | the data bytes, followed   |
|              | m and then data of  | by two 0x00 bytes,         |
|              | ((n&0x7f) << 8) + m | except if the total count  |
|              | + 0x7d bytes        | field was 0xff 0xff        |

This allows you to logically use a pair of NUL bytes as your framing
delimiter, but for data of 126 bytes or less, the actual overhead in
storage will be only one byte.  The encoded chunks are limited to
0x7fff + 0x7d = 32892 bytes, but that still allows you to leap across
a mebibyte in only 32 jumps.  Your worst-case overhead is 0.0061%, and
in the usual case (where the input chunks are less than 32K) this
doesn’t cost you anything on decoding.  And it achieves roughly the
same compression as Cheshire’s COBS/ZPE (in the cases where COBS/ZPE
achieves compression) at the cost of requiring a longer framing
sequence.  (That is, it doesn’t eliminate zero bytes!)

This is similar to the COWS Cheshire proposes in his dissertation, but
he was considering the case where the atomic units of communication
are wider words, not bytes.
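The figures for this two-zero-byte variant can be checked directly,
assuming (as above) that the two count bytes are the only per-chunk
overhead in the worst case:

```python
import math

chunk_max = 0x7fff + 0x7d              # largest encoded chunk
print(chunk_max)                       # 32892
print(math.ceil(1048576 / chunk_max))  # 32 jumps across a mebibyte
print(100 * 2 / chunk_max)             # worst-case overhead, ≈ 0.0061%
```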
So his COWS doesn’t consider the question of zero bytes at the
beginning or end of the count field, which could happen to abut zero
bytes in the preceding or following data, producing false framing
sequences; for the same reason, he considers the count field to count
words rather than bytes.

I’m tempted to allocate another bit in the second count byte as a
“nonleaf” indicator and remove the 0x7d excess in that case, just so
that the determination of whether a nested COBS chunk contains just
plain bytes or chunks wouldn’t depend on some more fragile
higher-level convention.  This temptation is bad, for two reasons.
One is that the nonleaf frame may be (perhaps usually will be) spanned
across several COBS chunks, so this information belongs in the frame,
not in the chunk headers.  The other is that it would halve the length
of unterminated chunks again, doubling the cost of striding across
mebibytes.

Intuitively, we should expect two bytes of framing overhead per “line
of text” to be a reasonable overhead, because that’s what Microsoft
Windows, MS-DOS, CP/M, RT-11, and VAX/VMS used; they just used 0x0d
0x0a at the end of the line, rather than a length field at the
beginning.

Another design alternative is to use length *trailers* rather than
length headers.  This has the advantage that none of the data bytes
ever need to be buffered up in memory, and all the writing is
sequential.  It has the disadvantage that reading must start from the
end of the “frame” rather than the beginning.  This may actually be a
better fit for current Flash storage systems, which are reasonably
adept at random reads but cannot really handle random writes, because
they must erase in large units rather than individual sectors.

An interesting property of iterated COBS is that the “reliability” and
“locality” properties Cheshire was striving for do not survive in the
encoded data.
The top-level framing, with its zero bytes (or pairs of zero bytes in
the variation I suggested above), retains “reliability” (it is
guaranteed to resynchronize after an error) and “locality” (if a frame
and its immediately adjacent framing markers are received correctly,
then the frame will be received correctly).  But the data within it
has had its zero bytes transformed into arbitrary nonzero bytes, so it
does not have these properties, while “two-for-one” byte-escaping
systems like PPP’s do, though at the cost of rapid exponential growth
when iterated.  Moreover, the rapid navigability that attracts me to
the length-prefixed format is not present at this top level, because
the top level has no length fields.

"""