#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Buxus microfila: a form for OSM data optimized for fast map drawing.

(Its branches look like little box files with small leaves, thus the
name.)

Buxus microfila is a file format that provides compact storage and
very rapid retrieval of OpenStreetMap data for drawing maps.  A buxus
file contains a tree of nested bounding boxes at different zoom
levels; each box has a node in the tree that contains only the most
important 32 kilobytes of data in that box, which may also be
duplicated in other boxes of the same zoom level (in the case of ways
that cross boxes) or in smaller or larger boxes.  This allows a
map-drawing program to consume only the amount of data that it needs,
which will usually be only a few boxes, so access and filtering will
be fast.

In particular, a map-drawing program intended for a cellphone screen
should always be able to get the data it needs in no more than five
I/O operations reading a total of under 320 kibibytes.  That's up to
160 kibibytes from reading the directory (some 20 bytes per box, so a
128-meg file with about 8192 boxes will need 160 kibibytes of
directory data) and one to four boxes, a maximum of a bit over 128
kibibytes of data for the map.  That’s 5 IOPS, worst case, and
usually 3.  If you need higher resolution, you may need to go deeper,
of course.

API
---

This is a Python implementation of the API and file format.  There
will also be C, OCaml, and maybe Java implementations, but I haven’t
written them yet.  Heck, I haven’t even finished with this one.

What I have in mind is something like buxample.py.  Here .ways()
efficiently returns a potentially very large list of ways that
intersect the given bounding box, ordered by importance.  The
``canvas`` object with its `.project`, `.moveto`, `.lineto`, and
`.stroke` methods is somewhat fictitious, but it’s closely modeled on
PostScript, PDF, and SVG, except that of course those don’t do map
projections.
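A rough sketch of that drawing loop, under stated assumptions: the
``canvas`` object is the fictitious one described above, and the
``draw_ways`` helper and its ``budget`` parameter are inventions for
illustration, not part of the format; ``tree`` is a Buxus object
whose ``.ways(bbox)`` yields ``(way_id, lats, lons, keys, values)``
tuples, most important first, with coordinates as integer multiples
of 1e-5 degrees.

```python
def draw_ways(tree, canvas, screen_bbox, budget=512):
    # Ways arrive most-important-first, so we can stop as soon as
    # we have drawn roughly as many line segments as the screen
    # can usefully display.
    drawn = 0
    for way_id, lats, lons, keys, values in tree.ways(screen_bbox):
        canvas.moveto(canvas.project(lats[0] / 1e5, lons[0] / 1e5))
        for lat, lon in zip(lats[1:], lons[1:]):
            canvas.lineto(canvas.project(lat / 1e5, lon / 1e5))
        canvas.stroke()
        drawn += len(lats) - 1      # n points make n-1 segments
        if drawn >= budget:
            break
    return drawn
```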
You have to break out of the loop early, or you’ll end up running for
a really long time.

File format
-----------

So, a Buxus file consists of a header, some boxes, and a trailer.

The header is just a magic number, the 16 bytes “Buxus microfila\n”.

The boxes follow, one after the other, without delimiters.

The trailer consists of a directory of boxes, followed by a pointer
to the byte offset of the beginning of the trailer.  The last 8 bytes
in the file are this pointer.  The pointer is *not* encoded with
bytesquish.

The directory entries in the trailer contain just a bounding box
(lat_min, lon_min, lat_max, lon_max) and a byte offset at which to
find the given box.  The latitudes and longitudes are encoded, in
bytesquish encoding, without delta encoding, as signed integer
multiples of .00001 degrees, which is about a meter.  The byte offset
is also encoded in bytesquish encoding.

A box consists of a sequence of ways, maybe some nodes, and finally a
string table.  The end of the string table is the end of the box.

A way is represented (currently) as a "\n" byte, its way id as a
bytesquished int, two bytesquished int arrays consisting of the
deltas between latitudes (initial value 0) and the deltas between
longitudes, and two more bytesquished int arrays consisting of
indices into the string table for the tag names and tag values of
the way.  Like the bounding boxes, the latitude and longitude arrays
are encoded as signed integer multiples of .00001 degrees, but most
of the coordinates are single bytes because of the delta-encoding
and bytesquish.  The tag names and tag values are also, in practice,
almost always single bytes, because the string table for a single
box almost never seems to grow to 63 items.

Todo
----

Figure out why it’s emitting far too many boxes.

Terminate string tables with '\0' instead of counting them.

Maybe switch to protobufs?

Try PyPy or OCaml.

Squarify boxes in a latitude-sensitive way.

Do something sensible with missing nodes.
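Concretely, a reader finds the directory by reading the raw trailer
pointer from the end of the file; a minimal sketch (the function name
is mine, and decoding the bytesquished directory entries themselves
is left to the bytesquish module):

```python
import struct

def read_trailer_offset(path):
    # The trailer pointer is the last 8 bytes of the file, a raw
    # big-endian unsigned offset -- deliberately *not* bytesquished,
    # so a reader can find it with one fixed-size read from the end.
    with open(path, 'rb') as f:
        f.seek(-8, 2)           # 2 == os.SEEK_END
        return struct.unpack('>Q', f.read(8))[0]
```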
Bugs
----

I’m pretty sure the current code for intersection testing or bounding
boxes or something is wrong, because it ends up generating far too
many tiny tree boxes.  Need to dig into that a bit more.

The bbox intersection test is broken: when the code (incorrectly)
includes (0, 0) for missing nodes, only 27 ways (which all have
missing nodes) are included in any sub-box.  All the others are
incorrectly omitted, even from the box that actually contains
Baghdad.  Wait!  Now that I’m doing the delta-encoding on output
instead of internally, it’s 51 instead of 27, but they still all
have (0, 0) coordinates.

Without the tree, Baghdad was 700K.  I was expecting an increase of
like 5% from building the box tree.  Instead, we get 1700K.
Something is wrong.  What is it?  Are we tending to duplicate entire
enormous ways into hundreds of boxes (in which case, we should clip
them)?  Is it the per-box overhead of 500 boxes with duplicate
string tables?  That should be only about 100K, because the string
table is 217 bytes right now.  500 boxes is still probably too many;
I was thinking like 100, and 1700K should be like 200, but I guess
the average box is half the size of the max.

A few ways are duplicated into an unreasonable number of boxes,
although they don’t seem to actually contain that much data (this
was for 8KB boxes with 16-way branching):

     58 way  4236970 type 3
     60 way  4236969 type 3
     73 way 80151188 type 8
     73 way 82873610 type 8
     98 way  4075037 type 2
    100 way  4075038 type 2
    150 way 80177849 type 8
    150 way 80177855 type 8
    245 way 79974287 type 8
    245 way 80918599 type 8

(Type 8 is “proposed”, type 2 is “secondary”, and type 3 is
“unclassified”.)

It doesn’t yet squarify the boxes as it should.  Squarifying
probably involves dividing certain bboxes up as 2×4 or 4×2 instead
of 3×3, so that the smaller boxes get closer to squareness.

It doesn’t yet encode node tags or relations.
Tradeoffs
---------

Currently, the maximum box size is 32768 bytes of ways, containing
about 700–1000 ways, and the box subdivision factor is 3×3.  I don’t
fully understand the implications of those settings.

Bigger boxes also mean less directory data.  1.4 kilobytes for the
73 boxes of Baghdad is 20 bytes a box, and so a gigabyte divided
into 65536 boxes will need 1280 kibibytes of directory data.  1280
kibibytes should be enough for anyone.

Bigger boxes make tags heavier as they overflow the 63-item limit
for 1-byte string references.  But they also share the same string
table among more things that refer to them.

Bigger boxes mean that readers will find the data they need at a
shallower depth in the tree, so they won’t have to consult as many
boxes.  But they also mean that readers will have to deal with more
irrelevant data in the boxes they do consult.

Higher branching factors mean that readers will typically need to
consult more boxes, but they result in a file with fewer boxes.

### Previously ###

The maximum box size is 8192 bytes of ways (plus one way), holding
about 200–300 ways at present.  The box subdivision factor is 4×4.
This results in a maximum tree depth for Baghdad of, I think, 4; I
imagine Argentina will be 7.  These represent performance tradeoffs
I don’t fully understand: bigger boxes and more finely subdivided
boxes result in shallower trees.  Bigger boxes and less finely
subdivided boxes result in more locality of reference, but also
pawing through more dross.
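A quick check of the directory-size arithmetic above (20 bytes per
entry is the observed average from the Baghdad file, not a fixed
field width, since bytesquished entries vary in size):

```python
# ~20 bytes per directory entry: four bytesquished coordinates
# plus a bytesquished byte offset, as measured on Baghdad.
bytes_per_entry = 20
boxes = 65536                   # a gigabyte-scale file's box count
directory_size = boxes * bytes_per_entry
assert directory_size == 1280 * 1024    # 1280 KiB, as claimed
```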
256 ways averaging 8 node visits is about 1792 line segments per
box, but a map-drawing program for a cellphone display only needs
about 256 to 512 line segments, it has an aspect ratio of nearly 2,
and an average screenful will be halfway between zoom levels and not
aligned with box boundaries, so it will probably have to access
something like 3×5 = 15 boxes from the file, containing a total of
almost 27000 line segments, about 100× more data than it needs and
thus 100× worse performance (in the part that filters those segments
for visibility).

But maybe you can just look at the 1792 line segments in the
smallest box enclosing the screenful, and then you have only 3 to 7
times as much data as you need.  Unless you’re zoomed in so far that
you need stuff that’s omitted from the larger box.

This makes me think that probably I should use a larger box byte
size and less subdivision, ensuring that you can always get what you
need from a single box or four boxes joined at a corner, so that the
worst case is something like 512 or 1024 ways: an order of magnitude
more data than optimal, but no more than that.

Performance
-----------

Using Buxus to treeify the 103MB Argentina PBF data file I have
handy (which is 2.3GB in XML format) takes about 5'45" on my laptop
(and 2500MB of RAM, which is unacceptably large) and generates a
53MB output file consisting of 4898 boxes.  Somewhat to my surprise,
a significant number of the string tables are over 63 strings,
suggesting that maybe I should use smaller boxes.  Dumping this file
with -d takes 40 seconds on my laptop, so decoding each box takes
about 8ms; that will probably be more like 80ms in Python on a
cellphone, or 0.8ms in C.  LZMA on a (441MB) textual dump of that
tree is able to reduce it to 22MB, so there’s still significant
redundancy in that 53MB.

Processing purely sequentially, this program needed about 17 MB of
RAM and 17 seconds on my netbook to handle the 177000 node-visits in
the Baghdad data file I’m using for testing.
Storing it all in memory more or less naïvely, 23 MB, same runtime.
Building boxes uses the same memory but takes longer.  Switching to
not doing delta-encoding internally reduced it to 22 MB.

I’ve added per-box string tables with all the way tags; typically
these seem to be under half a kilobyte, or about 3% overhead.
Adding bounding-box calculations on each way increased processing
time slightly.  Adding box-tree dumping inflated the Baghdad file to
2+ MB and processing time to 2 minutes, but this might be just a bug
where it includes far too many ways in each box, and in particular
really large ways.  Switching to 32768-byte boxes and 9×9
subdivision reduced the size to 1.1MB and the time to 33 seconds;
switching to using a real XML parser instead of regular expressions
pushed the time back up to 49 seconds and memory usage up to an
alarming 27MB.  Worse, handling arbitrary tags pushed memory usage
up to an unacceptable 36MB, time up to 57 seconds, and output file
size up to 1.5 MB, which is actually larger than the PBF file I
started with.  I got the memory usage back to 25MB or so by parsing
node IDs to integers (I guess I had like 11MB allocated to node ID
strings).  This output file contains 39199 ways (down from 45215,
since I changed the branching parameters), while the input contains
about 27000, plus about 3000 that aren’t highways.

This program needed about 1.1 GB of RAM when processing purely
sequentially, and 52 minutes on my netbook, to handle the 12 million
node-visits of Argentina, fed from osmosis.  Now we can expect that
it will need 1.6 GB, which is still barely acceptable.  Osmosis
itself uses about 10 or 11 minutes of that.  The output file was 23
MB, which was about 2 bytes per node visit; I can only assume that
this is because most of the ways are omitted.

Naming
------

The boxes are called “boxes” instead of “tiles”, because if I say
“tiles” people will think they’re raster tiles.
“Leaves” or “hojas” or “folhas” conflicts with the “tree”-related
meaning: boxes exist at all levels of the tree, including the root.
“Pages” too strongly suggests virtual memory.  “Columns”, like on a
scroll, suggests that they’re vertical strips.  So “Boxes” it is,
although there’s a company called “Mapbox”.  Then this is a “box
tree file”.  “Squares” would work, but unfortunately, Squaretree
Software is a custom business software house in Sacramento.

"""
import re
import struct
import sys
import __builtin__
import xml.sax.saxutils
import xml.sax.expatreader
import xml.sax.xmlreader

import bytesquish


def main(argv):
    if len(argv) == 1:
        return encode_stdin_to_stdout()
    elif argv[1] == '-d':
        # Decode the file sequentially.
        return decode(sys.stdin)
    elif argv[1] == '-t':
        # Walk the file by pointers.
        return walk(sys.stdin)
    else:
        sys.stderr.write("Usage: %s [-d|-t] outfile\n" % argv[0])
        return -1

magic = 'Buxus microfila\n'

def decode(infile):
    b = byte_iterator(infile)
    for magic_letter in magic:
        letter = b.next()
        if magic_letter != letter:
            print 'magic mismatch: %r ≠ %r' % (magic_letter, letter)

    while True:
        try:
            t = b.next()
        except StopIteration:
            break

        if t == '\n':           # Type: way
            way_id, lats, lons, keys, values = decode_way(b)
            print "way", way_id
            for lat, lon in zip(lats, lons):
                print lat / 1e5, lon / 1e5
            for k, v in zip(keys, values):
                print "%r → %r" % (k, v)
        elif t == 'T':          # string table for e.g. highway types
            for i, ss in enumerate(decode_string_table(b)):
                print "string", i, "=", repr(ss)
        elif t == 'D':          # directory (trailer)
            for bbox, offset in decode_directory(b):
                print "%s at byte %s" % (bbox_repr(bbox), offset)
            print "directory offset %d" % decode_offset(b)
            try:
                garbage = b.next()
            except StopIteration:
                break
            else:
                print "unexpected trailing garbage %r" % garbage
        else:
            print "unknown type %r" % t

def walk(fo):
    buxus = open(fo)
    print "file directory starts at byte %d" % buxus.directory_offset
    for box in buxus:
        print "%s at byte %s:" % (box.bbox_repr(), box.offset)
        for way_id, lats, lons, keys, values in box:
            print "  way %d:" % way_id
            for k, v in zip(keys, values):
                print "    %r: %r" % (k, v)
            for lat, lon in zip(lats, lons):
                print "    %.5f %.5f" % (lat / 1e5, lon / 1e5)

def open(fo):
    if isinstance(fo, basestring):
        fo = __builtin__.open(fo)

    SEEK_END = 2
    fo.seek(0)
    walk_requires_equality(fo.read(len(magic)), magic)
    fo.seek(-8, SEEK_END)
    directory_offset = decode_offset(byte_iterator(fo))
    fo.seek(directory_offset)
    directory = list(decode_directory(byte_iterator(fo)))
    return Buxus(fo, directory_offset, directory)

class Buxus:
    def __init__(self, fo, directory_offset, directory):
        self.fo = fo
        self.directory_offset = directory_offset
        self.directory = directory

    def __iter__(self):
        for box in self.directory:
            yield BuxusBox(self.fo, box)

    def bbox(self):
        return reduce(bbox_union,
                      (bbox for bbox, offset in self.directory))

    def ways(self, bbox):
        for box_bounds, offset in self.directory:
            if bbox_intersects(bbox, box_bounds):
                for way in BuxusBox(self.fo, (box_bounds, offset)):
                    if bbox_intersects(bbox, compute_bbox(way[1:3])):
                        yield way

class BuxusBox:
    def __init__(self, fo, (bbox, offset)):
        self.fo = fo
        self.bbox = bbox
        self.offset = offset

    def bbox_repr(self):
        return bbox_repr(self.bbox)

    def __iter__(self):
        self.fo.seek(self.offset)
        b = byte_iterator(self.fo)
        ways = []
        while True:
            try:
                t = b.next()
            except StopIteration:
                raise TreeWalkError("premature eof at", self.fo.tell())

            if t == '\n':
                ways.append(decode_way(b))
            elif t == 'T':
                strings = decode_string_table(b)
                break
            else:
                msg = 'either T or \\n at byte %d' % self.fo.tell()
                walk_requires_equality(t, msg)

        for way_id, lats, lons, keys, values in ways:
            yield (way_id, lats, lons,
                   [strings[k] for k in keys],
                   [strings[v] for v in values])

def walk_requires_equality(a, b):
    if a != b:
        raise TreeWalkError("Surprisingly different; despairing", a, b)

class TreeWalkError(Exception):
    pass

class EmptyWay(Exception):
    "Raised when a way has no node coordinates at all."
    pass

def byte_iterator(f):
    for line in f:
        for b in line:
            yield b

def decode_way(b):
    way_id = bytesquish.load_int(b)
    lats = bytesquish.load_array(b)
    lons = bytesquish.load_array(b)
    for i in range(1, len(lats)):
        lats[i] += lats[i-1]
        lons[i] += lons[i-1]
    keys = bytesquish.load_array(b)
    values = bytesquish.load_array(b)
    return way_id, lats, lons, keys, values

def decode_string_table(b):
    _ = b.next()                # skip the '\n' after 'T'
    return [bytesquish.load_bytes(b)
            for i in range(bytesquish.load_int(b))]

def decode_directory(b):
    _ = b.next()                # skip the '\n' after 'D'
    n = bytesquish.load_int(b)
    for i in range(n):
        lat_min = bytesquish.load_int(b)
        lon_min = bytesquish.load_int(b)
        lat_max = bytesquish.load_int(b)
        lon_max = bytesquish.load_int(b)
        yield (lat_min, lon_min, lat_max, lon_max), bytesquish.load_int(b)

def decode_offset(b):
    offset = ''.join(b.next() for i in range(8))
    return struct.unpack(">Q", offset)[0]

def encode_stdin_to_stdout():
    way_store = WayStore()
    p = xml.sax.expatreader.create_parser()
    handler = WayStoreHandler(way_store)
    p.setContentHandler(handler)
    p.setErrorHandler(xml.sax.ErrorHandler())
    source = xml.sax.xmlreader.InputSource()
    source.setByteStream(sys.stdin)
    source.setEncoding("UTF-8")
    p.parse(source)
    sys.stderr.write("%d node ids from %r to %r\n" % (
        len(handler.node_coords),
        min(handler.node_coords.iterkeys()),
        max(handler.node_coords.iterkeys())))
    del p, source, handler

    out = CountingWriter(sys.stdout)
    out.write("Buxus microfila\n")
    directory = dump_tree(way_store, out)
    dump_directory(directory, out)

class WayStoreHandler(xml.sax.handler.ContentHandler):
    def __init__(self, store):
        self.store = store
        self.node_coords = {}
        self.way_id = None

    def startElement(self, name, attrs):
        if name == 'nd':
            if self.way_id is None:
                return
            coords = self.node_coords.get(int(attrs['ref']))
            if coords is None:
                # XXX missing node.  For now, cope as well as we can.
                # Commented for now, but when uncommented, this
                # demonstrates some kind of serious bug in the
                # tree-building code.
                # way_lats.append(0)
                # way_lons.append(0)
                return
            lat, lon = coords
            self.way_lats.append(lat)
            self.way_lons.append(lon)
        elif name == 'node':
            self.node_coords[int(attrs['id'])] = (angle(attrs['lat']),
                                                  angle(attrs['lon']))
        elif name == 'tag':
            if self.way_id is not None:
                self.tag_names.append(attrs['k'].encode('utf-8'))
                self.tag_values.append(attrs['v'].encode('utf-8'))
        elif name == 'way':
            self.way_id = int(attrs['id'])
            self.tag_names = []
            self.tag_values = []
            self.way_lats = []
            self.way_lons = []

    def endElement(self, name):
        if name == 'way':
            if 'highway' in self.tag_names:
                self.store.add_way(self.way_id,
                                   self.way_lats, self.way_lons,
                                   self.tag_names, self.tag_values)
            self.way_id = None
            del self.way_lats, self.way_lons, self.tag_names, self.tag_values

def dump_tree(store, out, level=0):
    sys.stderr.write("%sdumping %s " % (' ' * level, store))
    directory = [(store.bbox(), out.count)]
    if store.dump(out, store.universe(), level=level):
        return directory

    for box in subdivide(store.bbox(), 3, 3):
        substore = Clipped(store, box)
        if len(substore) and len(substore) < len(store):
            directory.extend(dump_tree(substore, out, level+1))
    return directory

def dump_directory(directory, out):
    start = out.count
    out.write("D\n" + bytesquish.dumps_int(len(directory)))
    for bbox, pos in directory:
        out.write(''.join(bytesquish.dumps_int(item) for item in bbox)
                  + bytesquish.dumps_int(pos))
    out.write(struct.pack(">Q", start))

class WayStore:
    """Stores ways in memory before sending them to an output.

    This way, we can order them by importance, and we can dump
    subsets of them.
""" def __init__(self): self.way_ids = [] self.way_lat_lists = [] self.way_lon_lists = [] self.tag_lists = [] self.value_lists = [] self.lat_min = None self.lon_min = None self.lat_max = None self.lon_max = None self.order = None def add_way(self, way_id, way_lats, way_lons, tags, values): self.order = None if not way_lats: raise EmptyWay(way_id) self.way_ids.append(way_id) self.way_lat_lists.append(way_lats) self.way_lon_lists.append(way_lons) self.tag_lists.append(tags) self.value_lists.append(values) self.update_bbox(compute_bbox((way_lats, way_lons))) def update_bbox(self, (lat_min, lon_min, lat_max, lon_max)): if self.lat_min is None: self.lat_min = lat_min self.lon_min = lon_min self.lat_max = lat_max self.lon_max = lon_max else: self.lat_min = min(lat_min, self.lat_min) self.lon_min = min(lon_min, self.lon_min) self.lat_max = max(lat_max, self.lat_max) self.lon_max = max(lon_max, self.lon_max) def get_coordinates(self, i): return self.way_lat_lists[i], self.way_lon_lists[i] def universe(self): if self.order: return self.order highway_types = list([value for k, value in zip(ks, vs) if k == 'highway' ][0] for ks, vs in zip(self.tag_lists, self.value_lists)) # “road” is the highway type for roads of unknown type. # (“Unclassified” is a particular kind of highway; the tag is # taken from the confusing nomenclature of the English highway # system, and it does not mean that the road is a road of # unknown type. That’s “road”.) So here we treat roads whose # importance is not in our importance list as being equal to # “road”. 
importance = list('motorway trunk primary secondary tertiary' ' unclassified residential road service' ' living_street pedestrian track ' ' footway bridleway raceway'.split()) highway_ordering = dict((k, importance.index(k) if k in importance else importance.index('road')) for k in highway_types) highway_key = list(highway_ordering[highway_type] for highway_type in highway_types) self.order = sorted(range(len(self)), key=highway_key.__getitem__) return self.order def __len__(self): return len(self.way_ids) def __repr__(self): return '' % (bbox_repr(self.bbox()), nways(len(self))) def bbox(self): return self.lat_min, self.lon_min, self.lat_max, self.lon_max def dump(self, out, subset, level, limit=32768): out = CountingWriter(out) done = True strings = [] for j, i in enumerate(subset): tags, values = self.tag_lists[i], self.value_lists[i] for ss in tags + values: if ss not in strings: strings.append(ss) # Use a single byte of type information instead of two # because the average size of a way is something like 16 # or 32 bytes, so adding an extra byte of overhead would # be bad. 
out.write('\n' + bytesquish.dumps_int(self.way_ids[i]) + bytesquish.dumps_array(deltas(self.way_lat_lists[i])) + bytesquish.dumps_array(deltas(self.way_lon_lists[i])) + bytesquish.dumps_array(strings.index(tag) for tag in tags) + bytesquish.dumps_array(strings.index(value) for value in values)) if out.count > limit: done = False breakpoint = out.count break table_start = out.count out.write('T\n' + bytesquish.dumps_int(len(strings)) + ''.join(bytesquish.dumps_bytes(ss) for ss in strings)) sys.stderr.write("%d strings are %d of %d bytes\n" % (len(strings), out.count - table_start, out.count)) if not done: sys.stderr.write( "%sBox full after %s (%d b) of %d\n" % (' ' * level, nways(j), breakpoint, len(subset))) return done def deltas(items): last = 0 for item in items: yield item - last last = item class CountingWriter: def __init__(self, out): self.out = out self.count = 0 def write(self, s): self.count += len(s) self.out.write(s) def compute_bbox((way_lats, way_lons)): return (min(way_lats), min(way_lons), max(way_lats), max(way_lons)) def clipped_subset(store, bbox): for i in store.universe(): way_coords = store.get_coordinates(i) way_bbox = compute_bbox(way_coords) if bbox_intersects(way_bbox, bbox) and (first_is_smaller(way_bbox, bbox) or crosses(way_coords, bbox)): yield i def first_is_smaller(first, second): return ((first[1] - first[0]) < (second[1] - second[0]) or (first[3] - first[2] < (second[3] - second[2]))) def crosses((lats, lons), (lat_min, lon_min, lat_max, lon_max)): # XXX this will miss ways that cross the bbox entirely without # touching it, e.g. 
    # because you’re super zoomed in.
    return any(lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
               for lat, lon in zip(lats, lons))

def subdivide((lat_min, lon_min, lat_max, lon_max), nlat, nlon):
    lats = [lat_min + (lat_max - lat_min) / nlat * i
            for i in range(nlat)] + [lat_max]
    lons = [lon_min + (lon_max - lon_min) / nlon * i
            for i in range(nlon)] + [lon_max]
    for i in range(nlat):
        for j in range(nlon):
            yield (lats[i], lons[j], lats[i+1], lons[j+1])

class Clipped:
    def __init__(self, parent, bbox):
        self.parent = parent
        self._bbox = bbox
        self.subset = list(clipped_subset(parent, bbox))

    def __repr__(self):
        return '<Clipped %s: %s>' % (bbox_repr(self._bbox),
                                     nways(len(self.subset)))

    def universe(self):
        return self.subset

    def __len__(self):
        return len(self.subset)

    def dump(self, out, subset, level):
        return self.parent.dump(out, subset, level)

    def get_coordinates(self, i):
        return self.parent.get_coordinates(i)

    def bbox(self):
        return self._bbox

def bbox_intersects((lat_min_a, lon_min_a, lat_max_a, lon_max_a),
                    (lat_min_b, lon_min_b, lat_max_b, lon_max_b)):
    return (interval_intersects((lat_min_a, lat_max_a),
                                (lat_min_b, lat_max_b))
            and interval_intersects((lon_min_a, lon_max_a),
                                    (lon_min_b, lon_max_b)))

def bbox_repr((lat_min, lon_min, lat_max, lon_max)):
    return '(%.5f + %.5f, %.5f + %.5f)' % (lat_min / 1e5,
                                           (lat_max - lat_min) / 1e5,
                                           lon_min / 1e5,
                                           (lon_max - lon_min) / 1e5)

def bbox_union((a, b, c, d), (e, f, g, h)):
    return min(a, e), min(b, f), max(c, g), max(d, h)

def nways(n):
    return '1 way' if n == 1 else '%d ways' % n

def interval_intersects((min_a, max_a), (min_b, max_b)):
    return max_a >= min_b and max_b >= min_a

assert interval_intersects((0, 1), (1, 2))
assert interval_intersects((0, 2), (1, 2))
assert interval_intersects((0, 3), (1, 2))
assert interval_intersects((1, 2), (0, 3))
assert interval_intersects((1, 2), (0, 2))
assert interval_intersects((1, 2), (0, 1))
assert not interval_intersects((0, 1), (2, 3))
assert not interval_intersects((2, 3), (0, 1))

def angle(text):
    # 10 micro-degrees is about a meter, or less.
    f = float(text)
    f *= 1e5
    return int(round(f))

if __name__ == "__main__":
    sys.exit(main(sys.argv))