arrayfrombuffer
The arrayfrombuffer Python package makes
it possible to memory-map files as Numerical Python arrays. It contains
two modules:
- "arrayfrombuffer", an extension module written in C, which creates
Numeric arrays from read-write buffer objects, such as mmap objects.
- "maparray", a pure-Python module that makes it convenient to create
Numeric arrays from files, using arrayfrombuffer.
Documentation for arrayfrombuffer and maparray is available for
online perusal. The package is free software, available for
download under an X11-style license.
Evangelism
Memory-mapping files as arrays has the following advantages.
- loading your array from a file is easy --- a module import and a
single function call --- and doesn't use excessive amounts of
memory.
- loading your array is quick; it doesn't need to be copied from one
part of memory to another in order to be loaded.
- your array gets demand-loaded; parts you aren't using don't need to
be in memory or in swap.
- under memory-pressure conditions, your array doesn't use up swap,
and parts of it you haven't modified can be evicted from RAM without
the need for a disk write.
- your arrays can be bigger than your physical memory, up to the
limitations imposed by your virtual memory address space and your
OS; admittedly, this used to be a bigger deal when our PCs had 64
mebibytes of RAM, 2 gibibytes of address space, and 4 gibibytes of
disk space. But on my laptop, which has 128 MiB of RAM and 256 MiB
of swap, I added two 512-mebibyte arrays to produce a third one with
the statement "Numeric.add(a, b, c)". It took eleven minutes,
though.
- when you modify your array, only the parts you modify get written
back out to disk.
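The demand-loading and partial write-back described in the list above come from the operating system's memory mapping, not from anything specific to Numeric. The following is a minimal sketch of that behaviour, assuming Python 3 and using only the standard library's mmap and memoryview in place of arrayfrombuffer (which may not be installable on a current interpreter). The element count, the temporary file, and the variable names are illustrative assumptions.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.calcsize('l')      # bytes per native C long on this platform
COUNT = 1_000_000                # illustrative array length (an assumption)

# Create a file of COUNT longs by setting its length; on most systems the
# new bytes read back as zeros.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'r+b') as f:
    f.truncate(COUNT * ITEM)

with open(path, 'r+b') as f, mmap.mmap(f.fileno(), 0) as m:
    # Mapping does not read the file; pages are loaded when first touched.
    with memoryview(m) as raw, raw.cast('l') as view:
        view[COUNT - 1] = 42         # dirties only the page holding this element
        touched = view[COUNT - 1]
        untouched = view[0]
    m.flush()                        # push the dirty page out to the file

# Check the result through ordinary file I/O.
with open(path, 'rb') as f:
    f.seek((COUNT - 1) * ITEM)
    (on_disk,) = struct.unpack('l', f.read(ITEM))

os.remove(path)
print(touched, untouched, on_disk)   # 42 0 42
```

Only one element is ever written here, so the OS has at most one dirty page to write back, however large COUNT is.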
In theory, you could also use this package for things like arrays in
memory shared between programs or arrays in distributed shared memory.
I haven't tried using it for those things, though.
Someone built a package called "Vmaps" that does something similar,
and also provides nice things like atomic compare-and-swap for
shared-memory arrays. It was released on 2002-01-22.
Usage
Using it is very easy. To create the array file on disk:
open('tmp.foo', 'wb').write(somenumericarray.tostring())
To load it back in as a flat array of type 'l', which is what you get
when you pass only the filename:
import maparray
myarray, mymmapobj, myfile = maparray.maparray('tmp.foo')
Now myarray is a perfectly ordinary Numeric array whose data just
happens to be stored in the file 'tmp.foo'.
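For readers without Numeric installed, the same round trip can be approximated with the Python 3 standard library. This is a sketch under stated assumptions, not the maparray implementation: map_flat is a hypothetical stand-in that returns the same kind of triple, the array module plays the role of the Numeric array being saved, and a memoryview plays the role of the mapped array.

```python
import array
import mmap
import os
import tempfile

def map_flat(path, typecode='l'):
    """Hypothetical stand-in for maparray.maparray(): map the whole file
    read-write and return (view, mmap object, file object)."""
    f = open(path, 'r+b')
    m = mmap.mmap(f.fileno(), 0)             # length 0 means "the whole file"
    view = memoryview(m).cast(typecode)      # flat view sharing memory with m
    return view, m, f

# Write ten native longs to disk, much as the tostring() one-liner does.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as out:
    out.write(array.array('l', range(10)).tobytes())

myarray, mymmapobj, myfile = map_flat(path)
loaded = myarray.tolist()
myarray[3] = -7                              # writes straight into the mapping
patched = array.array('l', mymmapobj[:]).tolist()

# Release the view before closing the mmap it points into.
myarray.release()
mymmapobj.close()
myfile.close()
os.remove(path)
print(loaded)
print(patched)
```

Unlike a Numeric array, a memoryview supports indexing and slicing but no arithmetic, so this shows only the mapping half of the story.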
If you want a different data type, you can specify it, either
positionally or with the typecode keyword:
myarray, mymmapobj, myfile = maparray.maparray('tmp.foo', 'f')
myarray, mymmapobj, myfile = maparray.maparray('tmp.foo', typecode='f')
You can specify a shape as well, again either positionally or with
the shape keyword:
myarray, mymmapobj, myfile = maparray.maparray('tmp.foo', 'f', (-1, 24))
myarray, mymmapobj, myfile = maparray.maparray('tmp.foo', shape=(-1, 24))
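How a shape such as (-1, 24) might be resolved against the file size can be sketched with the standard library. The helper view_as below is hypothetical and is not how maparray does it; it assumes that -1 means "infer this dimension from the number of items", as in the calls above, and it uses memoryview.cast() from Python 3 in place of a Numeric reshape.

```python
import mmap
import os
import struct
import tempfile

def view_as(buf, typecode, shape=None):
    """Hypothetical helper: view `buf` as items of `typecode`, optionally
    reshaped. One dimension may be -1 and is inferred from the size."""
    raw = memoryview(buf)
    if shape is None:
        return raw.cast(typecode)
    count = raw.nbytes // struct.calcsize(typecode)
    dims = list(shape)
    known = 1
    for d in dims:
        if d != -1:
            known *= d
    if dims.count(-1) == 1 and known > 0 and count % known == 0:
        dims[dims.index(-1)] = count // known    # fill in the inferred dimension
    elif -1 in dims:
        raise ValueError('cannot infer shape %r for %d items' % (shape, count))
    return raw.cast(typecode, shape=dims)

# 48 single-precision floats on disk: two rows of 24.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as out:
    out.write(struct.pack('48f', *map(float, range(48))))

with open(path, 'r+b') as f, mmap.mmap(f.fileno(), 0) as m:
    grid = view_as(m, 'f', (-1, 24))
    shape = grid.shape
    corner = grid[1, 0]                          # first item of the second row
    grid.release()                               # release before the mmap closes

os.remove(path)
print(shape, corner)                             # (2, 24) 24.0
```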
If you make changes via myarray and you want them reflected in the
file before you delete all references to the mmap object and the
array, do this:
mymmapobj.flush()
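The effect of flush() can be checked with the standard library mmap module alone. This is a small sketch, assuming Python 3; the file contents and names are illustrative. It modifies a mapping, flushes it, and then reads the file through a second, independent handle while the mapping is still open.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as out:
    out.write(bytes(8))                          # eight zero bytes

with open(path, 'r+b') as f, mmap.mmap(f.fileno(), 0) as m:
    m[0:4] = b'\x01\x02\x03\x04'                 # modify through the mapping
    m.flush()                                    # ask the OS to write dirty pages out
    with open(path, 'rb') as reader:             # separate handle, mapping still open
        on_disk = reader.read()

os.remove(path)
print(on_disk)
```

On many systems the second handle would see the change even without the flush, because mapped pages and file reads share one cache, but flush() is what makes that a documented guarantee rather than a platform detail.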
Wishlist
It might be nice to have some of the following features:
- using part of a buffer instead of the whole thing (or mmapping part
of a file instead of the whole thing).
- not crashing if you close() the mmap object and then access myarray;
unfortunately, this is very difficult, and probably the easiest way
to do it is to provide a version of the mmap module that doesn't
have a close() method.
- support for explicit lengths so you can create arrays this way too;
on Unix, you can use ftruncate() to set the length of the file, but
on any OS, you can write a bunch of zero bytes.
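Two of these wishes, mapping only part of a file and creating a file of a given length, can be sketched with the Python 3 standard library. Nothing below is part of arrayfrombuffer or maparray: create_zeroed and map_part are hypothetical helpers, and the sketch assumes that extending a file with truncate() leaves zero bytes, which is true on most systems but platform-dependent.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.calcsize('l')                  # bytes per native C long
GRAN = mmap.ALLOCATIONGRANULARITY            # mmap offsets must be multiples of this

def create_zeroed(path, count):
    """Hypothetical helper: create a file of `count` longs by setting its length."""
    with open(path, 'wb') as f:
        f.truncate(count * ITEM)             # Python's wrapper around ftruncate()

def map_part(f, first_item, count):
    """Hypothetical helper: map `count` longs starting at item `first_item`."""
    start = first_item * ITEM
    aligned = start - start % GRAN           # round down to a legal mmap offset
    skip = start - aligned                   # bytes mapped only for alignment
    m = mmap.mmap(f.fileno(), skip + count * ITEM, offset=aligned)
    view = memoryview(m)[skip:].cast('l')    # hide the alignment padding
    return view, m

fd, path = tempfile.mkstemp()
os.close(fd)
per_gran = GRAN // ITEM
create_zeroed(path, 3 * per_gran)

with open(path, 'r+b') as f:
    view, m = map_part(f, per_gran + 5, 4)   # four items somewhere in the middle
    view[0] = 99
    window = view.tolist()
    view.release()                           # release before closing the mmap
    m.flush()
    m.close()

with open(path, 'rb') as f:
    f.seek((per_gran + 5) * ITEM)
    (on_disk,) = struct.unpack('l', f.read(ITEM))

os.remove(path)
print(window, on_disk)                       # [99, 0, 0, 0] 99
```

The rounding in map_part is needed because the OS maps whole allocation units; the slice before the cast keeps that detail away from the caller.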
kragen@pobox.com