arrayfrombuffer

The arrayfrombuffer Python package makes it possible to memory-map files as Numerical Python arrays. It contains two modules:

"arrayfrombuffer", an extension module written in C, which creates Numeric arrays from read-write buffer objects, such as mmap objects.
"maparray", a pure-Python module that makes it convenient to create Numeric arrays from files, using arrayfrombuffer.

Some documentation for arrayfrombuffer and maparray is available online. The package is free software, available for download under an X11-style license.

Evangelism

Memory-mapping files as arrays has the following advantages.

loading your array from a file is easy (a module import and a single function call) and doesn't use excessive amounts of memory.
loading your array is quick; it doesn't need to be copied from one part of memory to another in order to be loaded.
your array gets demand-loaded; parts you aren't using don't need to be in memory or in swap.
under memory pressure, your array doesn't use up swap, and parts of it you haven't modified can be evicted from RAM without the need for a disk write.
your arrays can be bigger than your physical memory, up to the limits imposed by your virtual address space and your OS. Admittedly, this was a bigger deal when our PCs had 64 mebibytes of RAM, 2 gibibytes of address space, and 4 gibibytes of disk space. But on my laptop, which has 128 MiB of RAM and 256 MiB of swap, I added two 512-mebibyte arrays to produce a third one with the statement "Numeric.add(a, b, c)". It took eleven minutes, though.
when you modify your array, only the parts you modify get written back out to disk.
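
As a rough illustration of the points above, here is a minimal sketch that uses only Python's standard library (mmap and memoryview) instead of Numeric and arrayfrombuffer. It is not this package's code: map_file is a hypothetical stand-in that returns the same kind of (array-like, mmap object, file) triple, and the typecode 'd' is chosen only for the example. The typed view reinterprets the mapped bytes in place, so nothing is copied at load time.

```python
import mmap
import os
import tempfile
from array import array


def map_file(path, typecode="d"):
    """Map the whole file at `path` as a flat, writable typed view.

    Hypothetical stand-in for maparray.maparray(), built from the
    standard library rather than from Numeric.
    """
    f = open(path, "r+b")
    mm = mmap.mmap(f.fileno(), 0)          # length 0 maps the entire file
    view = memoryview(mm).cast(typecode)   # reinterpret the bytes, no copy
    return view, mm, f


# Write 1000 doubles to a scratch file, then map them back in.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "tmp.foo")
    with open(path, "wb") as out:
        out.write(array("d", range(1000)).tobytes())

    view, mm, f = map_file(path)
    n_items = len(view)
    # Reading a few items only needs the pages that hold them.
    total = sum(view[i] for i in range(10))

    # Release in this order: the view must go before the mmap can close.
    view.release()
    mm.close()
    f.close()
```

The release order at the end matters in this sketch: the standard mmap object refuses to close while a memoryview still points into it.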

In theory, you could also use this package for things like arrays in memory shared between programs or arrays in distributed shared memory. I haven't tried using it for those things, though.

Someone built a package called "Vmaps" that does something similar, but provides nice things like atomic compare-and-swap for shared-memory arrays. It was released 2002-01-22.

Usage

Using it is very easy. To create the array file on disk:

    open('tmp.foo', 'wb').write(somenumericarray.tostring())

To load it back in as a flat array of type 'l':

    import maparray
    myarray, mymmapobj, myfile = maparray.maparray('tmp.foo')

Now myarray is a perfectly ordinary Numeric array whose data just happens to be stored in the file 'tmp.foo'.

If you want a different data type, you can specify its typecode, either positionally or by keyword:

    myarray, mymmapobj, myfile = maparray.maparray('tmp.foo', 'f')
    myarray, mymmapobj, myfile = maparray.maparray('tmp.foo', typecode='f')

You can specify a shape as well:

    myarray, mymmapobj, myfile = maparray.maparray('tmp.foo', 'f', (-1, 24))
    myarray, mymmapobj, myfile = maparray.maparray('tmp.foo', shape=(-1, 24))
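
The text above doesn't spell out how the -1 in the shape is handled. Assuming it follows the usual reshape convention, where a single -1 stands for "whatever makes the element count come out right", the arithmetic can be sketched as below. resolve_shape is a hypothetical helper written for illustration, not a function from maparray.

```python
def resolve_shape(n_items, shape):
    """Replace a single -1 in `shape` with the dimension that makes the
    total element count equal `n_items` (hypothetical helper)."""
    known = 1            # product of the dimensions that were given
    unknown_axes = []    # positions of any -1 entries
    for axis, dim in enumerate(shape):
        if dim == -1:
            unknown_axes.append(axis)
        else:
            known *= dim

    if len(unknown_axes) > 1:
        raise ValueError("at most one dimension may be -1")
    if not unknown_axes:
        if known != n_items:
            raise ValueError("shape does not match the number of items")
        return tuple(shape)
    if known == 0 or n_items % known != 0:
        raise ValueError("number of items is not a multiple of the known dimensions")

    resolved = list(shape)
    resolved[unknown_axes[0]] = n_items // known
    return tuple(resolved)
```

For example, a 960-byte file read as 4-byte items holds 240 of them, so under this assumption a shape of (-1, 24) would resolve to (10, 24).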

If you make changes via myarray and you want them reflected in the file before you delete all references to the mmap object and the array, do this:

    mymmapobj.flush()
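
To show the same write-then-flush pattern without Numeric installed, here is a small sketch using the standard library's mmap and memoryview as stand-ins for the array and mmap object above. The file name, typecode, and values are made up for the example.

```python
import mmap
import os
import tempfile
from array import array

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "tmp.foo")

    # Start with a file holding eight doubles, all zero.
    with open(path, "wb") as out:
        out.write(array("d", [0.0] * 8).tobytes())

    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)
        view = memoryview(mm).cast("d")
        view[3] = 2.5      # modifies the mapped page in memory
        mm.flush()         # ask the OS to write dirty pages to the file
        view.release()     # drop the view before closing the mapping
        mm.close()

    # Read the file back through an ordinary read to see the change.
    with open(path, "rb") as check:
        on_disk = array("d")
        on_disk.frombytes(check.read())
```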

Wishlist

It might be nice to have some of the following features:

knowing whether or not it works on Microsoft Windows.
read-only access; this has a couple of problems to solve:
- the mmap module on Microsoft Windows doesn't support read-only access.
- the Numeric module doesn't support read-only arrays, as far as I can tell, so the way you'd find out you were trying to write to a read-only mapping would be by a segmentation fault.
using part of a buffer instead of the whole thing (or mmapping part of a file instead of the whole thing).
not crashing if you close() the mmap object and then access myarray; unfortunately, this is very difficult, and probably the easiest way to do it is to provide a version of the mmap module that doesn't have a close() method.
support for explicit lengths, so you can create arrays this way too; on Unix, you can use ftruncate() to set the length of the file, but on any OS, you can write a bunch of zero bytes.
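
Until something like the last item exists, the two workarounds it mentions can be done by hand before mapping the file. The sketch below is one way to do that from Python. The function names and the 8-byte default item size are assumptions made for the example, and the file object's truncate() method is used in place of a direct ftruncate() call.

```python
import os
import tempfile


def create_by_truncate(path, n_items, item_size=8):
    """Create a file of n_items * item_size bytes by setting its length.

    On most systems the new bytes read back as zeros, but that is up to
    the platform.
    """
    with open(path, "wb") as f:
        f.truncate(n_items * item_size)


def create_by_zero_fill(path, n_items, item_size=8, chunk_items=65536):
    """Create a file of n_items * item_size bytes by writing zero bytes,
    one chunk at a time so the zeros never all sit in memory at once."""
    chunk = bytes(chunk_items * item_size)   # a block of zero bytes
    remaining = n_items
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(remaining, chunk_items)
            f.write(chunk[: n * item_size])
            remaining -= n
```

Either file can then be memory-mapped and filled in through the resulting array.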

kragen@pobox.com | Kragen's software | Kragen's home page