This web page is a live Burrows-Wheeler transform, the heart of bzip2. I've tried to use the same notation as in Burrows and Wheeler's original 1994 paper, “A Block-sorting Lossless Compression Algorithm” (DEC SRC Research Report 124), except in a few cases where I thought their choices made the algorithm hard to understand; in those cases I have chosen non-conflicting notation.
We carry out the transform of this text S in the matrices below, first building a matrix whose rows are all the rotations of the text, then sorting the rows lexicographically to get the matrix M, whose last column L is the output of the transform, along with I, the row number at which we find the original text.
[Live demo: the matrix of all rotations of S, which sorts into M; the demo displays L, the last column of M, and I, the row of M containing the original text.]
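Here's a minimal Python sketch of the forward transform exactly as just described; it builds the whole matrix, which real implementations avoid (see below):

```python
def bwt(s):
    """Naive forward BWT.

    Build every rotation of s, sort them to form M, and return the last
    column L together with I, the row of M holding the original text.
    """
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    m = sorted(rotations)                # the sorted matrix M
    L = ''.join(row[-1] for row in m)    # last column of M
    I = m.index(s)                       # row number of the original text
    return L, I

# For example, bwt('banana') returns ('nnbaaa', 3).
```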
For the reverse transform, first we sort L to obtain F, the first column of M:
[Live demo: F, the characters of L in sorted order.]
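In the Python sketch, this step is a one-liner:

```python
F = ''.join(sorted(L))   # first column of M: the characters of L in sorted order
```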
So now we know the first and last columns of M.
Now this is the clever part. We rotate M one column to the right, getting a new matrix M'.
[Live demo: M beside M', the matrix obtained by rotating M one column to the right.]
Now we need to find all of the missing characters, at least in one row.
Because the rows of this new matrix M' are all of the rotations of the original text S, just as the rows of the old matrix M are, there is some permutation of rows that transforms M into M'. And we can compute it: we know the rows of M are lexicographically sorted, and so are the rows of M' if we disregard the first character. So, among the rows that begin with some character ch, the kth such row is the same row in both matrices.
So we just need to tabulate (ch, k) pairs for F, which is the first column of M, and for L, which is the first column of M', and we can find the permutation vector TT such that row j of M (beginning with F[j]) is the same as row TT[j] of M' (beginning with L[TT[j]]).
[Live demo: a table showing, for each row j: F[j], its occurrence index kF, L[j], its occurrence index kL, TT[j], and L[TT[j]].]
(If you're reading the original paper, note that this TT is exactly the inverse of the vector T in the original paper, in the sense that T[i] = j iff TT[j] = i.)
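Here's how that tabulation might look in Python, continuing the sketch above (compute_tt is my name for it, not the paper's):

```python
from collections import defaultdict

def compute_tt(F, L):
    """Compute TT by matching the kth occurrence of each character ch
    in F with the kth occurrence of ch in L."""
    occurrences = defaultdict(list)      # positions of each character in L
    for i, ch in enumerate(L):
        occurrences[ch].append(i)
    seen = defaultdict(int)              # occurrences of ch consumed so far
    TT = [0] * len(F)
    for j, ch in enumerate(F):
        TT[j] = occurrences[ch][seen[ch]]
        seen[ch] += 1
    return TT

# For L = 'nnbaaa' and F = 'aaabnn', this gives TT = [3, 4, 5, 2, 0, 1].
```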
This vector tells us how to permute the rows of M to shift it one character to the left.
Consequently, we can repeatedly use it to permute F to get the various columns of M. Or, if we're only interested in one row of M, such as row I (the original text S), we can use it to jump around in F until we're done. Slightly more formally, j₀ = I and, for i > 0, jᵢ = TT[jᵢ₋₁]; the ith character of S is then F[jᵢ].
[Live demo: the chain of rows j, TT[j], and the characters F[j] read off along the way.]
And, unless something has gone wrong, F[j] is our original string S.
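Putting the pieces together, the whole inverse transform is short (again a sketch, using compute_tt from above):

```python
def inverse_bwt(L, I):
    """Recover S from (L, I): sort L to get F, compute TT, then follow
    the chain of rows starting at I, reading F as we go."""
    F = ''.join(sorted(L))
    TT = compute_tt(F, L)
    out, j = [], I
    for _ in range(len(L)):
        out.append(F[j])
        j = TT[j]
    return ''.join(out)

# inverse_bwt('nnbaaa', 3) returns 'banana'.
```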
The transform can be done without allocating memory for an N×N matrix as shown above. The matrix M can be represented merely by a collection of pointers into the source text, which can be sorted according to the text they point to without copying that text. Usually, the comparisons for the sorting won't have to look at more than a few characters of each row, but in the worst case, they may have to examine nearly the entire text. Consider aaaaaaaaab or xoxoxoxoxo~, or worst of all, aaaaaaaaaa.
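A sketch of that pointer-based approach, where the “pointers” are just start offsets into S, compared lazily:

```python
import functools

def bwt_pointers(s):
    """Forward BWT without building the N x N matrix: sort rotation
    start offsets, comparing characters of s on demand."""
    n = len(s)
    def compare(i, j):
        for k in range(n):               # worst case examines all n characters
            a, b = s[(i + k) % n], s[(j + k) % n]
            if a != b:
                return -1 if a < b else 1
        return 0
    order = sorted(range(n), key=functools.cmp_to_key(compare))
    L = ''.join(s[(i - 1) % n] for i in order)   # char preceding each rotation
    I = order.index(0)                           # the rotation starting at 0 is S
    return L, I
```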
However, there are faster algorithms available; the transform can be done in linear time (and excessive space) by constructing a “PATRICIA” or suffix tree, or in O(n log n) time by using the algorithm of Manber and Myers to construct a suffix array. These algorithms were already known in 1994 when Burrows and Wheeler published their paper; further improvements have been made since then by Fenwick, Seward, Sadakane, Kurtz, Larsson, Itoh, Tanaka, Kao, Manzini, and Ferragina.
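To give the flavor of the suffix-array connection: if you append a sentinel character smaller than everything else, sorting the suffixes is equivalent to sorting the rotations, and L falls out directly. The naive sort below is only a sketch (and bwt_via_suffix_array is my name for it); Manber and Myers's algorithm, or a linear-time construction, would replace the sorted() call:

```python
def bwt_via_suffix_array(s, sentinel='\0'):
    """BWT from a suffix array.  The sentinel guarantees comparisons
    never wrap around, so sorted suffixes = sorted rotations; the
    character preceding suffix sa[i] is then L[i].  No separate I is
    needed: the sentinel's position in L marks the original row."""
    t = s + sentinel
    sa = sorted(range(len(t)), key=lambda i: t[i:])   # naive suffix array
    return ''.join(t[i - 1] for i in sa)              # t[-1] is the sentinel
```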
There’s also the question: “Why is this useful?” After all, the BWT output L is the same size as the input S; it's a permutation of it. It's not compressed! The answer is that the new string L is usually much easier to compress; characters that occur in similar contexts have been moved together. In xoxoxoxoxo~, for example, all of the x’s are moved together because they precede o’s, and all of the o’s are moved together because they precede x’s, except for the one which precedes a ~. In a more complex example such as she sells seashells by the seashore, which transforms to
sseeyee hhsshsrtssseellholl   eaa b

(note the run of three spaces before eaa), similar characters are still often brought together; for example, it starts with the two s's that precede spaces, followed by the two e's and the y that precede other spaces, then the two e's that precede a's. Then it proceeds to the three h's, three s's, and one r that precede e's; and so on. Two of those three s's are together in the output string because they preceded ea. In a longer text, there would be more.
So each region of the output string is dominated by a few distinct characters, and even within such a region, you tend to get clusters of the same character. This makes it very easy to compress with a variety of locally-adaptive techniques. The original paper suggested move-to-front coding followed by Huffman coding, but a variety of improvements have been devised over the years.
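For a taste of why L compresses well, here's a sketch of move-to-front coding: runs of the same character in L turn into runs of zeros, which are cheap for the entropy coder to encode.

```python
def move_to_front(L, alphabet):
    """Emit each character's current position in the table, then move
    that character to the front, so recently-seen characters get small
    numbers."""
    table = sorted(alphabet)
    out = []
    for ch in L:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))    # move ch to the front
    return out

# move_to_front('nnbaaa', set('nnbaaa')) returns [2, 0, 2, 2, 0, 0].
```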
The comp.compression FAQ has a pretty good section on the BWT, and data-compression.info's page is pretty great too.