/* Fast Walsh–Hadamard Transform in GF(251).

Produces a “renormalized” FWHT of its 64-byte input in this Galois
field; < f ./fwht | ./fwht should reproduce (the first 64 bytes of)
the contents of f.

The idea here is to demonstrate the self-inverse property of the
transform; however, rounding and truncation issues interfere.  As an
example, here we see that applying the Walsh–Hadamard step to two
8-bit values produces a value outside the 8-bit range:

    >>> a, b = 68, 102
    >>> a, b = (a + b), (a - b)
    >>> a, b
    (170, -34)

Repeating the same transformation again we recover the original
numbers, but scaled by a factor of 2:

    >>> a, b = (a + b), (a - b)
    >>> a, b
    (136, 204)

By dividing by 2 we can recover the original numbers:

    >>> a/2, b/2
    (68, 102)

But in machine words of any fixed size (8 bits, 32 bits, whatever)
dividing by 2 is an information-losing operation.  So, we could do our
transforms mod 241, which is a prime, thus giving us a Galois field.
The multiplicative inverse of 2 mod 241 is 121, which has two square
roots in this field: the usual 11 and also 230.  So, each time we
compute a sum and difference, we multiply by 11 (one of the square
roots of 2⁻¹) to “normalize” the result:

    >>> a, b = 68, 102
    >>> a, b = (a + b) * 11 % 241, (a - b) * 11 % 241
    >>> a, b
    (183, 108)
    >>> a, b = (a + b) * 11 % 241, (a - b) * 11 % 241
    >>> a, b
    (68, 102)

If we want the transform to be information-preserving, doing the
internal steps of the algorithm in higher precision doesn’t help to
avoid this modulo and renormalization step, because the final output
of the first transform is the thing that has to be in higher
precision.

Mod 241 preserves printable ASCII but would of course err on bytes
with values over 240.  So it only handles 7.91 bits of data per byte,
not 8.

Mod 251 includes more byte values (7.97 bits of data), but there’s no
√½ mod 251.  But there *is* a √-½: 80 or 171; so a single level of the
FWHT isn’t self-inverse (when done twice it produces the additive
inverse of the input) but any even number of levels is:

    >>> a, b = 68, 102
    >>> a, b = (a + b) * 80 % 251, (a - b) * 80 % 251
    >>> a, b
    (46, 41)
    >>> a, b = (a + b) * 80 % 251, (a - b) * 80 % 251
    >>> a, b
    (183, 149)

These are the negatives of the original values:

    >>> -68 % 251, -102 % 251
    (183, 149)

So, repeating the transform step twice more we recover the original
values:

    >>> a, b = (a + b) * 80 % 251, (a - b) * 80 % 251
    >>> a, b
    (205, 210)
    >>> a, b = (a + b) * 80 % 251, (a - b) * 80 % 251
    >>> a, b
    (68, 102)

Or, more briefly:

    >>> f = lambda a, b: ((a + b) * 80 % 251, (a - b) * 80 % 251)
    >>> f(*f(*f(*f(68, 102))))
    (68, 102)

This means that to get a self-inverse FWHT with GF(251) the size of
the input (and output) needs to be a power of 4, not just 2.

It would be more efficient to do the renormalization just once, for
example by multiplying by 51, or once per transform, for example by
multiplying by 94 twice, instead of on every transform step, by
multiplying by 80 12 times, as we do.  But maybe it would actually be
more efficient to do the whole thing in GF(2⁸) instead!

Unfortunately in a finite field like this the renormalization factors
totally kill the continuity property of the usual FWHT, which is what
I’m actually interested in.  That is, in the usual FWHT, an
arbitrarily small change to any input point results in a
proportionally small change to every output point, but there’s no
applicable metric in GF(251).  You might think this would be useful
for cryptography, because it’s roughly the easiest way to get the
avalanche effect, but it’s completely linear (in GF(251)).  I’m no
cryptographer, but I feel like that linearity could be a big drawback.

 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>


typedef uint8_t u8;
typedef uint32_t u32;

int main(int argc, char **argv)
{
  u32 n = 0;
  if (argc == 2) n = atoi(argv[1]);

  /* see https://stackoverflow.com/questions/3436922/evaluate-whether-a-number-is-integer-power-of-4 */
  if ((n & (n - 1)) != 0 || (n & 0x55555555) == 0) {
    fprintf(stderr, "Usage: %s nbytes < data, where nbytes is a power of 4\n",
            argv[0]);
    return -1;
  }

  u8 *line = malloc(n);
  if (!line) {
    perror("malloc");
    return -1;
  }

  fprintf(stderr, "Expecting %d char%s of input.\n", n, (n == 1) ? "" : "s");

  for (size_t i = 0; i < n; i++) {
    int c = getchar();
    if (c < 0) {
      fprintf(stderr, "Only %d chars of input\n", (int)i);
      return -1;
    }
    line[i] = c;
  }

  /* s, the stride, identifies which dimension of the
   * lg(n)-dimensional hypercube we’re doing the DFT along in a given
   * iteration; i and j locate points in the other lg(n)-1
   * dimensions. */
  for (size_t s = 1; s < n; s <<= 1) {
    for (size_t i = 0; i < n; i += 2*s) {
      for (size_t j = i; j < i + s; j++) {
        size_t k = j + s;
        u8 a = line[j], b = line[k];
        line[j] = (a + b) * 80 % 251;
        /* The extra + 251 here is because of C’s semantics for division of negatives. */
        line[k] = (a - b + 251) * 80 % 251;
      }
    }
  }

  for (size_t i = 0; i < n; i++) putchar(line[i]);

  return 0;
}