// This program demonstrates a piecewise-linear approximation of the
// ℓ² norm √(x² + y²) without multiplications, divisions, or square
// roots, for use when you don't have a hardware multiplier; it
// requires 11 arithmetic operations, one temporary register, and no
// branches. It's apparently good to about 2.2%, though that surely
// changes if you extend it to more dimensions. You can express it in
// two lines of C.

// The approximation is (2(ℓ∞ + ℓ¹) + (5ℓ¹ ↑ 7ℓ∞))/11.2, where:

//     a ↑ b = a if a ≥ b else b, the pairwise maximum operator;
//     ℓ¹ = |x| + |y|, the usual ℓ¹ norm;
//     ℓ∞ = |x| ↑ |y|, the usual ℓ∞ norm.

// Now, you might protest that this doesn't look very free of
// multiplications or divisions! It multiplies by 2, 5, 7, and 1/11.2
// (or divides by their reciprocals). We can handwave away the final
// scaling factor of 1/11.2 because in general such scaling factors
// can be delayed until the end of a calculation or handled by
// prescaling of thresholds. Multiplication by 2 in fixed point is a
// bit shift; multiplication by 5 is a bit shift and an addition;
// multiplication by 7 is a bit shift and a subtraction.

// I derived this approximation by fiddling around with fudge
// factors. Its contours are hexadecagons that are not quite regular.
// I think it's still a norm in the sense that it satisfies the three
// norm properties:

//     ∀a: ∀b: f(a + b) ≤ f(a) + f(b)
//     ∀a: ∀s ∈ ℝ: f(sa) = |s|f(a)
//     f(a) = 0 ⇔ a = 0⃗

// The worst-case error of this approximation seems to be about 2.2%,
// and at an assembly level, on a machine that can bit-shift its
// operands, we can compute it branchlessly as follows in 11
// operations with one temporary register:

//     t0 := -x
//     x  := t0 ↑ x          -- |x|
//     t0 := -y
//     y  := t0 ↑ y          -- |y|
//     t0 := x + y           -- ℓ¹
//     y  := x ↑ y           -- now we store ℓ∞ in y
//     x  := t0 + y          -- and ℓ¹ + ℓ∞ in x
//     t0 := t0 + (t0 << 2)  -- 5ℓ¹
//     y  := (y << 3) - y    -- 7ℓ∞
//     t0 := t0 ↑ y          -- 5ℓ¹ ↑ 7ℓ∞
//     rv := t0 + (x << 1)

// I don't think any shipped instruction set architecture can express
// this in fixed-point math in 11 instructions. ARM needs two
// instructions for two of the four "↑" operations, though the others
// can use the status flags from the negation. RISC-V assembly
// requires two instructions for all four; Thumb-2 requires an extra
// `it` for each. If we need to do bit shifts as separate operations,
// as on RISC-V, we need a second temporary register and three more
// operations. I think that gives us:

//     ARM:     13 instructions
//     Thumb-2: 17 instructions
//     RISC-V:  18 instructions

// And on RISC-V it's not branchless.

// As far as I know, though, every ARM chip ever made except the ARM1
// has hardware multiply. So there's no reason you'd ever want to use
// this algorithm on an ARM!

// I don't care how many x86 instructions it is because there aren't
// any 1-instruction-per-clock x86 implementations anyway, and almost
// all the x86 chips people use this millennium have not only
// hardware multipliers but hardware floating-point.

/* In graphviz:

   digraph approxl2 {
       rankdir=LR;
       {x -> "-x"} -> "|x|" -> {ℓ¹ ℓ∞} -> "ℓ∞ + ℓ¹" -> result;
       {y -> "-y"} -> "|y|" -> {ℓ¹ ℓ∞};
       ℓ¹ -> "5ℓ¹" -> "5ℓ¹ ↑ 7ℓ∞" -> result;
       ℓ∞ -> "7ℓ∞" -> "5ℓ¹ ↑ 7ℓ∞";
       result [label="2(ℓ∞ + ℓ¹) + (5ℓ¹ ↑ 7ℓ∞)"];
   }

   Looking at this in `dot -Tx11` makes it clear that the longest
   path length is 6 operations, 5 if we count the absolute-value
   function as one operation. So we can reasonably expect a
   superscalar machine to do it in 5 or 6 cycles, which is probably
   irrelevant: why would you ever put superscalar execution on a CPU
   that doesn't even have a hardware multiplier? Maybe it's relevant
   for horizontal-microcode or VLIW implementations. On an FPGA you'd
   probably pipeline it with 16 registers, 6 cycles of latency, and
   one result per cycle. I guess there are different tradeoffs
   available: you could in theory do it with only three registers and
   a 6-cycle state machine, but that would probably take up more area
   rather than less; on the other hand, you could probably shorten
   the pipeline by not registering as many intermediate results. */

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
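// For concreteness, here is one possible C transcription of the
// 11-operation sequence above. It's a sketch for reference only; the
// test program below doesn't use it, the names `imax` and
// `approx_l2_fixed` are mine, `imax` stands in for the "↑" operator,
// and the result is left scaled up by the factor of 11.2 discussed
// above, on the assumption that the caller prescales its thresholds
// rather than dividing. It also assumes the inputs are small enough
// that 5ℓ¹ and 7ℓ∞ don't overflow.

static int imax(int a, int b) { return a >= b ? a : b; }

static int approx_l2_fixed(int x, int y)  /* ≈ 11.2·√(x² + y²) */
{
  int t0 = -x;
  x = imax(t0, x);        /* |x| */
  t0 = -y;
  y = imax(t0, y);        /* |y| */
  t0 = x + y;             /* ℓ¹ */
  y = imax(x, y);         /* ℓ∞ */
  x = t0 + y;             /* ℓ¹ + ℓ∞ */
  t0 = t0 + (t0 << 2);    /* 5ℓ¹ */
  y = (y << 3) - y;       /* 7ℓ∞ */
  t0 = imax(t0, y);       /* 5ℓ¹ ↑ 7ℓ∞ */
  return t0 + (x << 1);   /* 2(ℓ¹ + ℓ∞) + (5ℓ¹ ↑ 7ℓ∞) */
}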
enum { NPOINTS = 1000 };
const double scale = 4.0;

typedef struct { double err, x, y, z, approx, actual; } spot;

// This already exists in ISO C99, so we don't need to define it:
// static inline double fmax(double a, double b)
// {
//   return a >= b ? a : b;
// }

// This program approximates the worst-case error of the
// approximation by scanning a grid of points.
int main(int argc, char **argv)
{
  spot here = {0.0}, worst = {0.0};

  for (size_t i = 0; i != NPOINTS; i++) {
    here.x = i * scale * 2 / (NPOINTS - 1) - scale;
    for (size_t j = 0; j != NPOINTS; j++) {
      here.y = j * scale * 2 / (NPOINTS - 1) - scale;
      // Commented-out code to test it in three dimensions, where it
      // does almost five times worse:
      // for (size_t k = 0; k != NPOINTS; k++) {
      {
        here.z = 0;  // k * scale * 2 / (NPOINTS - 1) - scale;

        // Compute the correct value of the ℓ² norm.
        here.actual = sqrt(here.x*here.x + here.y*here.y + here.z*here.z);

        // Compute the approximation. L8 is the ℓ∞ norm; L1 is the
        // ℓ¹ norm.
        double h = fabs(here.x), v = fabs(here.y), w = fabs(here.z),
               L8 = fmax(h, fmax(v, w)), L1 = h + v + w;
        here.approx = (2*(L8 + L1) + fmax(5*L1, 7*L8)) / 11.2;

        // Evaluate approximation quality.
        here.err = fabs(here.approx - here.actual);
        if (here.err > worst.err) worst = here;
      }
    }
  }

  printf("Worst error of the approximation in [%f, %f]ⁿ:\n",
         -scale, scale);
  printf("%f at (%f, %f, %f) (computed %f, actual %f, %.2f%% error)\n",
         worst.err, worst.x, worst.y, worst.z, worst.approx,
         worst.actual, 100*fabs(worst.err / worst.actual));
  return 0;
}
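// As a quick hand check of the formula: at (x, y) = (3, 4) we have
// ℓ¹ = 7 and ℓ∞ = 4, so the approximation gives
// (2·(4 + 7) + (35 ↑ 28))/11.2 = 57/11.2 ≈ 5.089 against the true
// √(3² + 4²) = 5, about 1.8% high, within the 2.2% worst case
// claimed above; approx_l2_fixed(3, 4) likewise returns 57.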