// This program demonstrates a piecewise-linear approximation of the
// ℓ² norm √(x² + y²) without multiplications, divisions, or square
// roots, for use when you don't have a hardware multiplier; it
// requires 11 arithmetic operations, one temporary register, and no
// branches. It's apparently good to about 2.2%, though that surely
// changes if you extend it to more dimensions. You can express it in
// two lines of C.

// The approximation is (2(ℓ∞ + ℓ¹) + (5ℓ¹ ↑ 7ℓ∞))/11.2, where:

//     a ↑ b = a if a ≥ b else b, the pairwise maximum operator;
//     ℓ¹ = |x| + |y|, the usual ℓ¹ norm;
//     ℓ∞ = |x| ↑ |y|, the usual ℓ∞ norm.

// Now, you might protest that this doesn't look very free of
// multiplications or divisions! It multiplies by 2, 5, 7, and 1/11.2
// (or divides by their reciprocals). We can handwave away the final
// scaling factor of 1/11.2 because in general such scaling factors
// can be delayed until the end of a calculation or handled by
// prescaling of thresholds. Multiplication by 2 in fixed point is a
// bit shift; multiplication by 5 is a bit shift and an addition;
// multiplication by 7 is a bit shift and a subtraction.

// I derived this approximation by fiddling around with fudge
// factors. Its contours are hexadecagons that are not quite regular.
// I think it's still a norm in the sense that it satisfies the three
// norm properties:

//     ∀a: ∀b: f(a + b) ≤ f(a) + f(b)
//     ∀a: ∀s ∈ ℝ: f(sa) = |s|f(a)
//     f(a) = 0 ⇔ a = 0⃗

// The worst-case error of this approximation seems to be about 2.2%,
// and at an assembly level, on a machine that can bit-shift its
// operands, we can compute it branchlessly as follows in 11
// operations with one temporary register:

//     t0 := -x
//     x  := t0 ↑ x          -- |x|
//     t0 := -y
//     y  := t0 ↑ y          -- |y|
//     t0 := x + y           -- ℓ¹
//     y  := x ↑ y           -- now we store ℓ∞ in y
//     x  := t0 + y          -- and ℓ¹ + ℓ∞ in x
//     t0 := t0 + (t0 << 2)  -- 5ℓ¹
//     y  := (y << 3) - y    -- 7ℓ∞
//     t0 := t0 ↑ y          -- 5ℓ¹ ↑ 7ℓ∞
//     rv := t0 + (x << 1)

// I don't think any shipped instruction set architecture can express
// this in fixed-point math in 11 instructions. ARM needs two
// instructions for two of the four "↑" operations, though the others
// can use the status flags from the negation. RISC-V assembly
// requires two instructions for all four; Thumb-2 requires an extra
// `it` for each. If we need to do bit shifts as separate operations,
// as on RISC-V, we need a second temporary register and three more
// operations. I think that gives us:

//     ARM:     13 instructions
//     Thumb-2: 17 instructions
//     RISC-V:  18 instructions

// And on RISC-V it's not branchless.

// As far as I know, though, every ARM chip ever made except the ARM1
// has hardware multiply. So there's no reason you'd ever want to use
// this algorithm on an ARM!

// I don't care how many x86 instructions it is because there aren't
// any 1-instruction-per-clock x86 implementations anyway, and almost
// all the x86 chips people use this millennium have not only
// hardware multipliers but hardware floating-point.

/* In graphviz:

   digraph approxl2 {
       rankdir=LR;
       {x -> "-x"} -> "|x|" -> {ℓ¹ ℓ∞} -> "ℓ∞ + ℓ¹" -> result;
       {y -> "-y"} -> "|y|" -> {ℓ¹ ℓ∞};
       ℓ¹ -> "5ℓ¹" -> "5ℓ¹ ↑ 7ℓ∞" -> result;
       ℓ∞ -> "7ℓ∞" -> "5ℓ¹ ↑ 7ℓ∞";
       result [label="2(ℓ∞ + ℓ¹) + (5ℓ¹ ↑ 7ℓ∞)"];
   }

   Looking at this in `dot -Tx11` makes it clear that the longest
   path length is 6 operations, 5 if we count the absolute-value
   function as one operation. So we can reasonably expect a
   superscalar machine to do it in 5 or 6 cycles, which is probably
   irrelevant: why would you ever put superscalar execution on a CPU
   that doesn't even have a hardware multiplier? Maybe it's relevant
   for horizontal-microcode or VLIW implementations. On an FPGA you'd
   probably pipeline it with 16 registers, 6 cycles of latency, and
   one result per cycle. I guess there are different tradeoffs
   available: you could in theory do it with only three registers and
   a 6-cycle state machine, but that would probably take up more area
   rather than less; on the other hand, you could probably shorten
   the pipeline by not registering as many intermediate results. */

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
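// For concreteness, here is one possible C transcription of the
// 11-operation sequence above. It's a sketch for reference only; the
// test program below doesn't use it, the names `imax` and
// `approx_l2_fixed` are mine, `imax` stands in for the "↑" operator,
// and the result is left scaled up by the factor of 11.2 discussed
// above, on the assumption that the caller prescales its thresholds
// rather than dividing. It also assumes the inputs are small enough
// that 5ℓ¹ and 7ℓ∞ don't overflow.

static int imax(int a, int b) { return a >= b ? a : b; }

static int approx_l2_fixed(int x, int y)  /* ≈ 11.2·√(x² + y²) */
{
  int t0 = -x;
  x = imax(t0, x);        /* |x| */
  t0 = -y;
  y = imax(t0, y);        /* |y| */
  t0 = x + y;             /* ℓ¹ */
  y = imax(x, y);         /* ℓ∞ */
  x = t0 + y;             /* ℓ¹ + ℓ∞ */
  t0 = t0 + (t0 << 2);    /* 5ℓ¹ */
  y = (y << 3) - y;       /* 7ℓ∞ */
  t0 = imax(t0, y);       /* 5ℓ¹ ↑ 7ℓ∞ */
  return t0 + (x << 1);   /* 2(ℓ¹ + ℓ∞) + (5ℓ¹ ↑ 7ℓ∞) */
}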
enum { NPOINTS = 1000 };
const double scale = 4.0;

typedef struct { double err, x, y, z, approx, actual; } spot;

// This already exists in ISO C99, so we don't need to define it:
// static inline double fmax(double a, double b)
// {
//   return a >= b ? a : b;
// }

// This program approximates the worst-case error of the
// approximation by scanning a grid of points.
int main(int argc, char **argv)
{
  spot here = {0.0}, worst = {0.0};

  for (size_t i = 0; i != NPOINTS; i++) {
    here.x = i * scale * 2 / (NPOINTS - 1) - scale;
    for (size_t j = 0; j != NPOINTS; j++) {
      here.y = j * scale * 2 / (NPOINTS - 1) - scale;
      // Commented-out code to test it in three dimensions, where it
      // does almost five times worse:
      // for (size_t k = 0; k != NPOINTS; k++) {
      {
        here.z = 0;  // k * scale * 2 / (NPOINTS - 1) - scale;

        // Compute the correct value of the ℓ² norm.
        here.actual = sqrt(here.x*here.x + here.y*here.y + here.z*here.z);

        // Compute the approximation. L8 is the ℓ∞ norm; L1 is the
        // ℓ¹ norm.
        double h = fabs(here.x), v = fabs(here.y), w = fabs(here.z),
               L8 = fmax(h, fmax(v, w)), L1 = h + v + w;
        here.approx = (2*(L8 + L1) + fmax(5*L1, 7*L8)) / 11.2;

        // Evaluate approximation quality.
        here.err = fabs(here.approx - here.actual);
        if (here.err > worst.err) worst = here;
      }
    }
  }

  printf("Worst error of the approximation in [%f, %f]ⁿ:\n",
         -scale, scale);
  printf("%f at (%f, %f, %f) (computed %f, actual %f, %.2f%% error)\n",
         worst.err, worst.x, worst.y, worst.z, worst.approx,
         worst.actual, 100*fabs(worst.err / worst.actual));
  return 0;
}
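// As a quick hand check of the formula: at (x, y) = (3, 4) we have
// ℓ¹ = 7 and ℓ∞ = 4, so the approximation gives
// (2·(4 + 7) + (35 ↑ 28))/11.2 = 57/11.2 ≈ 5.089 against the true
// √(3² + 4²) = 5, about 1.8% high, within the 2.2% worst case
// claimed above; approx_l2_fixed(3, 4) likewise returns 57.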