\ sample code from http://www.yosefk.com/blog/my-history-with-forth-stack-machines.html
: mean_std ( sum2 sum inv_len — mean std )
\ precise_mean = sum * inv_len;
tuck u* \ sum2 inv_len precise_mean
\ mean = precise_mean >> FRAC;
dup FRAC rshift -rot3 \ mean sum2 inv_len precise_mean
\ var = (((unsigned long long)sum2 * inv_len) >> FRAC) - (precise_mean * precise_mean >> (FRAC*2));
dup um* nip FRAC 2 * 32 - rshift -rot \ mean precise_mean^2 sum2 inv_len
um* 32 FRAC - lshift swap FRAC rshift or \ mean precise_mean^2 sum*inv_len
swap - isqrt \ mean std
;
( We have some fixed-point math here. Let’s admit that, and define
the fixed-point math primitives. It looks like one of sum and inv_len
is a fixed-point fraction, while the other is an ordinary integer; I’m
guessing that sum2 is the sum of the squares, and inv_len is the
reciprocal of the length, which therefore must be a fraction, while
sum and sum2 are ordinary integers. Adopting "." as an indicator
character for fixed-point arithmetic, even though that conflicts with
its usual Forth meaning of “output”: )
: s.* u* ; : .>s FRAC rshift ;
: .* um* 32 FRAC - lshift swap FRAC rshift or ;
\ Now this should be quite straightforward.
variable sumsq variable sum variable len_recip variable .mean
: variance sumsq @ len_recip @ s.* .mean @ dup .* - .>s ;
: mean_std len_recip ! sum ! sumsq !
sum @ len_recip @ s.* .mean !
.mean @ .>s variance isqrt ;
( I believe that the only performance differences between this and the
original version should be:
1. It doesn't use an integer multiply to multiply FRAC by 2 at
run-time!
2. There are some functions that would benefit from being inlined.
This is somewhat clumsy but doable even if the compiler doesn't do
optimization; you just end up with definitions like
: s.* postpone u* ; immediate
and the like. It’s probably also worthwhile to change `32 FRAC -
lshift` to `[ 32 FRAC - ] literal lshift` for a similar reason.
3. It uses memory operations instead of stack operations. The original
uses eight stack operations; this version uses ten memory operations
[plus a stack operation]. Reducing that a bit by using the stack
and/or the return stack would be pretty straightforward. Still, I
think that after sufficient optimization, you’ll always end up with
something as ugly and incomprehensible as the original version.
4. It might be wrong, since I haven’t tested it, and I assume Yossi’s
original code was copy-pasted from working code.
)