// Not only do compilers usually fail to beat humans on complicated
// code, they don’t even usually beat humans on *trivial* code.  Here
// we’re comparing the compiler’s implementation of
// `return this->i++;` (with the `this->` implied) against a
// hand-written assembly version.  See uniqueid.cc.  Compiled with -O
// by GCC 12.2.0 (Debian 12.2.0-14), the elapsed times in three
// trials are:
//
// - GCC 12 -O control (1 getUniqueId in inner loop): 2.20s, 2.21s,
//   2.21s
// - asm control: 1.96s, 1.95s, 1.96s
// - GCC 12 -O experimental (2 getUniqueIds in inner loop): 4.39s,
//   4.43s, 4.44s
// - asm experimental: 3.85s, 3.85s, 3.85s
//
// The loop runs 10^9 times, so the difference between the control
// and the experimental, 2.18–2.24 seconds, is 2.18–2.24 nanoseconds
// per invocation of the method written in C++.  The corresponding
// number for the assembly routine is 1.89–1.90.  So the assembly
// version is about 15% faster.
// You’d expect it to be faster from looking at the GCC-emitted
// assembly, which is obviously stupid:
//
// 112a: 8b 07     mov (%rdi), %eax
// 112c: 8d 50 01  lea 0x1(%rax), %edx
// 112f: 89 17     mov %edx, (%rdi)
// 1131: c3        ret
//
// As opposed to the obvious mov eax, [rdi]; inc dword ptr [rdi]; ret,
// which is what I’m testing it against.
//
// Looks can be deceiving, though, especially on superscalar and
// out-of-order processors.  Still, you’d naïvely expect that three
// instructions plus two memory accesses would be about 15% faster
// than four instructions plus two memory accesses, and that naïve
// expectation turns out to be correct.  (More naïvely still, you’d
// expect the “obvious” version to have an additional memory access,
// because inc reads [rdi] again before writing it, and so to be the
// same speed.)
//
// Compiler flags make no difference at all, and clang is just as bad,
// the only difference in the generated code being that it uses ECX
// rather than EDX for its temporary.
//
// With -Os we get:
// - GCC 12 -Os control: 2.21s, 2.27s, 2.24s
// - asm -Os control: 1.93s, 1.94s, 1.96s
// - GCC 12 -Os experimental: 4.42s, 4.42s, 4.44s
// - asm -Os experimental: 3.89s, 3.89s, 3.85s
//
// With -O5 we get:
// - GCC 12 -O5 control: 2.22s, 2.22s, 2.24s
// - asm -O5 control: 1.93s, 1.96s, 1.99s
// - GCC 12 -O5 experimental: 4.41s, 4.43s, 4.48s
// - asm -O5 experimental: 3.89s, 3.87s, 3.96s
//
// With Debian clang version 14.0.6 clang++ -O we get:
// - clang -O control: 2.26s, 2.24s, 2.21s
// - asm -O control: 1.96s, 1.93s, 1.96s
// - clang -O experimental: 4.40s, 4.44s, 4.55s
// - asm -O experimental: 3.90s, 3.89s, 3.89s
//
// With clang -Oz we get:
// - clang -Oz control: 2.28s, 2.24s, 2.21s
// - asm -Oz control: 1.93s, 1.93s, 1.96s
// - clang -Oz experimental: 4.40s, 4.50s, 4.44s
// - asm -Oz experimental: 3.90s, 3.88s, 3.88s
//
// With clang -O5 we get a warning (optimization level '-O5' is not
// supported; using '-O3' instead) and:
// - clang -O5 control: 2.20s, 2.22s, 2.21s
// - asm -O5 control: 2.03s, 1.93s, 1.97s
// - clang -O5 experimental: 4.46s, 4.44s, 4.42s
// - asm -O5 experimental: 3.85s, 3.94s, 3.89s

#include "uniqueid.h"
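
// A minimal sketch, not part of the original benchmark, of the
// arithmetic used in the header comment.  The helper names perCallNs
// and percentFaster are made up for illustration.  Subtracting the
// control time from the experimental time cancels the loop overhead
// and leaves the cost of the one extra call per iteration.

// Nanoseconds per extra call, given elapsed seconds for both runs.
constexpr double perCallNs(double experimentalSec, double controlSec,
                           long long iterations)
{
  return (experimentalSec - controlSec) * 1e9 / iterations;
}

// How much faster the quicker routine is, as a percentage of its own
// per-call time.
constexpr double percentFaster(double slowNs, double fastNs)
{
  return (slowNs / fastNs - 1.0) * 100.0;
}

// For example, perCallNs(4.39, 2.21, 1000000000) is about 2.18, the
// low end of the range quoted above for the GCC-compiled method.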

int main(int argc, char **argv)
{
  int n = 1000 * 1000 * 1000;
  UniqueId id;
  for (int i = 0; i != n; i++) {
    id.getUniqueId();
    //id.getUniqueId();           // uncomment for experimental group
  }
  return 0;
}