// Not only don’t compilers usually beat humans on complicated code;
// they don’t even usually beat humans on *trivial* code. Here we’re
// comparing the implementation of `return this->i++;` with the
// `this->` implied. See uniqueid.cc. GCC 12.2.0 (Debian 12.2.0-14)
// produces the following elapsed times compiled with -O in three
// trials:
//
// - GCC 12 -O control (1 getUniqueId in inner loop): 2.20s, 2.21s,
//   2.21s
// - asm control: 1.96s, 1.95s, 1.96s
// - GCC 12 -O experimental (2 getUniqueIds in inner loop): 4.39s,
//   4.43s, 4.44s
// - asm experimental: 3.85s, 3.85s, 3.85s
//
// The difference from the control to the experimental is 2.18–2.23
// seconds over the loop’s billion iterations, which is 2.18–2.23
// nanoseconds per invocation of the method written in C++. The
// corresponding number for the assembly routine is 1.90–1.91. So the
// assembly version is about 15% faster. You’d expect it to be faster
// from looking at the GCC-emitted assembly, which is obviously
// stupid:
//
//     112a: 8b 07       mov (%rdi), %eax
//     112c: 8d 50 01    lea 0x1(%rax), %edx
//     112f: 89 17       mov %edx, (%rdi)
//     1131: c3          ret
//
// As opposed to the obvious mov eax, [rdi]; inc dword ptr [rdi]; ret,
// which is what I’m testing it against.
//
// Looks can be deceiving, especially on superscalar and out-of-order
// processors. Still, you’d naïvely expect that three instructions
// plus two memory accesses would be about 15% faster than four
// instructions plus two memory accesses, and that naïve expectation
// turns out to be correct. (More naïvely still, you’d expect the
// “obvious” version to have an additional memory access and be the
// same speed.)
//
// Compiler flags make no difference at all, and clang is just as bad,
// the only difference in the generated code being that it uses ECX
// rather than EDX for its temporary.
//
// With -Os we get:
// - GCC 12 -Os control: 2.21s, 2.27s, 2.24s
// - asm -Os control: 1.93s, 1.94s, 1.96s
// - GCC 12 -Os experimental: 4.42s, 4.42s, 4.44s
// - asm -Os experimental: 3.89s, 3.89s, 3.85s
//
// With -O5 we get:
// - GCC 12 -O5 control: 2.22s, 2.22s, 2.24s
// - asm -O5 control: 1.93s, 1.96s, 1.99s
// - GCC 12 -O5 experimental: 4.41s, 4.43s, 4.48s
// - asm -O5 experimental: 3.89s, 3.87s, 3.96s
//
// With Debian clang version 14.0.6 clang++ -O we get:
// - clang -O control: 2.26s, 2.24s, 2.21s
// - asm -O control: 1.96s, 1.93s, 1.96s
// - clang -O experimental: 4.40s, 4.44s, 4.55s
// - asm -O experimental: 3.90s, 3.89s, 3.89s
//
// With clang -Oz we get:
// - clang -Oz control: 2.28s, 2.24s, 2.21s
// - asm -Oz control: 1.93s, 1.93s, 1.96s
// - clang -Oz experimental: 4.40s, 4.50s, 4.44s
// - asm -Oz experimental: 3.90s, 3.88s, 3.88s
//
// With clang -O5 we get a warning (optimization level '-O5' is not
// supported; using '-O3' instead) and:
// - clang -O5 control: 2.20s, 2.22s, 2.21s
// - asm -O5 control: 2.03s, 1.93s, 1.97s
// - clang -O5 experimental: 4.46s, 4.44s, 4.42s
// - asm -O5 experimental: 3.85s, 3.94s, 3.89s

#include "uniqueid.h"

int main(int argc, char **argv)
{
  int n = 1000 * 1000 * 1000;
  UniqueId id;
  for (int i = 0; i != n; i++) {
    id.getUniqueId();
    //id.getUniqueId();         // comment this out for control group
  }

  return 0;
}