Can someone please verify that the slowness I see with the MS library implementation of `erfcf(x)` is reproducible elsewhere? It might just be a benchmarking quirk. I'd also welcome suggestions for command-line options to make gcc-compiled code run faster.
I have adopted @Njuffa's test harness and `erfcf(x)` function for my experiments on better approximations of the functions I am interested in. Because the benchmarks take a while and I like to have a progress bar, I added code to print a "." every so often. When the range of possible inputs is swept linearly, the dot printing stalls at tricky value ranges, and I use that to guide optimisations.
It has the advantage of system library support for `float erfcf(x)` and an accurate `double erfc(x)` to test against as a reference. In the process of testing I have noticed that the Intel 2024.1 ICX library is both fast and accurate, but there are some oddities with the Microsoft MSVC 17.1 implementation and inexplicable (to me) slowness in the dummy loop test when compiled with gcc 13.1. I suspect my lack of familiarity with gcc and/or systematic errors from running it in a virtual machine may be to blame, so I'd be grateful for timings on a native Linux system for comparison. Compiler options are `-O3`, inline everything for maximum speed, and FP mode precise.
This is the minimal reproducible example:
```cpp
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <math.h>

//#define RANDOM (1)

// helper routines
float uint32_as_float(uint32_t a)
{
    float r;
    memcpy(&r, &a, sizeof r);
    return r;
}

uint32_t float_as_uint32(float a)
{
    uint32_t r;
    memcpy(&r, &a, sizeof r);
    return r;
}

bool my_isnan(float x)
{
    // used here so that -ffast-math can't optimise it away (as it does with the system isnan())
    // other options can be used here
    // return __isnan((double)x);
    // return __isnanf(x);
    uint32_t ix = float_as_uint32(x);
    // return !((~ix) & 0x7f800000); // Intel's choice is slightly faster
    return (ix & 0x7f800000) == 0x7f800000;
}

float erfcdf(float x)
{
    return (float)erfc((double)x); // shim for calling double precision erfc
}

float dummy(float x)
{
    return x; // to determine loop overheads
}

void timefun(const char* name, float (*test_fun)(float), bool verbose)
{
    uint32_t argi, largi = 0;
    float arg, res, sum;
    time_t start, end;

    printf("\nTiming %s\n", name);
    argi = 0;
    sum = 0.0;
    start = clock();
    do {
        arg = uint32_as_float(argi);
        res = (*test_fun)(arg);
        if (!my_isnan(res)) sum += res;
#ifdef RANDOM
        argi = (argi * 1664525 + 1013904223); // ranqd1
#else
        argi++;
#endif
        if (verbose && ((argi & 0xff800000) != largi)) {
            end = clock();
            largi = argi & 0xff800000;
            printf("Exp %x : %6.3f\n", argi >> 20, (float)(end - start) / CLOCKS_PER_SEC);
            start = clock();
        }
        if ((argi & 0x3ffffff) == 0) printf(".");
    } while (argi);
    end = clock();
    printf("\ntime taken %6.2f sum = %g\n", (float)(end - start) / CLOCKS_PER_SEC, sum);
}

int main(void)
{
    timefun("dummy", dummy, false);
    timefun("erfcf", erfcf, false);
    timefun("erfcdf", erfcdf, false);
    return 0;
}
```
It should compile as-is with the gcc, Intel, or MS compilers. I'd be interested in benchmarks from other compilers too.
These are the figures (all times in seconds). The linear test runs every possible bit pattern of `x` (including NaNs) in sequence from `0` to `0xffffffff`. The random test uses ranqd1 to defeat branch prediction while still executing every possible value of `x`.
| Compiler | Dummy Linear | Linear erfcf | Linear erfc | Rand Dummy | Rand erfcf | Rand erfc |
|---|---|---|---|---|---|---|
| gcc 13.1 | 31 | 31.6 | 51.1 | 38.7 | 192 | 169 |
| Intel 2024.1 | 2.0 | 37.6 | 45.0 | 3.9 | 46.9 | 54.3 |
| MS 17.1 | 2.0 | 96.2 | 47.6 | 4.0 | 275 | 125 |
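(As a side note on why the random sweep still covers every input exactly once: ranqd1 is a full-period LCG modulo 2^32. A quick standalone check, separate from the timing harness, is this:)

```cpp
// Standalone check (not part of the harness) that the ranqd1 update
//   x -> 1664525*x + 1013904223 (mod 2^32)
// is a full-period LCG, so the random sweep visits every 32-bit pattern
// exactly once before the do/while loop sees argi == 0 again.
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0;
    uint64_t count = 0;
    do {
        x = x * 1664525u + 1013904223u;  // ranqd1 step, wraps mod 2^32
        count++;
    } while (x != 0);                    // back at the start => one full period
    printf("period = %llu (expect 4294967296)\n", (unsigned long long)count);
    return 0;
}
```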
I'm unhappy with GCC's behaviour on the `dummy(x)` routine, which simply returns `x`: it is an order of magnitude slower than the other compilers. It also seems far too slow when fed pseudo-random data. I presume my lack of familiarity with that compiler has led to it being much slower than it should be. The summation for `dummy` will overflow, so I suspect I haven't got the compiler options right to handle that situation efficiently.
The command line I'm using to build it on Linux is:

```
gcc -O3 -march=native -mavx2 -finline-functions -Winline erfc_njaffa6.cpp -lm
```
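Roughly equivalent command lines for the other two compilers would be something like the following (indicative only, not copied verbatim from my build; adjust as needed):

```
icx -O3 -march=native -fp-model=precise erfc_njaffa6.cpp
cl /O2 /arch:AVX2 /fp:precise erfc_njaffa6.cpp
```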
I am fairly convinced that the MS slowness is due to naive processing of denormals in the polynomials, which can be very slow. Break into the debugger when the dots stall and you will see denormal values of `x`. MS is the only compiler where `float erfcf` is 2x slower than `erfc` for doubles; with GCC and Intel the linear float versions are both marginally faster, by ~15% (as you might expect). GCC struggles and is slow on the pseudo-random test and I don't understand why. Any suggestions on how to make it run faster?
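One way to test the denormal hypothesis (an addition, not part of the harness above) is to turn on flush-to-zero and denormals-are-zero in the x86 MXCSR register and see whether the stalls in the MS `erfcf` sweep disappear. Call the two intrinsics once at the start of `main()` before the timing loops; note that this changes the results for denormal inputs and outputs.

```cpp
// Sketch: enable FTZ/DAZ on x86 so denormals are flushed to zero, to check
// whether the slow regions of the sweep are caused by denormal handling.
#include <stdio.h>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

int main(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // denormal results -> 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // denormal inputs  -> 0

    volatile float tiny = 1e-44f;        // a denormal input
    volatile float half = tiny * 0.5f;   // denormal in, denormal out
    printf("denormal * 0.5 = %g (prints 0 when FTZ/DAZ are active)\n", (double)half);
    return 0;
}
```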
There is also a question of accuracy. My benchmark for that shows that all of the `double erfc(x)` implementations give 0.5 ULP when rounded to float (not surprisingly), but only the Intel library implementation of `float erfcf(x)` achieves sub-ULP accuracy (0.82 ULP worst case). GCC comes second with 2.8 ULP and MS trails in third with 5.7 ULP.
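For reference, a minimal sketch of the kind of ULP measurement behind those figures (an assumed methodology, not the exact accuracy harness: the error is taken against the double-precision reference and expressed in units of the float spacing at the reference value):

```cpp
#include <stdio.h>
#include <math.h>

// |approx - reference| in ULPs, where 1 ULP is the float spacing at the
// (rounded) reference value
static double ulp_error(float approx, double reference)
{
    float  ref_f = (float)reference;
    double ulp   = (double)nextafterf(ref_f, INFINITY) - (double)ref_f;
    return fabs((double)approx - reference) / ulp;
}

int main(void)
{
    double worst   = 0.0;
    float  worst_x = 0.0f;
    // sweep every float in a representative range; the full test walks all
    // 2^32 bit patterns, which takes considerably longer
    for (float x = -6.0f; x < 28.0f; x = nextafterf(x, INFINITY)) {
        double e = ulp_error(erfcf(x), erfc((double)x));
        if (e > worst) { worst = e; worst_x = x; }
    }
    printf("worst error %.2f ULP at x = %a\n", worst, worst_x);
    return 0;
}
```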
So the MS `float erfcf` implementation is not only twice as slow as their double implementation but also roughly 10x less accurate. Even with the shims to convert to and from double, MS `erfc(x)` is 2x faster than `erfcf(x)`. Most odd...