Channel: Active questions tagged gcc - Stack Overflow

MSVC FP library float erfcf(x) unexpectedly slow - 2x slower than double erfc(x). Why?


Can someone please verify that the slowness I see with the MS library implementation of erfcf(x) is reproducible elsewhere? It might just be a benchmarking quirk. I would also welcome any suggestions for command-line options to make gcc-compiled code run faster.

I have adopted @Njuffa's test harness and erfcf(x) function for my experiments on better approximations of the functions that I am interested in. Because the benchmarks take a while and I like to have a progress bar, I added code to print a "." every so often. When the range of possible values is swept linearly, the dot printing stalls at tricky value ranges. I'm using that to guide optimisations.

It has the advantage of system library support for float erfcf(x) and an accurate double erfc(x) as a reference to test against. In the process of testing I have noticed that the Intel 2024.1 ICX library is both fast and accurate, but there are some oddities with the Microsoft MSVC 17.1 implementation, and inexplicable (to me) slowness in the dummy loop test when it is compiled with gcc 13.1. I suspect that my lack of familiarity with gcc, and/or systematic errors arising from running it in a virtual machine, may be to blame. I'd be grateful for timings on a native Linux system for comparison. Compiler options are -O3 with inlining of anything, for maximum speed, and FP mode precise.

This is the minimum reproducible example.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>   // bool, needed when compiled as C rather than C++
#include <string.h>
#include <time.h>
#include <math.h>

//#define RANDOM  (1)

// helper routines
float uint32_as_float(uint32_t a)
{
   float r;
   memcpy(&r, &a, sizeof r);
   return r;
}

uint32_t float_as_uint32(float a)
{
   uint32_t r;
   memcpy(&r, &a, sizeof r);
   return r;
}

bool my_isnan(float x)
{
  //    used here so that -ffast-math can't optimise it away (as it does with system isnan())
  //    other options can be used here
  //    return __isnan((double)x);
  //    return __isnanf(x);
  uint32_t ix = float_as_uint32(x);
  //    return !((~ix) & 0x7f800000);  // Intel's choice is slightly faster
  return (ix & 0x7f800000) == 0x7f800000;  // exponent all ones: also true for infinities
}

float erfcdf(float x)
{
  return (float)erfc((double)x);  // shim for calling double precision erfc
}

float dummy(float x)
{
  return x;  // to determine loop overheads
}

void timefun(const char* name, float (*test_fun)(float), bool verbose)
{
  uint32_t argi, largi = 0;
  float arg, res, sum;
  clock_t start, end;  // clock() returns clock_t, not time_t
  printf("\nTiming %s\n", name);
  argi = 0;
  sum = 0.0;
  start = clock();
  do {
      arg = uint32_as_float(argi);
      res = (*test_fun)(arg);
      if (!my_isnan(res)) sum += res;
#ifdef RANDOM
      argi = (argi * 1664525 + 1013904223); // ranqd1
#else
      argi++;
#endif
      if (verbose && ((argi & 0xff800000) != largi))
      {
          // report and restart the timer each time the sign/exponent field changes
          end = clock();
          largi = argi & 0xff800000;
          printf("Exp %x : %6.3f\n", argi >> 20, (float)(end - start) / CLOCKS_PER_SEC);
          start = clock();
      }
      if ((argi & 0x3ffffff) == 0) printf(".");  // progress dot every 2^26 values
  } while (argi);  // stops once argi returns to 0, i.e. after all 2^32 patterns
  end = clock();
  printf("\ntime taken %6.2f  sum = %g\n", (float)(end - start) / CLOCKS_PER_SEC, sum);
}

int main(void)
{
  timefun("dummy", dummy, false);
  timefun("erfcf", erfcf, false);
  timefun("erfcdf", erfcdf, false);
  return 0;
}
```

It should compile as is on any of gcc, Intel or MS compilers. I'd be interested in benchmarks on other compilers too.

These are the figures (times in seconds, as printed by the harness). Linear tests every possible bit pattern of x (including NaNs) in sequence from 0 to 0xffffffff. Random uses ranqd1 to defeat branch prediction while still executing every possible value of x.

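| Compiler     | Dummy Linear | Linear erfcf | Linear erfc | Rand Dummy | Rand erfcf | Rand erfc |
|--------------|--------------|--------------|-------------|------------|------------|-----------|
| gcc 13.1     | 31           | 31.6         | 51.1        | 38.7       | 192        | 169       |
| Intel 2024.1 | 2.0          | 37.6         | 45.0        | 3.9        | 46.9       | 54.3      |
| MS 17.1      | 2.0          | 96.2         | 47.6        | 4.0        | 275        | 125       |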
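The statement that the random sweep still visits every value of x rests on ranqd1 having full period modulo 2^32. As far as I know this follows from the Hull-Dobell conditions for a power-of-two modulus (increment odd, multiplier congruent to 1 mod 4). A minimal sketch that checks it empirically on the low bits, where the helper names `ranqd1_step` and `lcg_period` are hypothetical and not part of the harness:

```c
#include <stdint.h>

/* One step of the same generator the harness uses (arithmetic wraps mod 2^32). */
static uint32_t ranqd1_step(uint32_t x)
{
    return x * 1664525u + 1013904223u;
}

/* Count how many steps the low `bits` bits take to return to 0.
   The low k bits of an LCG mod 2^32 form an LCG mod 2^k with the same
   constants, so a full-period generator should need exactly 2^k steps. */
static uint64_t lcg_period(unsigned bits)
{
    uint32_t mask = (bits >= 32) ? 0xffffffffu : ((1u << bits) - 1u);
    uint32_t x = 0;
    uint64_t n = 0;
    do {
        x = ranqd1_step(x) & mask;
        n++;
    } while (x != 0);
    return n;
}
```

Calling `lcg_period(16)` is quick. `lcg_period(32)` walks all 2^32 states, which is the same amount of stepping the harness itself does in its random mode.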

I'm unhappy with the behaviour of GCC with the dummy(x) routine, which simply returns x. It is an order of magnitude slower than the others. It also seems far too slow when fed pseudo-random data. I presume that my lack of familiarity with that compiler has led to it being a lot slower than it should be. The summation for dummy will overflow, so I suspect that I haven't got the compiler options right to handle that situation efficiently.

The command line I'm using to build it on Linux is:

gcc -O3 -march=native -mavx2 -finline-functions -Winline erfc_njaffa6.cpp -lm

I am fairly convinced that the MS slowness is due to naive processing of denormals in the polynomial evaluation, which can be very slow. Break into the debugger when the dots stall and you will see denormal values for x. MS is the only one where float erfcf is 2x slower than erfc for doubles. With GCC and Intel, the float versions in the linear test are both marginally faster, by ~15% (as you might expect). GCC struggles and is slow in the pseudo-random test, and I don't understand why. Any suggestions on how to make it run faster?
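One way to probe the denormal hypothesis without a debugger is to time the subnormal bit patterns separately from a range of normal ones. The sketch below is only an illustration: `bits_to_float`, `is_subnormal_bits`, `identity_fun` and `time_range` are hypothetical helpers, not part of the harness above, and it only looks at subnormal inputs, not at subnormal intermediates inside the library.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Reinterpret a 32-bit pattern as an IEEE-754 binary32 value. */
static float bits_to_float(uint32_t a)
{
    float r;
    memcpy(&r, &a, sizeof r);
    return r;
}

/* Subnormal: exponent field all zero, mantissa non-zero (either sign). */
static int is_subnormal_bits(uint32_t a)
{
    return (a & 0x7f800000u) == 0 && (a & 0x007fffffu) != 0;
}

/* Baseline with the same role as dummy() in the harness. */
static float identity_fun(float x)
{
    return x;
}

/* Time fun over the inclusive bit-pattern range [lo, hi], in seconds. */
static double time_range(float (*fun)(float), uint32_t lo, uint32_t hi)
{
    volatile float sink = 0.0f;   /* keeps the calls from being optimised away */
    clock_t start = clock();
    for (uint32_t i = lo; ; i++) {
        sink = fun(bits_to_float(i));
        if (i == hi) break;       /* test before increment so hi = 0xffffffff is safe */
    }
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

For example, comparing `time_range(erfcf, 0x00000001u, 0x007fffffu)` (all positive subnormals) against `time_range(erfcf, 0x3f800000u, 0x3fffffffu)` (the same number of values in [1, 2)) would show whether subnormal arguments are disproportionately expensive on a given library.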

There is also a question of accuracy. My benchmark for that shows that all of the double erfc(x) implementations give 0.5 ULP when rounded to float (not surprisingly), but that only the Intel library implementation of float erfcf(x) achieves sub-ULP accuracy (0.82 ULP worst case). GCC comes second with 2.8 ULP, and MS trails in third with 5.7 ULP.

So the MS float erfcf implementation is not only twice as slow as their double implementation but also 10x less accurate. Even with the shims to convert to and from double, MS erfc(x) is 2x faster than erfcf(x). Most odd...
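For anyone wanting to reproduce the ULP figures, here is a minimal sketch of one common way to express the error of a float result in ULPs against a double reference. It is not necessarily how my benchmark measures it: the function name `ulp_error` is hypothetical, the ULP is taken as the spacing between the rounded reference and the next float up in magnitude, and infinities and NaNs are not handled.

```c
#include <stdint.h>
#include <string.h>

/* Error of `res` relative to the double reference `ref`, in units of the
   float spacing at the reference value rounded to float.
   Assumes ref is finite and within float range. */
static double ulp_error(float res, double ref)
{
    float r = (float)ref;             /* reference rounded to float */
    uint32_t bits, next_bits;
    float lo, hi;
    double ulp, err;

    memcpy(&bits, &r, sizeof bits);
    bits &= 0x7fffffffu;              /* clear sign: work with |r| */
    next_bits = bits + 1u;            /* next float up in magnitude */
    memcpy(&lo, &bits, sizeof lo);
    memcpy(&hi, &next_bits, sizeof hi);
    ulp = (double)hi - (double)lo;    /* float spacing at |r| */

    err = (double)res - ref;
    if (err < 0.0) err = -err;
    return err / ulp;
}
```

In the harness this would be called as `ulp_error(erfcf(x), erfc((double)x))` for each x, keeping the maximum over the sweep. A correctly rounded result gives at most 0.5 under this definition.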

