Can someone please verify that the slowness I see with the MS library implementation of `erfcf(x)` is reproducible elsewhere? It might just be a benchmarking quirk. I'd also welcome suggestions for command-line options to make gcc-compiled code run faster.
I have adopted @Njuffa's test harness and `erfcf(x)` function for my experiments on better approximations of the functions I am interested in. Because the benchmarks take a while and I like to have a progress bar, I added code to print a "." every so often. When the range of possible inputs is swept linearly, the dot printing stalls at tricky value ranges, and I use that to guide optimisations.
It has the advantage of system library support for `float erfcf(x)` and an accurate `double erfc(x)` to test against as a reference. In the process of testing I have noticed that the Intel 2024.1 ICX library is both fast and accurate, but there are some oddities with the Microsoft MSVC 17.1 implementation and inexplicable (to me) slowness in the dummy loop test when compiled with gcc 13.1. I suspect my lack of familiarity with gcc and/or systematic errors from running it in a virtual machine may be to blame, so I'd be grateful for timings on a native Linux system for comparison. Compiler options are `-O3`, inline everything for maximum speed, and FP mode precise.
This is the minimal reproducible example:
```cpp
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <math.h>

//#define RANDOM (1)

// helper routines
float uint32_as_float(uint32_t a)
{
    float r;
    memcpy(&r, &a, sizeof r);
    return r;
}

uint32_t float_as_uint32(float a)
{
    uint32_t r;
    memcpy(&r, &a, sizeof r);
    return r;
}

bool my_isnan(float x)
{
    // used here so that -ffast-math can't optimise it away (as it does with the system isnan())
    // other options can be used here
    // return __isnan((double)x);
    // return __isnanf(x);
    uint32_t ix = float_as_uint32(x);
    // return !((~ix) & 0x7f800000); // Intel's choice is slightly faster
    return (ix & 0x7f800000) == 0x7f800000;
}

float erfcdf(float x)
{
    return (float)erfc((double)x); // shim for calling double precision erfc
}

float dummy(float x)
{
    return x; // to determine loop overheads
}

void timefun(const char* name, float (*test_fun)(float), bool verbose)
{
    uint32_t argi, largi = 0;
    float arg, res, sum;
    time_t start, end;

    printf("\nTiming %s\n", name);
    argi = 0;
    sum = 0.0;
    start = clock();
    do {
        arg = uint32_as_float(argi);
        res = (*test_fun)(arg);
        if (!my_isnan(res)) sum += res;
#ifdef RANDOM
        argi = (argi * 1664525 + 1013904223); // ranqd1
#else
        argi++;
#endif
        if (verbose && ((argi & 0xff800000) != largi)) {
            end = clock();
            largi = argi & 0xff800000;
            printf("Exp %x : %6.3f\n", argi >> 20, (float)(end - start) / CLOCKS_PER_SEC);
            start = clock();
        }
        if ((argi & 0x3ffffff) == 0) printf(".");
    } while (argi);
    end = clock();
    printf("\ntime taken %6.2f sum = %g\n", (float)(end - start) / CLOCKS_PER_SEC, sum);
}

int main(void)
{
    timefun("dummy", dummy, false);
    timefun("erfcf", erfcf, false);
    timefun("erfcdf", erfcdf, false);
    return 0;
}
```
It should compile as-is with the gcc, Intel, or MS compilers. I'd be interested in benchmarks from other compilers too.
These are the figures (all times in seconds). The linear test runs every possible bit pattern of `x` (including NaNs) in sequence from `0` to `0xffffffff`. The random test uses ranqd1 to defeat branch prediction while still executing every possible value of `x`.
| Compiler | Dummy Linear | Linear erfcf | Linear erfc | Rand Dummy | Rand erfcf | Rand erfc |
|---|---|---|---|---|---|---|
| gcc 13.1 | 31 | 31.6 | 51.1 | 38.7 | 192 | 169 |
| Intel 2024.1 | 2.0 | 37.6 | 45.0 | 3.9 | 46.9 | 54.3 |
| MS 17.1 | 2.0 | 96.2 | 47.6 | 4.0 | 275 | 125 |
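(As a side note on why the random sweep still covers every input exactly once: ranqd1 is a full-period LCG modulo 2^32. A quick standalone check, separate from the timing harness, is this:)

```cpp
// Standalone check (not part of the harness) that the ranqd1 update
//   x -> 1664525*x + 1013904223 (mod 2^32)
// is a full-period LCG, so the random sweep visits every 32-bit pattern
// exactly once before the do/while loop sees argi == 0 again.
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0;
    uint64_t count = 0;
    do {
        x = x * 1664525u + 1013904223u;  // ranqd1 step, wraps mod 2^32
        count++;
    } while (x != 0);                    // back at the start => one full period
    printf("period = %llu (expect 4294967296)\n", (unsigned long long)count);
    return 0;
}
```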
I'm unhappy with GCC's behaviour on the `dummy(x)` routine, which simply returns `x`: it is an order of magnitude slower than the other compilers. It also seems far too slow when fed pseudo-random data. I presume my lack of familiarity with that compiler has led to it being much slower than it should be. The summation for `dummy` will overflow, so I suspect I haven't got the compiler options right to handle that situation efficiently.
The command line I'm using to build it on Linux is:

```
gcc -O3 -march=native -mavx2 -finline-functions -Winline erfc_njaffa6.cpp -lm
```
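Roughly equivalent command lines for the other two compilers would be something like the following (indicative only, not copied verbatim from my build; adjust as needed):

```
icx -O3 -march=native -fp-model=precise erfc_njaffa6.cpp
cl /O2 /arch:AVX2 /fp:precise erfc_njaffa6.cpp
```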
I am fairly convinced that the MS slowness is due to naive processing of denormals in the polynomials, which can be very slow. Break into the debugger when the dots stall and you will see denormal values of `x`. MS is the only compiler where `float erfcf` is 2x slower than `erfc` for doubles; with GCC and Intel the linear float versions are both marginally faster, by ~15% (as you might expect). GCC struggles and is slow on the pseudo-random test and I don't understand why. Any suggestions on how to make it run faster?
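One way to test the denormal hypothesis (an addition, not part of the harness above) is to turn on flush-to-zero and denormals-are-zero in the x86 MXCSR register and see whether the stalls in the MS `erfcf` sweep disappear. Call the two intrinsics once at the start of `main()` before the timing loops; note that this changes the results for denormal inputs and outputs.

```cpp
// Sketch: enable FTZ/DAZ on x86 so denormals are flushed to zero, to check
// whether the slow regions of the sweep are caused by denormal handling.
#include <stdio.h>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

int main(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // denormal results -> 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // denormal inputs  -> 0

    volatile float tiny = 1e-44f;        // a denormal input
    volatile float half = tiny * 0.5f;   // denormal in, denormal out
    printf("denormal * 0.5 = %g (prints 0 when FTZ/DAZ are active)\n", (double)half);
    return 0;
}
```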
There is also a question of accuracy. My benchmark for that shows that all of the `double erfc(x)` implementations give 0.5 ULP when rounded to float (not surprisingly), but only the Intel library implementation of `float erfcf(x)` achieves sub-ULP accuracy (0.82 ULP worst case). GCC comes second with 2.8 ULP and MS trails in third with 5.7 ULP.
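For reference, a minimal sketch of the kind of ULP measurement behind those figures (an assumed methodology, not the exact accuracy harness: the error is taken against the double-precision reference and expressed in units of the float spacing at the reference value):

```cpp
#include <stdio.h>
#include <math.h>

// |approx - reference| in ULPs, where 1 ULP is the float spacing at the
// (rounded) reference value
static double ulp_error(float approx, double reference)
{
    float  ref_f = (float)reference;
    double ulp   = (double)nextafterf(ref_f, INFINITY) - (double)ref_f;
    return fabs((double)approx - reference) / ulp;
}

int main(void)
{
    double worst   = 0.0;
    float  worst_x = 0.0f;
    // sweep every float in a representative range; the full test walks all
    // 2^32 bit patterns, which takes considerably longer
    for (float x = -6.0f; x < 28.0f; x = nextafterf(x, INFINITY)) {
        double e = ulp_error(erfcf(x), erfc((double)x));
        if (e > worst) { worst = e; worst_x = x; }
    }
    printf("worst error %.2f ULP at x = %a\n", worst, worst_x);
    return 0;
}
```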
So the MS `float erfcf` implementation is not only twice as slow as their double implementation but also roughly 10x less accurate. Even with the shims to convert to and from double, MS `erfc(x)` is 2x faster than `erfcf(x)`. Most odd...