Channel: Active questions tagged gcc - Stack Overflow

Why is gcc so much worse than clang at vectorizing a conditional multiply over std::vector<float>?


Consider the following float loop, compiled with -O3 -mavx2 -mfma:

    for (auto i = 0; i < a.size(); ++i) {
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
    }
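For anyone who wants to reproduce this, here is a minimal self-contained sketch of the loop. The function name conditional_multiply and the signature (vectors passed by reference) are assumptions made for illustration, since only the loop itself is shown above. The sketch also uses a std::size_t index and a 0.0f literal instead of `auto i = 0` and `0`, which avoids the signed/unsigned comparison but may or may not change what the compiler generates.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper around the loop from the question.
// Assumption: b and c each hold at least a.size() elements.
void conditional_multiply(std::vector<float>& a,
                          const std::vector<float>& b,
                          const std::vector<float>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Keep the product only where b[i] > c[i]; otherwise store zero.
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0.0f;
    }
}
```

Compiling this translation unit with -O3 -mavx2 -mfma on both compilers should make it possible to compare the inner loops directly.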

Clang does an excellent job of vectorizing it. It uses 256-bit ymm registers and recognizes that vandps with the comparison mask is enough here, so no vblendps is needed:

    .LBB0_7:
            vcmpltps        ymm2, ymm1, ymm0
            vmulps  ymm0, ymm0, ymm1
            vandps  ymm0, ymm2, ymm0

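GCC, however, is much worse. For some reason it never gets to 256-bit vectors: the loop below touches only 128-bit xmm registers, and the instructions are the scalar single-element forms (the ss suffix), so it is not really vectorized at all. Adding -mprefer-vector-width=256 does not change anything.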

    .L6:
            vcomiss xmm0, xmm1
            vmulss  xmm0, xmm0, xmm1
            vmovss  DWORD PTR [rcx+rax*4], xmm0

If I replace it with plain arrays (as in the guideline), gcc does vectorize the loop using AVX ymm registers:

    int a[256], b[256], c[256];
    auto foo (int *a, int *b, int *c) {
      int i;
      for (i=0; i<256; i++){
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
      }
    }

However, I have not found a way to get the same result with a variable-length std::vector. What sort of hint does gcc need to vectorize the std::vector loop with AVX?
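Note that the plain-array test uses int with a fixed trip count, while the std::vector loop uses float with a run-time length, so the two tests differ in more than the container. A float, variable-length counterpart over raw pointers might help separate those variables. This is only a sketch of that experiment: the name foo_float and the length parameter n are hypothetical, and no result is claimed for it here.

```cpp
#include <cstddef>

// Hypothetical float counterpart of foo, with a run-time length n.
// Assumption: a, b and c each point to at least n elements.
void foo_float(float* a, const float* b, const float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // Same conditional multiply as the std::vector loop.
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0.0f;
    }
}
```

If gcc vectorizes foo but not foo_float, the element type or the unknown length would be the more likely culprit than std::vector itself.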

Source on Godbolt with gcc 13.1 and clang 14.0.0

