Consider the following float loop, compiled with -O3 -mavx2 -mfma:
for (auto i = 0; i < a.size(); ++i) { a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;}
Clang did a perfect job of vectorizing it. It uses 256-bit ymm registers and knows when to pick vandps over vblendps for the best possible performance.
.LBB0_7:
        vcmpltps ymm2, ymm1, ymm0
        vmulps ymm0, ymm0, ymm1
        vandps ymm0, ymm2, ymm0
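GCC, however, does much worse. For some reason it never gets beyond 128-bit xmm registers (-mprefer-vector-width=256 doesn't change anything), and the instructions it emits (vcomiss, vmulss, vmovss) are scalar rather than packed, so the loop is not really vectorized at all.
.L6:
        vcomiss xmm0, xmm1
        vmulss xmm0, xmm0, xmm1
        vmovss DWORD PTR [rcx+rax*4], xmm0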
If I replace the vectors with plain arrays (as in the guideline), GCC does vectorize the loop using AVX ymm registers:
int a[256], b[256], c[256];
auto foo (int *a, int *b, int *c) {
  int i;
  for (i=0; i<256; i++){
    a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
  }
}
However, I couldn't find a way to get the same result with a variable-length std::vector. What sort of hint does GCC need to vectorize a std::vector loop with AVX?
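For reference, here is a minimal self-contained sketch of the std::vector case that can be pasted into a compiler explorer. The function name `mul_if_greater` and its signature (three vectors of float passed by reference) are assumptions made for illustration, since only the loop itself is shown above. The loop body is unchanged.

```cpp
#include <vector>

// Hypothetical wrapper around the loop from the question.
// Assumption: a, b and c are std::vector<float>, and b and c
// are at least as long as a.
void mul_if_greater(std::vector<float>& a,
                    const std::vector<float>& b,
                    const std::vector<float>& c) {
    // Loop copied verbatim from the question. Note that `auto i = 0`
    // deduces int, so the comparison with a.size() mixes signed and
    // unsigned types.
    for (auto i = 0; i < a.size(); ++i) { a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0; }
}
```

Compiling this with the same flags (-O3 -mavx2 -mfma) should make it possible to compare the GCC and Clang output for the loop, assuming the wrapper matches the original code closely enough.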