I recently wrote some vector code and a corresponding godbolt example:
```c
typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));

v8f f(v8f x)
{
    return __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
}
```
```asm
f:
        vmovaps    ymm1, ymm0
        vxorps     xmm0, xmm0, xmm0
        vperm2f128 ymm0, ymm1, ymm0, 33
        vpalignr   ymm0, ymm0, ymm1, 4
        ret
```
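For context: in `__builtin_shuffle(x, (v8f){0}, mask)`, mask indices 0–7 select lanes from `x` and indices 8–15 select lanes from the zero vector, so the mask `{1, 2, 3, 4, 5, 6, 7, 8}` just shifts the lanes down by one and pulls a zero into the top lane. A scalar sketch of that semantics, reusing the `v8f` typedef above (this loop is purely illustrative, not what gcc emits):

```c
/* Illustrative scalar equivalent of the shuffle above: shift all
   lanes down by one and fill the top lane with 0. */
v8f f_scalar(v8f x)
{
    v8f r;
    for (int i = 0; i < 7; i++)
        r[i] = x[i + 1];  /* mask indices 1..7 read from x         */
    r[7] = 0.0f;          /* mask index 8 reads lane 0 of (v8f){0} */
    return r;
}
```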
I wanted to see how the different optimization settings (`-O0`/`-O1`/`-O2`/`-O3`) affected the code, and all but `-O0` gave identical output. `-O0` gave the predictable frame-pointer garbage, and also copied the argument `x` to a stack local variable for no good reason. To fix this, I added the `register` storage class specifier:
```c
typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));

v8f f(register v8f x)
{
    return __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
}
```
For `-O1`/`-O2`/`-O3`, the generated code is identical, but at `-O0`:
```asm
f:
        vxorps     xmm1, xmm1, xmm1
        vperm2f128 ymm1, ymm0, ymm1, 33
        vpalignr   ymm0, ymm1, ymm0, 4
        ret
```
`gcc` figured out how to avoid the redundant register copy. While such a copy might be move-eliminated by the CPU, it still increases code size for no benefit (`-Os` is bigger than `-O0`?).
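For completeness, here is a self-contained version of what I'm building; the `main` is only a sanity check I added, and it needs AVX2 enabled (e.g. `-mavx2`) since the shuffle lowers to `vperm2f128`/`vpalignr` on ymm registers:

```c
#include <stdio.h>

typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));

v8f f(register v8f x)
{
    return __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
}

/* Sanity check: feed in 1..8 and print the shifted result. */
int main(void)
{
    v8f x = {1, 2, 3, 4, 5, 6, 7, 8};
    v8f y = f(x);
    for (int i = 0; i < 8; i++)
        printf("%g ", y[i]);   /* should print: 2 3 4 5 6 7 8 0 */
    putchar('\n');
    return 0;
}
```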
How/why does `gcc` generate better code for this at `-O0` than at `-O3`?