gcc optimization better at -O0 than -O3


I recently wrote some vector code and a matching Godbolt example.

    typedef float v8f __attribute__((vector_size(32)));
    typedef unsigned v8u __attribute__((vector_size(32)));

    v8f f(v8f x)
    {
      return __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
    }

    f:
            vmovaps ymm1, ymm0
            vxorps  xmm0, xmm0, xmm0
            vperm2f128      ymm0, ymm1, ymm0, 33
            vpalignr        ymm0, ymm0, ymm1, 4
            ret
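
For anyone unfamiliar with the two-operand form of `__builtin_shuffle`: mask indices 0..7 select from the first vector and 8..15 from the second, so the mask `{1, ..., 8}` moves every element of `x` down one lane and pulls a zero into the top lane. Here is a rough sketch that checks this against a plain scalar loop. The names `shift_shuffle`, `shift_scalar` and `check_shuffle_matches_scalar` are made up for illustration, and the array-based interface is my own choice so that vectors are not passed by value; it assumes GCC, since `__builtin_shuffle` is a GCC extension.

```c
#include <assert.h>
#include <string.h>

typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));

/* Same shuffle as in f(), but loading from and storing to plain arrays. */
void shift_shuffle(const float in[8], float out[8])
{
    v8f x;
    memcpy(&x, in, sizeof x);
    /* Indices 1..7 pick x[1]..x[7]; index 8 picks element 0 of the zero vector. */
    v8f r = __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
    memcpy(out, &r, sizeof r);
}

/* Scalar reference: move every element down one lane, zero the last lane. */
void shift_scalar(const float in[8], float out[8])
{
    for (int i = 0; i < 7; ++i)
        out[i] = in[i + 1];
    out[7] = 0.0f;
}

/* Asserts that both versions agree lane by lane on one sample input. */
void check_shuffle_matches_scalar(void)
{
    const float in[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    float a[8], b[8];
    shift_shuffle(in, a);
    shift_scalar(in, b);
    for (int i = 0; i < 8; ++i)
        assert(a[i] == b[i]);
}
```

This only pins down what the function computes; it says nothing about which instructions gcc picks at a given optimization level.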

I wanted to see how the different optimization settings (-O0/-O1/-O2/-O3) affected the code, and all but -O0 gave identical code. -O0 gave the predictable frame-pointer garbage, and also copied the argument x to a stack local variable for no good reason. To fix this, I added the register storage class specifier:

    typedef float v8f __attribute__((vector_size(32)));
    typedef unsigned v8u __attribute__((vector_size(32)));

    v8f f(register v8f x)
    {
      return __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
    }

At -O1/-O2/-O3 the generated code is unchanged, but at -O0 I now get:

    f:
            vxorps  xmm1, xmm1, xmm1
            vperm2f128      ymm1, ymm0, ymm1, 33
            vpalignr        ymm0, ymm1, ymm0, 4
            ret

Here gcc avoids the redundant register copy (the vmovaps ymm1, ymm0 that the optimized builds emit). While such a copy might be move-eliminated by the CPU, it still increases code size for no benefit (so -Os output is bigger than -O0?).

How/why does gcc generate better code for this at -O0 than -O3?

