Consider this function:
    unsigned long f(unsigned long x) { return x / 7; }
With -O3, Clang turns the division into a multiplication, as expected:
    f:                                      # @f
            movabs  rcx, 2635249153387078803
            mov     rax, rdi
            mul     rcx
            sub     rdi, rdx
            shr     rdi
            lea     rax, [rdi + rdx]
            shr     rax, 2
            ret
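For reference, here is my reading of what that instruction sequence computes, written as a C sketch. The helper names `mulhi64` and `div7_magic` are mine, not anything the compilers emit, and the sketch assumes the `unsigned __int128` extension that GCC and Clang provide on x86-64, where `unsigned long` is 64 bits wide.

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product a * b.
   This mirrors what `mul` leaves in rdx. */
static uint64_t mulhi64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Division by 7 via the magic constant from the listing.
   The constant is the low 64 bits of ceil(2^67 / 7); the implicit
   65th bit is accounted for by the subtract/shift/add fixup. */
static uint64_t div7_magic(uint64_t x) {
    const uint64_t magic = 2635249153387078803ULL;
    uint64_t hi = mulhi64(x, magic);  /* mul               */
    uint64_t t  = (x - hi) >> 1;      /* sub rdi, rdx; shr rdi */
    return (t + hi) >> 2;             /* lea rax, [rdi + rdx]; shr rax, 2 */
}
```

Nothing in this computation cares which multiplicand sits in rax, since `mul` is commutative in its two inputs; only the register assignment differs between the two listings.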
GCC does basically the same thing, except that it uses rdx where Clang uses rcx. But they both appear to be doing an extra move. Why not this instead?
    f:
            movabs  rax, 2635249153387078803
            mul     rdi
            sub     rdi, rdx
            shr     rdi
            lea     rax, [rdi + rdx]
            shr     rax, 2
            ret
In particular, they both put the numerator in rax, but by putting the magic number there instead, you avoid having to move the numerator at all. If this is actually better, I'm surprised that neither GCC nor Clang does it this way, since it feels so obvious. Is there some microarchitectural reason that their way is actually faster than my way?