This is an example code from Computer Systems: A Programmer's Perspective.
short foo(short a, short b) { short result; result = b; while(b > 0) { result *= a; b -= a; } return result;}
Compiling this with gcc14.2
using -O1
and -O3
gives the following assembly (godbolt)
O1
foo: movl %esi, %eax movl %esi, %ecx testw %si, %si jle .L1.L3: imull %edi, %eax subl %edi, %ecx testw %cx, %cx jg .L3.L1: ret
O3
foo: movl %esi, %eax movl %esi, %edx testw %si, %si jle .L1.L3: subl %edi, %edx imull %edi, %eax testw %dx, %dx jg .L3.L1: ret
With O3
, sub
comes before imul
- why is this reordering helpful?
- I am assuming this has to do with instruction-level parallelism, but cannot get the exact reason. Is it that performing subtraction first will allow parts of multiplication to also get done with some idle ALUs?
- With respect to pipelining, is there any difference due to this reordering?