When ARM gcc 9.2.1 is given command line options -O3 -xc++ -mcpu=cortex-m0 [compile as C++] and the following code:
    unsigned short adjust(unsigned short *p)
    {
        unsigned short temp = *p;
        temp -= temp>>15;
        return temp;
    }
it produces the reasonable machine code:
    ldrh r0, [r0]
    lsrs r3, r0, #15
    subs r0, r0, r3
    uxth r0, r0
    bx lr
which is equivalent to:
    unsigned short adjust(unsigned short *p)
    {
        unsigned r0,r3;
        r0 = *p;
        r3 = r0 >> 15;
        r0 -= r3;
        r0 &= 0xFFFFu;  // Returning an unsigned short requires...
        return r0;      // computing a 32-bit unsigned value 0-65535.
    }
Very reasonable. The last "uxth" could actually be omitted in this particular case, but it's better for a compiler that can't prove the safety of such an optimization to err on the side of caution than to risk returning a value outside the range 0-65535, which could break downstream code.
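The claim that the final mask is redundant in this case can be checked exhaustively. Here is a minimal sketch in C, assuming unsigned is at least 32 bits wide as on the Cortex-M0; the name count_out_of_range is made up for illustration.

```c
#include <assert.h>

/* Counts the 16-bit inputs for which v - (v >> 15), computed in plain
   unsigned arithmetic WITHOUT the final mask, falls outside 0..65535. */
unsigned count_out_of_range(void)
{
    unsigned bad = 0;
    for (unsigned v = 0; v <= 0xFFFFu; v++) {
        unsigned r3 = v >> 15;   /* top bit of the 16-bit value */
        assert(r3 <= 1u);        /* r3 is 1 only when v >= 0x8000 */
        unsigned r0 = v - r3;    /* no uxth / mask applied here */
        if (r0 > 0xFFFFu)
            bad++;
    }
    return bad;
}
```

Since r3 can only be 1 when v is at least 0x8000, the subtraction never wraps below zero, so the function is expected to return 0.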
When using -O3 -xc -mcpu=cortex-m0 [identical options, except compiling as C rather than C++], however, the code changes:
    ldrh r3, [r0]
    movs r2, #0
    ldrsh r0, [r0, r2]
    asrs r0, r0, #15
    adds r0, r0, r3
    uxth r0, r0
    bx lr
which is equivalent to:
    unsigned short adjust(unsigned short *p)
    {
        unsigned r0,r2,r3;
        r3 = *p;
        r2 = 0;
        r0 = ((short*)p)[r2];   // ldrsh: sign-extending reload of *p
        r0 = ((int)r0) >> 15;   // Effectively computes -((*p)>>15) with redundant load
        r0 += r3;
        r0 &= 0xFFFFu;  // Returning an unsigned short requires...
        return r0;      // computing a 32-bit unsigned value 0-65535.
    }
I know that the defined corner cases for left-shift are different in C and C++, but I thought right shifts were the same. Is there something different about the way right shifts work in C and C++ that would cause the compiler to use different code to process them?
Versions prior to 9.2.1 generate slightly less bad code in C mode:
    ldrh r3, [r0]
    sxth r0, r3
    asrs r0, r0, #15
    adds r0, r0, r3
    uxth r0, r0
    bx lr
equivalent to:
    unsigned short adjust(unsigned short *p)
    {
        unsigned r0,r3;
        r3 = *p;
        r0 = (short)r3;
        r0 = ((int)r0) >> 15;   // Effectively computes -(temp>>15)
        r0 += r3;
        r0 &= 0xFFFFu;  // Returning an unsigned short requires...
        return r0;      // computing a 32-bit unsigned value 0-65535.
    }
Not as bad as the 9.2.1 version, but still an instruction longer than a straightforward translation of the code would have been. When using 9.2.1, declaring the argument as "unsigned short volatile *p" would eliminate the redundant load of *p, but I'm curious why gcc 9.2.1 would need a volatile qualifier to help it avoid the redundant load, or why such a bizarre "optimization" only happens in C mode and not C++ mode. I'm also somewhat curious why gcc would even consider adding ((short)temp) >> 15 instead of subtracting temp >> 15. Is there some stage in the optimization where that would seem to make sense?
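For reference, the two forms can be compared directly over every 16-bit input. This is a minimal sketch, assuming that converting an out-of-range value to short wraps and that right-shifting a negative int is an arithmetic shift; both are implementation-defined in C, and gcc documents that behaviour. The names adjust_sub, adjust_add, and count_mismatches are made up for illustration.

```c
#include <assert.h>

/* Straightforward form, as written in the source: subtract the top bit. */
unsigned short adjust_sub(unsigned short v)
{
    v -= v >> 15;
    return v;
}

/* Form emitted in C mode: add the sign-extended value shifted right.
   ASSUMPTION: (short)v wraps and >> on a negative int is arithmetic. */
unsigned short adjust_add(unsigned short v)
{
    int sign = (short)v >> 15;          /* 0 or -1 under the assumptions */
    assert(sign == 0 || sign == -1);    /* fails if the assumptions do not hold */
    return (unsigned short)(v + sign);
}

/* Counts the 16-bit inputs on which the two forms disagree. */
unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned v = 0; v <= 0xFFFFu; v++) {
        if (adjust_sub((unsigned short)v) != adjust_add((unsigned short)v))
            bad++;
    }
    return bad;
}
```

Under those assumptions count_mismatches() should return 0, so the rewrite preserves the result; what remains unexplained is why it would ever look cheaper than the subtraction.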