Look at this code:
one.cpp:
    bool test(int a, int b, int c, int d);

    int main() {
        volatile int va = 1;
        volatile int vb = 2;
        volatile int vc = 3;
        volatile int vd = 4;

        int a = va;
        int b = vb;
        int c = vc;
        int d = vd;

        int s = 0;

        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");

        for (int i=0; i<2000000000; i++) {
            s += test(a, b, c, d);
        }

        return s;
    }
two.cpp:
    bool test(int a, int b, int c, int d) {
        // return a == d || b == d || c == d;
        return false;
    }
There are 16 nops in one.cpp. You can comment/uncomment them to change the alignment of the loop's entry point between 16 and 32 bytes. I've compiled them with g++ one.cpp two.cpp -O3 -mtune=native.
Here are my questions:
- The 32-aligned version is faster than the 16-aligned version. On Sandy Bridge, the difference is 20%; on Haswell, 8%. Why is there a difference?
- With the 32-aligned version, the code runs at the same speed on Sandy Bridge, no matter which return statement is in two.cpp. I thought the return false version would be at least a little faster. But no, exactly the same speed!
- If I remove the volatiles from one.cpp, the code becomes slower (Haswell: before: ~2.17 sec, after: ~2.38 sec). Why is that? This only happens when the loop is aligned to 32.
The fact that the 32-aligned version is faster is strange to me, because the Intel® 64 and IA-32 Architectures Optimization Reference Manual says (page 3-9):
Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.
Another little question: are there any tricks to make only this loop 32-aligned (so the rest of the code could keep using 16-byte alignment)?
Note: I've tried compilers gcc 6, gcc 7 and clang 3.9, same results.
Here's the code with volatile (the code is the same for the 16- and 32-aligned versions; only the addresses differ):
    0000000000000560 <main>:
     560: 41 57                   push   r15
     562: 41 56                   push   r14
     564: 41 55                   push   r13
     566: 41 54                   push   r12
     568: 55                      push   rbp
     569: 31 ed                   xor    ebp,ebp
     56b: 53                      push   rbx
     56c: bb 00 94 35 77          mov    ebx,0x77359400
     571: 48 83 ec 18             sub    rsp,0x18
     575: c7 04 24 01 00 00 00    mov    DWORD PTR [rsp],0x1
     57c: c7 44 24 04 02 00 00    mov    DWORD PTR [rsp+0x4],0x2
     583: 00
     584: c7 44 24 08 03 00 00    mov    DWORD PTR [rsp+0x8],0x3
     58b: 00
     58c: c7 44 24 0c 04 00 00    mov    DWORD PTR [rsp+0xc],0x4
     593: 00
     594: 44 8b 3c 24             mov    r15d,DWORD PTR [rsp]
     598: 44 8b 74 24 04          mov    r14d,DWORD PTR [rsp+0x4]
     59d: 44 8b 6c 24 08          mov    r13d,DWORD PTR [rsp+0x8]
     5a2: 44 8b 64 24 0c          mov    r12d,DWORD PTR [rsp+0xc]
     5a7: 0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
     5ac: 66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
     5b3: 00 00 00
     5b6: 66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
     5bd: 00 00 00
     5c0: 44 89 e1                mov    ecx,r12d
     5c3: 44 89 ea                mov    edx,r13d
     5c6: 44 89 f6                mov    esi,r14d
     5c9: 44 89 ff                mov    edi,r15d
     5cc: e8 4f 01 00 00          call   720 <test(int, int, int, int)>
     5d1: 0f b6 c0                movzx  eax,al
     5d4: 01 c5                   add    ebp,eax
     5d6: 83 eb 01                sub    ebx,0x1
     5d9: 75 e5                   jne    5c0 <main+0x60>
     5db: 48 83 c4 18             add    rsp,0x18
     5df: 89 e8                   mov    eax,ebp
     5e1: 5b                      pop    rbx
     5e2: 5d                      pop    rbp
     5e3: 41 5c                   pop    r12
     5e5: 41 5d                   pop    r13
     5e7: 41 5e                   pop    r14
     5e9: 41 5f                   pop    r15
     5eb: c3                      ret
     5ec: 0f 1f 40 00             nop    DWORD PTR [rax+0x0]
Without volatile:
    0000000000000560 <main>:
     560: 55                      push   rbp
     561: 31 ed                   xor    ebp,ebp
     563: 53                      push   rbx
     564: bb 00 94 35 77          mov    ebx,0x77359400
     569: 48 83 ec 08             sub    rsp,0x8
     56d: 66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
     574: 00 00
     576: 66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
     57d: 00 00 00
     580: b9 04 00 00 00          mov    ecx,0x4
     585: ba 03 00 00 00          mov    edx,0x3
     58a: be 02 00 00 00          mov    esi,0x2
     58f: bf 01 00 00 00          mov    edi,0x1
     594: e8 47 01 00 00          call   6e0 <test(int, int, int, int)>
     599: 0f b6 c0                movzx  eax,al
     59c: 01 c5                   add    ebp,eax
     59e: 83 eb 01                sub    ebx,0x1
     5a1: 75 dd                   jne    580 <main+0x20>
     5a3: 48 83 c4 08             add    rsp,0x8
     5a7: 89 e8                   mov    eax,ebp
     5a9: 5b                      pop    rbx
     5aa: 5d                      pop    rbp
     5ab: c3                      ret
     5ac: 0f 1f 40 00             nop    DWORD PTR [rax+0x0]