Quantcast
Channel: Active questions tagged gcc - Stack Overflow
Viewing all articles
Browse latest Browse all 22202

Why does loop alignment on 32 byte make code faster?

$
0
0

Look at this code:

one.cpp:

bool test(int a, int b, int c, int d);int main() {        volatile int va = 1;        volatile int vb = 2;        volatile int vc = 3;        volatile int vd = 4;        int a = va;        int b = vb;        int c = vc;        int d = vd;        int s = 0;        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");        for (int i=0; i<2000000000; i++) {                s += test(a, b, c, d);        }        return s;}

two.cpp:

bool test(int a, int b, int c, int d) {        // return a == d || b == d || c == d;        return false;}

There are 16 nops in one.cpp. You can comment/decomment them to change alignment of the loop's entry point between 16 and 32. I've compiled them with g++ one.cpp two.cpp -O3 -mtune=native.

Here are my questions:

  1. the 32-aligned version is faster than the 16-aligned version. On Sandy Bridge, the difference is 20%; on Haswell, 8%. Why is the difference?
  2. with the 32-aligned version, the code runs the same speed on Sandy Bridge, doesn't matter which return statement is in two.cpp. I thought the return false version should be faster at least a little bit. But no, exactly the same speed!
  3. If I remove volatiles from one.cpp, code becomes slower (Haswell: before: ~2.17 sec, after: ~2.38 sec). Why is that? But this only happens, when the loop aligned to 32.

The fact that 32-aligned version is faster, is strange to me, because Intel® 64 and IA-32 ArchitecturesOptimization Reference Manual says (page 3-9):

Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16- byte aligned.

Another little question: is there any tricks to make only this loop 32-aligned (so rest of the code could keep using 16-byte alignment)?

Note: I've tried compilers gcc 6, gcc 7 and clang 3.9, same results.


Here's the code with volatile (the code is the same for 16/32 aligned, just the address differ):

0000000000000560 <main>: 560:   41 57                   push   r15 562:   41 56                   push   r14 564:   41 55                   push   r13 566:   41 54                   push   r12 568:   55                      push   rbp 569:   31 ed                   xor    ebp,ebp 56b:   53                      push   rbx 56c:   bb 00 94 35 77          mov    ebx,0x77359400 571:   48 83 ec 18             sub    rsp,0x18 575:   c7 04 24 01 00 00 00    mov    DWORD PTR [rsp],0x1 57c:   c7 44 24 04 02 00 00    mov    DWORD PTR [rsp+0x4],0x2 583:   00  584:   c7 44 24 08 03 00 00    mov    DWORD PTR [rsp+0x8],0x3 58b:   00  58c:   c7 44 24 0c 04 00 00    mov    DWORD PTR [rsp+0xc],0x4 593:   00  594:   44 8b 3c 24             mov    r15d,DWORD PTR [rsp] 598:   44 8b 74 24 04          mov    r14d,DWORD PTR [rsp+0x4] 59d:   44 8b 6c 24 08          mov    r13d,DWORD PTR [rsp+0x8] 5a2:   44 8b 64 24 0c          mov    r12d,DWORD PTR [rsp+0xc] 5a7:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0] 5ac:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0] 5b3:   00 00 00  5b6:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0] 5bd:   00 00 00  5c0:   44 89 e1                mov    ecx,r12d 5c3:   44 89 ea                mov    edx,r13d 5c6:   44 89 f6                mov    esi,r14d 5c9:   44 89 ff                mov    edi,r15d 5cc:   e8 4f 01 00 00          call   720 <test(int, int, int, int)> 5d1:   0f b6 c0                movzx  eax,al 5d4:   01 c5                   add    ebp,eax 5d6:   83 eb 01                sub    ebx,0x1 5d9:   75 e5                   jne    5c0 <main+0x60> 5db:   48 83 c4 18             add    rsp,0x18 5df:   89 e8                   mov    eax,ebp 5e1:   5b                      pop    rbx 5e2:   5d                      pop    rbp 5e3:   41 5c                   pop    r12 5e5:   41 5d                   pop    r13 5e7:   41 5e                   pop    r14 5e9:   41 5f                   pop    r15 5eb:   c3                      ret     5ec:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

Without volatile:

0000000000000560 <main>: 560:   55                      push   rbp 561:   31 ed                   xor    ebp,ebp 563:   53                      push   rbx 564:   bb 00 94 35 77          mov    ebx,0x77359400 569:   48 83 ec 08             sub    rsp,0x8 56d:   66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0] 574:   00 00  576:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0] 57d:   00 00 00  580:   b9 04 00 00 00          mov    ecx,0x4 585:   ba 03 00 00 00          mov    edx,0x3 58a:   be 02 00 00 00          mov    esi,0x2 58f:   bf 01 00 00 00          mov    edi,0x1 594:   e8 47 01 00 00          call   6e0 <test(int, int, int, int)> 599:   0f b6 c0                movzx  eax,al 59c:   01 c5                   add    ebp,eax 59e:   83 eb 01                sub    ebx,0x1 5a1:   75 dd                   jne    580 <main+0x20> 5a3:   48 83 c4 08             add    rsp,0x8 5a7:   89 e8                   mov    eax,ebp 5a9:   5b                      pop    rbx 5aa:   5d                      pop    rbp 5ab:   c3                      ret     5ac:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

Viewing all articles
Browse latest Browse all 22202

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>