Why does aligning a loop to 32 bytes make code faster?

Look at this code:

one.cpp:

    bool test(int a, int b, int c, int d);

    int main() {
        volatile int va = 1;
        volatile int vb = 2;
        volatile int vc = 3;
        volatile int vd = 4;
        int a = va;
        int b = vb;
        int c = vc;
        int d = vd;
        int s = 0;
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        for (int i=0; i<2000000000; i++) {
            s += test(a, b, c, d);
        }
        return s;
    }

two.cpp:

    bool test(int a, int b, int c, int d) {
        // return a == d || b == d || c == d;
        return false;
    }

There are 16 nops in one.cpp. You can comment them out or uncomment them to switch the alignment of the loop's entry point between 16 and 32 bytes. I compiled the files with g++ one.cpp two.cpp -O3 -mtune=native.

Here are my questions:

1. The 32-aligned version is faster than the 16-aligned version: by 20% on Sandy Bridge and by 8% on Haswell. Why is there a difference?
2. With the 32-aligned version, the code runs at the same speed on Sandy Bridge no matter which return statement is used in two.cpp. I expected the return false version to be at least a little faster, but it runs at exactly the same speed!
3. If I remove the volatiles from one.cpp, the code becomes slower (Haswell: ~2.17 sec before, ~2.38 sec after). Why is that? This only happens when the loop is aligned to 32 bytes.

The fact that the 32-aligned version is faster seems strange to me, because the Intel® 64 and IA-32 Architectures Optimization Reference Manual says (page 3-9):

Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.

Another small question: are there any tricks to make only this loop 32-byte aligned (so that the rest of the code can keep using 16-byte alignment)?

Note: I've tried gcc 6, gcc 7 and clang 3.9, with the same results.

Here's the code with volatile (the code is the same for the 16- and 32-aligned builds; only the addresses differ):

    0000000000000560 <main>:
     560:   41 57                   push   r15
     562:   41 56                   push   r14
     564:   41 55                   push   r13
     566:   41 54                   push   r12
     568:   55                      push   rbp
     569:   31 ed                   xor    ebp,ebp
     56b:   53                      push   rbx
     56c:   bb 00 94 35 77          mov    ebx,0x77359400
     571:   48 83 ec 18             sub    rsp,0x18
     575:   c7 04 24 01 00 00 00    mov    DWORD PTR [rsp],0x1
     57c:   c7 44 24 04 02 00 00    mov    DWORD PTR [rsp+0x4],0x2
     583:   00
     584:   c7 44 24 08 03 00 00    mov    DWORD PTR [rsp+0x8],0x3
     58b:   00
     58c:   c7 44 24 0c 04 00 00    mov    DWORD PTR [rsp+0xc],0x4
     593:   00
     594:   44 8b 3c 24             mov    r15d,DWORD PTR [rsp]
     598:   44 8b 74 24 04          mov    r14d,DWORD PTR [rsp+0x4]
     59d:   44 8b 6c 24 08          mov    r13d,DWORD PTR [rsp+0x8]
     5a2:   44 8b 64 24 0c          mov    r12d,DWORD PTR [rsp+0xc]
     5a7:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
     5ac:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
     5b3:   00 00 00
     5b6:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
     5bd:   00 00 00
     5c0:   44 89 e1                mov    ecx,r12d
     5c3:   44 89 ea                mov    edx,r13d
     5c6:   44 89 f6                mov    esi,r14d
     5c9:   44 89 ff                mov    edi,r15d
     5cc:   e8 4f 01 00 00          call   720 <test(int, int, int, int)>
     5d1:   0f b6 c0                movzx  eax,al
     5d4:   01 c5                   add    ebp,eax
     5d6:   83 eb 01                sub    ebx,0x1
     5d9:   75 e5                   jne    5c0 <main+0x60>
     5db:   48 83 c4 18             add    rsp,0x18
     5df:   89 e8                   mov    eax,ebp
     5e1:   5b                      pop    rbx
     5e2:   5d                      pop    rbp
     5e3:   41 5c                   pop    r12
     5e5:   41 5d                   pop    r13
     5e7:   41 5e                   pop    r14
     5e9:   41 5f                   pop    r15
     5eb:   c3                      ret
     5ec:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

Without volatile:

    0000000000000560 <main>:
     560:   55                      push   rbp
     561:   31 ed                   xor    ebp,ebp
     563:   53                      push   rbx
     564:   bb 00 94 35 77          mov    ebx,0x77359400
     569:   48 83 ec 08             sub    rsp,0x8
     56d:   66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
     574:   00 00
     576:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
     57d:   00 00 00
     580:   b9 04 00 00 00          mov    ecx,0x4
     585:   ba 03 00 00 00          mov    edx,0x3
     58a:   be 02 00 00 00          mov    esi,0x2
     58f:   bf 01 00 00 00          mov    edi,0x1
     594:   e8 47 01 00 00          call   6e0 <test(int, int, int, int)>
     599:   0f b6 c0                movzx  eax,al
     59c:   01 c5                   add    ebp,eax
     59e:   83 eb 01                sub    ebx,0x1
     5a1:   75 dd                   jne    580 <main+0x20>
     5a3:   48 83 c4 08             add    rsp,0x8
     5a7:   89 e8                   mov    eax,ebp
     5a9:   5b                      pop    rbx
     5aa:   5d                      pop    rbp
     5ab:   c3                      ret
     5ac:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]
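To make the 16-versus-32 distinction concrete, here is a small sketch that checks the loop entry addresses from the two listings above (the jne targets, 0x5c0 and 0x580) at compile time. The helper is_aligned and the constant names are mine, not part of the original code, and 0x5d0 is a hypothetical address chosen only to show what "16-aligned but not 32-aligned" looks like; it is not taken from a real build.

```cpp
#include <cstdint>

// Hypothetical helper: true when addr is a multiple of alignment.
// alignment is assumed to be a power of two.
constexpr bool is_aligned(std::uint64_t addr, std::uint64_t alignment) {
    return (addr & (alignment - 1)) == 0;
}

// Loop entry points read off the jne targets in the two listings.
constexpr std::uint64_t kLoopWithVolatile = 0x5c0;
constexpr std::uint64_t kLoopWithoutVolatile = 0x580;

// Hypothetical address (not from a real build): 16-aligned only.
constexpr std::uint64_t kExample16Only = 0x5d0;

// Both listed loops start on a 32-byte boundary.
static_assert(is_aligned(kLoopWithVolatile, 32), "0x5c0 is 32-aligned");
static_assert(is_aligned(kLoopWithoutVolatile, 32), "0x580 is 32-aligned");

// A 16-aligned address is not necessarily 32-aligned.
static_assert(is_aligned(kExample16Only, 16), "0x5d0 is 16-aligned");
static_assert(!is_aligned(kExample16Only, 32), "0x5d0 is not 32-aligned");

// The constant loaded into ebx is the loop count from one.cpp.
static_assert(0x77359400 == 2000000000, "loop counter matches the source");
```

If the file compiles, all the checks hold. The same test applied to the loop address in a 16-aligned build would show which case that binary falls into.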
