This question was prompted by the recent question "Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?". Not having delved into SSE intrinsics before, I wondered how best to structure the loads, adds and stores for summing larger arrays, as several naive approaches seemed to be available.
Take the same arrays of floats from the linked question above, but longer, so that the loads and adds must continually update the solution array to obtain the final sums.
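In scalar terms (my own restatement, assuming the array length n is a multiple of 4), the computation is just a per-lane running sum over 4-element chunks:

/* scalar reference: c[j] accumulates lane j of every 4-float chunk */
void compute_ref (const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i % 4] += a[i] + b[i];
}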
First Thought On Extending
My first thought on extending this to get a final sum for a larger array was simply to create a temporary array of 4 floats to hold the sum of each 128-bit operation, and then add that to the current solution array, keeping a running total within a loop in the caller, e.g.
#include <stdio.h>
#include <xmmintrin.h>

void compute (const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);

    _mm_storeu_ps(c, vc);
}

int main (void) {

    float a[] = { 1.1, 2.2, 3.3, 4.4, 1.1, 2.2, 3.3, 4.4 },
          b[] = { 1.1, 2.2, 3.3, 4.4, 1.1, 2.2, 3.3, 4.4 },
          c[4] = { 0 };

    for (int i = 0; i < 8; i += 4) {
        float tmp[4] = { 0 };
        compute (a + i, b + i, tmp);    /* chunk sum into tmp */
        compute (c, tmp, c);            /* fold tmp into running total */
    }

    for (int i = 0; i < 4; i++) {
        printf ("c[%d]: %5.2f\n", i, c[i]);
    }
}
My concern was that this added the overhead of 4 calls to the compute() function (two per chunk), so a second naive approach was to extend compute() to handle the update of c, the solution array, within the function itself, which led to my:
Second Thought On Extending
#include <stdio.h>
#include <xmmintrin.h>

void compute (const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vt = _mm_add_ps(va, vb);
    __m128 vc = _mm_loadu_ps(c);        /* load current running total */
    __m128 vr = _mm_add_ps(vt, vc);     /* fold chunk sum into total */

    _mm_storeu_ps(c, vr);
}

int main (void) {

    float a[] = { 1.1, 2.2, 3.3, 4.4, 1.1, 2.2, 3.3, 4.4 },
          b[] = { 1.1, 2.2, 3.3, 4.4, 1.1, 2.2, 3.3, 4.4 },
          c[4] = { 0 };

    for (int i = 0; i < 8; i += 4) {
        compute (a + i, b + i, c);
    }

    for (int i = 0; i < 4; i++) {
        printf ("c[%d]: %5.2f\n", i, c[i]);
    }
}
Which raised the question, "Do the changes even matter?" Surprisingly (or maybe not, for those more experienced with vectorization), the assembly produced for both was virtually identical.
Assembly for the 1st Attempt
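(Both listings were generated with GCC 14.2.0, per the .ident line, with optimization enabled and Intel-syntax output via -S -masm=intel.)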
.file "sse-xmm-test-2.c" .intel_syntax noprefix .text .p2align 4 .globl compute .type compute, @functioncompute:.LFB529: .cfi_startproc vmovups xmm0, XMMWORD PTR [rsi] vaddps xmm0, xmm0, XMMWORD PTR [rdi] vmovups XMMWORD PTR [rdx], xmm0 ret .cfi_endproc.LFE529: .size compute, .-compute .section .rodata.str1.1,"aMS",@progbits,1.LC1: .string "c[%d]: %5.2f\n" .section .text.startup,"ax",@progbits .p2align 4 .globl main .type main, @functionmain:.LFB530: .cfi_startproc push rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 push rbx .cfi_def_cfa_offset 24 .cfi_offset 3, -24 xor ebx, ebx sub rsp, 24 .cfi_def_cfa_offset 48 vmovaps xmm0, XMMWORD PTR .LC0[rip] mov rbp, rsp vmovaps XMMWORD PTR [rsp], xmm0.L4: mov esi, ebx vxorpd xmm1, xmm1, xmm1 mov edi, OFFSET FLAT:.LC1 vcvtss2sd xmm0, xmm1, DWORD PTR [rbp+0+rbx*4] mov eax, 1 add rbx, 1 call printf cmp rbx, 4 jne .L4 add rsp, 24 .cfi_def_cfa_offset 24 xor eax, eax pop rbx .cfi_def_cfa_offset 16 pop rbp .cfi_def_cfa_offset 8 ret .cfi_endproc.LFE530: .size main, .-main .section .rodata.cst16,"aM",@progbits,16 .align 16.LC0: .long 1082969293 .long 1091357901 .long 1095971635 .long 1099746509 .ident "GCC: (SUSE Linux) 14.2.0" .section .note.GNU-stack,"",@progbits
And the virtually identical assembly for the second approach, which adds one additional vaddps for the running sum in the compute() function but is otherwise the same:
.file "sse-xmm-test-3.c" .intel_syntax noprefix .text .p2align 4 .globl compute .type compute, @functioncompute:.LFB529: .cfi_startproc vmovups xmm0, XMMWORD PTR [rdx] vaddps xmm0, xmm0, XMMWORD PTR [rsi] vaddps xmm0, xmm0, XMMWORD PTR [rdi] vmovups XMMWORD PTR [rdx], xmm0 ret .cfi_endproc.LFE529: .size compute, .-compute .section .rodata.str1.1,"aMS",@progbits,1.LC1: .string "c[%d]: %5.2f\n" .section .text.startup,"ax",@progbits .p2align 4 .globl main .type main, @functionmain:.LFB530: .cfi_startproc push rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 push rbx .cfi_def_cfa_offset 24 .cfi_offset 3, -24 xor ebx, ebx sub rsp, 24 .cfi_def_cfa_offset 48 vmovaps xmm0, XMMWORD PTR .LC0[rip] mov rbp, rsp vmovaps XMMWORD PTR [rsp], xmm0.L4: mov esi, ebx vxorpd xmm1, xmm1, xmm1 mov edi, OFFSET FLAT:.LC1 vcvtss2sd xmm0, xmm1, DWORD PTR [rbp+0+rbx*4] mov eax, 1 add rbx, 1 call printf cmp rbx, 4 jne .L4 add rsp, 24 .cfi_def_cfa_offset 24 xor eax, eax pop rbx .cfi_def_cfa_offset 16 pop rbp .cfi_def_cfa_offset 8 ret .cfi_endproc.LFE530: .size main, .-main .section .rodata.cst16,"aM",@progbits,16 .align 16.LC0: .long 1082969293 .long 1091357901 .long 1095971635 .long 1099746509 .ident "GCC: (SUSE Linux) 14.2.0" .section .note.GNU-stack,"",@progbits
So other than the second approach adding a vaddps using rsi in the compute() function (with the initial load of c now coming through rdx), the rest is optimized identically. (Note that in both cases the compiler has also constant-folded the whole computation in main(): the .LC0 constants are the final sums, so main() never even calls compute().)
What I hoped to find was that the compiler could optimize one approach better than the other, but it appears to be a wash.
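For comparison, a third variant I sketched but have not benchmarked (the sum_chunks helper name is mine, and this is a sketch only, again assuming n is a multiple of 4) keeps the running total in an __m128 register across the whole loop and stores to c once at the end, avoiding the per-chunk load and store of c:

#include <xmmintrin.h>

/* sketch: accumulate in a register across the loop, store once at the end */
void sum_chunks (const float *a, const float *b, float *c, int n)
{
    __m128 vsum = _mm_setzero_ps();

    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        vsum = _mm_add_ps(vsum, _mm_add_ps(va, vb));
    }

    _mm_storeu_ps(c, vsum);   /* one store instead of one per chunk */
}

Whether the compiler would perform this hoisting on its own once compute() is inlined, I don't know.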
Beyond comparing the assembly produced, is there any general principle for handling the summing of SSE vectors that would favor one approach over the other?