This question was prompted by the recent question "Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?". Not having delved into SSE intrinsics before, I wondered how best to structure the loads, adds and stores for summing larger arrays, as several naive approaches seemed to be available.
Take the same arrays of floats from the linked question above, but longer, so that the loads and adds must continually update the solution array to obtain the final sums.
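In scalar terms (my own restatement, assuming the array length n is a multiple of 4), the computation is just a per-lane running sum over 4-element chunks:

/* scalar reference: c[j] accumulates lane j of every 4-float chunk */
void compute_ref (const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i % 4] += a[i] + b[i];
}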
First Thought On Extending
My first thought on extending this to get a final sum for a larger array was simply to create a temporary array of 4 floats to hold the sum of each 128-bit operation, and then add that to the current solution array, keeping a running total within a loop in the caller, e.g.
#include <stdio.h>
#include <xmmintrin.h>

void compute (const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);

    _mm_storeu_ps(c, vc);
}

int main (void) {

    float a[] = { 1.1, 2.2, 3.3, 4.4, 1.1, 2.2, 3.3, 4.4 },
          b[] = { 1.1, 2.2, 3.3, 4.4, 1.1, 2.2, 3.3, 4.4 },
          c[4] = { 0 };

    for (int i = 0; i < 8; i += 4) {
        float tmp[4] = { 0 };
        compute (a + i, b + i, tmp);    /* chunk sum into tmp */
        compute (c, tmp, c);            /* fold tmp into running total */
    }

    for (int i = 0; i < 4; i++) {
        printf ("c[%d]: %5.2f\n", i, c[i]);
    }
}
My concern was that this added the overhead of 4 calls to the compute() function (two per chunk), so a second naive approach was to extend compute() to handle the update of c, the solution array, within the function itself, which led to my:
Second Thought On Extending
#include <stdio.h>
#include <xmmintrin.h>

void compute (const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vt = _mm_add_ps(va, vb);
    __m128 vc = _mm_loadu_ps(c);        /* load current running total */
    __m128 vr = _mm_add_ps(vt, vc);     /* fold chunk sum into total */

    _mm_storeu_ps(c, vr);
}

int main (void) {

    float a[] = { 1.1, 2.2, 3.3, 4.4, 1.1, 2.2, 3.3, 4.4 },
          b[] = { 1.1, 2.2, 3.3, 4.4, 1.1, 2.2, 3.3, 4.4 },
          c[4] = { 0 };

    for (int i = 0; i < 8; i += 4) {
        compute (a + i, b + i, c);
    }

    for (int i = 0; i < 4; i++) {
        printf ("c[%d]: %5.2f\n", i, c[i]);
    }
}
Which raised the question, "Do the changes even matter?" Surprisingly (or maybe not, for those more experienced with vectorization), the assembly produced for both was virtually identical.
Assembly for the 1st Attempt
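(Both listings were generated with GCC 14.2.0, per the .ident line, with optimization enabled and Intel-syntax output via -S -masm=intel.)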
.file "sse-xmm-test-2.c" .intel_syntax noprefix .text .p2align 4 .globl compute .type compute, @functioncompute:.LFB529: .cfi_startproc vmovups xmm0, XMMWORD PTR [rsi] vaddps xmm0, xmm0, XMMWORD PTR [rdi] vmovups XMMWORD PTR [rdx], xmm0 ret .cfi_endproc.LFE529: .size compute, .-compute .section .rodata.str1.1,"aMS",@progbits,1.LC1: .string "c[%d]: %5.2f\n" .section .text.startup,"ax",@progbits .p2align 4 .globl main .type main, @functionmain:.LFB530: .cfi_startproc push rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 push rbx .cfi_def_cfa_offset 24 .cfi_offset 3, -24 xor ebx, ebx sub rsp, 24 .cfi_def_cfa_offset 48 vmovaps xmm0, XMMWORD PTR .LC0[rip] mov rbp, rsp vmovaps XMMWORD PTR [rsp], xmm0.L4: mov esi, ebx vxorpd xmm1, xmm1, xmm1 mov edi, OFFSET FLAT:.LC1 vcvtss2sd xmm0, xmm1, DWORD PTR [rbp+0+rbx*4] mov eax, 1 add rbx, 1 call printf cmp rbx, 4 jne .L4 add rsp, 24 .cfi_def_cfa_offset 24 xor eax, eax pop rbx .cfi_def_cfa_offset 16 pop rbp .cfi_def_cfa_offset 8 ret .cfi_endproc.LFE530: .size main, .-main .section .rodata.cst16,"aM",@progbits,16 .align 16.LC0: .long 1082969293 .long 1091357901 .long 1095971635 .long 1099746509 .ident "GCC: (SUSE Linux) 14.2.0" .section .note.GNU-stack,"",@progbits
And the virtually identical assembly for the second approach, which adds one additional vaddps for the running sum in the compute() function but is otherwise the same:
.file "sse-xmm-test-3.c" .intel_syntax noprefix .text .p2align 4 .globl compute .type compute, @functioncompute:.LFB529: .cfi_startproc vmovups xmm0, XMMWORD PTR [rdx] vaddps xmm0, xmm0, XMMWORD PTR [rsi] vaddps xmm0, xmm0, XMMWORD PTR [rdi] vmovups XMMWORD PTR [rdx], xmm0 ret .cfi_endproc.LFE529: .size compute, .-compute .section .rodata.str1.1,"aMS",@progbits,1.LC1: .string "c[%d]: %5.2f\n" .section .text.startup,"ax",@progbits .p2align 4 .globl main .type main, @functionmain:.LFB530: .cfi_startproc push rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 push rbx .cfi_def_cfa_offset 24 .cfi_offset 3, -24 xor ebx, ebx sub rsp, 24 .cfi_def_cfa_offset 48 vmovaps xmm0, XMMWORD PTR .LC0[rip] mov rbp, rsp vmovaps XMMWORD PTR [rsp], xmm0.L4: mov esi, ebx vxorpd xmm1, xmm1, xmm1 mov edi, OFFSET FLAT:.LC1 vcvtss2sd xmm0, xmm1, DWORD PTR [rbp+0+rbx*4] mov eax, 1 add rbx, 1 call printf cmp rbx, 4 jne .L4 add rsp, 24 .cfi_def_cfa_offset 24 xor eax, eax pop rbx .cfi_def_cfa_offset 16 pop rbp .cfi_def_cfa_offset 8 ret .cfi_endproc.LFE530: .size main, .-main .section .rodata.cst16,"aM",@progbits,16 .align 16.LC0: .long 1082969293 .long 1091357901 .long 1095971635 .long 1099746509 .ident "GCC: (SUSE Linux) 14.2.0" .section .note.GNU-stack,"",@progbits
So other than the second approach adding a vaddps using rsi in the compute() function (with the initial load of c now coming through rdx), the rest is optimized identically. (Note that in both cases the compiler has also constant-folded the whole computation in main(): the .LC0 constants are the final sums, so main() never even calls compute().)
What I hoped to find was that the compiler could optimize one approach better than the other, but it appears to be a wash.
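For comparison, a third variant I sketched but have not benchmarked (the sum_chunks helper name is mine, and this is a sketch only, again assuming n is a multiple of 4) keeps the running total in an __m128 register across the whole loop and stores to c once at the end, avoiding the per-chunk load and store of c:

#include <xmmintrin.h>

/* sketch: accumulate in a register across the loop, store once at the end */
void sum_chunks (const float *a, const float *b, float *c, int n)
{
    __m128 vsum = _mm_setzero_ps();

    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        vsum = _mm_add_ps(vsum, _mm_add_ps(va, vb));
    }

    _mm_storeu_ps(c, vsum);   /* one store instead of one per chunk */
}

Whether the compiler would perform this hoisting on its own once compute() is inlined, I don't know.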
Beyond comparing the assembly produced, is there any general principle for handling the summing of SSE vectors that would favor one approach over the other?