I am compiling the code below with optimization enabled, but it looks as though the two sums could be performed more efficiently using the SIMD capabilities of the underlying hardware. What is the right combination of GCC flags to generate assembly that loads the operands in pairs and performs the two additions in a single instruction?
#include <iostream>

struct foo {
    float val[2];

    foo(float a, float b)
    {
        val[0] = a;
        val[1] = b;
    }

    foo& operator+=(const foo &rhs)
    {
        val[0] += rhs.val[0];
        val[1] += rhs.val[1];
        return *this;
    }
};

int main(void)
{
    volatile float values[] = { 2.0, 3.0, 4.0, 7.0 };
    foo first(values[0], values[1]);
    foo second(values[2], values[3]);
    second += first;
    std::cout << "(" << second.val[0] << "," << second.val[1] << ")" << std::endl;
    return 1;
}
The generated assembly looks like this (for operator+=() alone); it seems pretty apparent that all operands are handled individually, with scalar loads (vmovss) and scalar additions (vaddss).
400712: c5 fa 10 4c 24 14 vmovss 0x14(%rsp),%xmm1
400718: c5 fa 10 5c 24 10 vmovss 0x10(%rsp),%xmm3
foo second(values[2], values[3]);
40071e: c5 fa 10 44 24 1c vmovss 0x1c(%rsp),%xmm0
400724: c5 fa 10 54 24 18 vmovss 0x18(%rsp),%xmm2
val[1] += rhs.val[1];
40072a: c5 f2 58 e8 vaddss %xmm0,%xmm1,%xmm5
val[0] += rhs.val[0];
40072e: c5 e2 58 e2 vaddss %xmm2,%xmm3,%xmm4
val[1] += rhs.val[1];
400732: c5 fa 11 6c 24 0c vmovss %xmm5,0xc(%rsp)
val[0] += rhs.val[0];
400738: c5 fa 11 64 24 08 vmovss %xmm4,0x8(%rsp)
I compile with this command (but removing -mavx2 does not change the result much):
g++ -O3 -mavx2 -g -std=c++11 main.cpp -o run
In case it matters, this is GCC 6.3, and I do not really have the freedom to upgrade.
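To make the goal concrete, here is a hand-written sketch of roughly what I would like the compiler to produce on its own. It is only an illustration, not something I have in the real code: the name foo_simd is made up for this example, the alignas(8) is an extra assumption I added to keep the 64-bit accesses aligned, and it relies on the SSE intrinsics from <xmmintrin.h> (_mm_loadl_pi, _mm_add_ps, _mm_storel_pi). I would expect each pair of floats to be loaded with one 64-bit move and the two additions to collapse into a single packed add, but I have not verified what GCC 6.3 actually emits for it.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadl_pi, _mm_add_ps, _mm_storel_pi

// Hypothetical variant of foo, used only to illustrate the desired code shape.
struct foo_simd {
    // Assumption: 8-byte alignment so both floats can be moved as one 64-bit unit.
    alignas(8) float val[2];

    foo_simd(float a, float b)
    {
        val[0] = a;
        val[1] = b;
    }

    foo_simd& operator+=(const foo_simd &rhs)
    {
        // Load both floats of each operand into the low half of an XMM register.
        // The upper two lanes are zeroed so they hold well-defined values.
        __m128 lhs_v = _mm_loadl_pi(_mm_setzero_ps(),
                                    reinterpret_cast<const __m64*>(val));
        __m128 rhs_v = _mm_loadl_pi(_mm_setzero_ps(),
                                    reinterpret_cast<const __m64*>(rhs.val));

        // One packed addition covers val[0] and val[1] at the same time.
        __m128 sum = _mm_add_ps(lhs_v, rhs_v);

        // Store only the low two lanes back into val.
        _mm_storel_pi(reinterpret_cast<__m64*>(val), sum);
        return *this;
    }
};
```

Used exactly like foo in main() above, e.g. foo_simd second(values[2], values[3]); second += first;. Ideally I would not have to write intrinsics at all and could get the same effect from the plain operator+= with the right flags.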