I'm trying to optimize a small "vector of 4 floats" wrapper class, and of course I want to make it convenient as well. For example:
typedef float v4f __attribute__ ((vector_size (16)));
struct V4 {
    union {
        v4f packed;
#if 1
        struct { float r, g, b, a; };
#endif
#if 1
        float data[4];
#endif
    };
    V4() = default;
    V4(v4f v) : packed(v) {}
};
V4 AddV4(V4 a, V4 b) {
    return a.packed + b.packed;
}
V4 MulV4(V4 a, V4 b) {
    return a.packed * b.packed;
}
static_assert(sizeof(V4) == 16);
I know that reading a union member other than the one last written is undefined behavior in standard C++, but in practice it's working fine ;-)
The problem is the following: I tested this on godbolt (see https://godbolt.org/z/fXbtre), using both gcc and clang, with the command-line arguments:
-O3 -fomit-frame-pointer -fno-rtti -fno-exceptions -mavx -ffast-math
If I disable both the struct and the array in the union (i.e. set both to #if 0), I get really compact AddV4 and MulV4 functions, e.g.:
AddV4(V4, V4):
        vaddps xmm0, xmm0, xmm1
        ret
But if I enable either of the two, I get:
AddV4(V4, V4):
        vmovq QWORD PTR [rsp-32], xmm1
        vmovq QWORD PTR [rsp-40], xmm0
        vmovaps xmm5, XMMWORD PTR [rsp-40]
        vmovq QWORD PTR [rsp-24], xmm2
        vmovq QWORD PTR [rsp-16], xmm3
        vaddps xmm4, xmm5, XMMWORD PTR [rsp-24]
        vmovaps XMMWORD PTR [rsp-40], xmm4
        mov rax, QWORD PTR [rsp-32]
        vmovq xmm0, QWORD PTR [rsp-40]
        vmovq xmm1, rax
        mov QWORD PTR [rsp-24], rax
        ret
Can someone explain why? Is there a compiler flag for gcc/clang I could use to fix this? Or is using only the packed member really the only option? (In that case I would need to write accessor methods x(), y(), z(), w(), which would be quite a big change in our codebase, so I would prefer another option first.)
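For reference, the accessor-based fallback I have in mind would look roughly like the sketch below. It keeps packed as the sole data member and reads or writes individual lanes through the subscript operator that the gcc/clang vector extensions provide. The setter names (set_x and so on) are hypothetical and only there for illustration, and I have not confirmed that this layout produces the compact code on every compiler version, so treat it as a starting point to check on godbolt rather than a verified fix.

```cpp
typedef float v4f __attribute__((vector_size(16)));

struct V4 {
    v4f packed;  // the only data member: no union, no anonymous struct

    V4() = default;
    V4(v4f v) : packed(v) {}

    // Read accessors: return the lane by value via vector subscripting
    // (a gcc/clang vector-extension feature, not standard C++).
    float x() const { return packed[0]; }
    float y() const { return packed[1]; }
    float z() const { return packed[2]; }
    float w() const { return packed[3]; }

    // Hypothetical setters (names are an assumption, not from the code above).
    // Returning float& to a lane is avoided because clang rejects binding
    // a reference to a vector element.
    void set_x(float v) { packed[0] = v; }
    void set_y(float v) { packed[1] = v; }
    void set_z(float v) { packed[2] = v; }
    void set_w(float v) { packed[3] = v; }
};

// Same operations as before; the v4f result converts back to V4
// through the non-explicit V4(v4f) constructor.
inline V4 AddV4(V4 a, V4 b) { return a.packed + b.packed; }
inline V4 MulV4(V4 a, V4 b) { return a.packed * b.packed; }

static_assert(sizeof(V4) == 16, "V4 should stay 16 bytes");
```

Call sites would change from v.r to v.x() for reads and from v.r = f to v.set_x(f) for writes, which is exactly the codebase-wide change I would like to avoid if there is another way.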