Channel: Active questions tagged gcc - Stack Overflow

How to force the compiler to pass a "vector of 4" wrapper class as single XMM register?

I'm trying to optimize a small "vector of 4 floats" wrapper class, and of course I want to make it convenient as well. For example:

typedef float v4f __attribute__ ((vector_size (16)));

struct V4 {

    union {
        v4f packed;
#if 1
        struct { float r, g, b, a; };
#endif
#if 1
        float data[4];
#endif
    };

    V4() = default;
    V4(v4f v) : packed(v) {}
};

V4 AddV4(V4 a, V4 b) { 
    return a.packed + b.packed; 
}
V4 MulV4(V4 a, V4 b) { 
    return a.packed * b.packed; 
}

static_assert(sizeof(V4) == 16);

I know that reading an inactive union member is undefined behavior in standard C++ (and anonymous structs are a compiler extension), but in practice it works fine with gcc and clang ;-)

The problem is the following: I tested this in godbolt (see https://godbolt.org/z/fXbtre), using both gcc and clang, with the command line arguments:

-O3  -fomit-frame-pointer -fno-rtti -fno-exceptions -mavx -ffast-math 

If I remove both the struct and the array from the union (i.e. set both to #if 0), I get really compact AddV4 and MulV4 functions, e.g.:

AddV4(V4, V4):
        vaddps  xmm0, xmm0, xmm1
        ret

But if I enable either of the two (or both), I get:

AddV4(V4, V4):
        vmovq   QWORD PTR [rsp-32], xmm1
        vmovq   QWORD PTR [rsp-40], xmm0
        vmovaps xmm5, XMMWORD PTR [rsp-40]
        vmovq   QWORD PTR [rsp-24], xmm2
        vmovq   QWORD PTR [rsp-16], xmm3
        vaddps  xmm4, xmm5, XMMWORD PTR [rsp-24]
        vmovaps XMMWORD PTR [rsp-40], xmm4
        mov     rax, QWORD PTR [rsp-32]
        vmovq   xmm0, QWORD PTR [rsp-40]
        vmovq   xmm1, rax
        mov     QWORD PTR [rsp-24], rax
        ret

Can someone explain why? Is there a compiler flag for gcc/clang that fixes this? Or is keeping only the packed member really the only option? In that case I would need to write accessor methods x(), y(), z(), w(), which would be quite a big change in our codebase, so I would prefer another option first.
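For reference, the accessor-based fallback mentioned above could look roughly like the sketch below. It assumes gcc or clang (it relies on their vector extensions, including subscripting a vector with []). The accessor names r(), g(), b(), a(), the set_* methods, operator[] and CheckAccessors() are hypothetical and chosen only for illustration. The two-register passing in the second listing looks consistent with the x86-64 System V calling convention classifying a type with plain float members differently from a pure vector type, so this version keeps the vector as the only data member. Getters return by value, since clang does not let a non-const reference bind to a vector element.

```cpp
// Sketch only: relies on gcc/clang vector extensions, not standard C++.
#include <cassert>

typedef float v4f __attribute__((vector_size(16)));

struct V4 {
    v4f packed;  // the only data member: no float struct or array alongside it

    V4() = default;
    V4(v4f v) : packed(v) {}

    // Hypothetical read accessors standing in for the r, g, b, a members.
    float r() const { return packed[0]; }
    float g() const { return packed[1]; }
    float b() const { return packed[2]; }
    float a() const { return packed[3]; }

    // Hypothetical write accessors.
    void set_r(float v) { packed[0] = v; }
    void set_g(float v) { packed[1] = v; }
    void set_b(float v) { packed[2] = v; }
    void set_a(float v) { packed[3] = v; }

    // Hypothetical read-only stand-in for data[i].
    float operator[](int i) const { return packed[i]; }
};

V4 AddV4(V4 a, V4 b) { return a.packed + b.packed; }
V4 MulV4(V4 a, V4 b) { return a.packed * b.packed; }

static_assert(sizeof(V4) == 16, "V4 should stay exactly one 16-byte vector");

// Small usage example: element-wise add, then read and write single lanes.
inline void CheckAccessors() {
    v4f va = {1.0f, 2.0f, 3.0f, 4.0f};
    v4f vb = {10.0f, 20.0f, 30.0f, 40.0f};
    V4 sum = AddV4(V4(va), V4(vb));
    assert(sum.r() == 11.0f && sum.a() == 44.0f);
    sum.set_g(5.0f);
    assert(sum.g() == 5.0f && sum[2] == 33.0f);
}
```

Whether this compiles down to the single vaddps shown in the first listing should be checked on Godbolt with the same flags; the sketch shows only the shape of the interface change.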

