My goal is to return a 4x4 floating-point matrix from a function by value without going through memory. As the Wikipedia article on x86 calling conventions (https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI) points out, the System V AMD64 ABI allows a function to return up to two floating-point values in XMM0 and XMM1.
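To make that rule concrete, here is a minimal sketch with plain scalars, assuming GCC on x86-64 Linux. `Vec2d` and `MakeVec2d` are hypothetical names used only for illustration; they are not part of my real code.

```cpp
// Hypothetical type for illustration only.
struct Vec2d
{
    double x, y; // 16 bytes: two eightbytes, both of class SSE
};

// Under the System V AMD64 ABI this should come back in XMM0 (x) and
// XMM1 (y), with no hidden result pointer in RDI.
Vec2d MakeVec2d(double x, double y)
{
    return {x, y};
}

static_assert(sizeof(Vec2d) == 16, "fits in two eightbytes");
```

Compiling this with `gcc -O2 -S` should show the result left in XMM0 and XMM1 with no store through RDI. I would like the same behavior with YMM registers.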
I tried this:
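#include <immintrin.h> // __m256; build with -mavx

struct Mat4 // just a simple struct for testing
{
    __m256 m0, m1; // 2 x 32 bytes = 64 bytes, enough for a 4x4 float matrix
};

Mat4 Foo(__m256 m0, __m256 m1, __m256 m2, __m256 m3)
{
    return {m1, m2};
}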
But gcc gives me this as the result:
mov %rdi,%rax
vmovaps %ymm1,(%rdi)
vmovaps %ymm2,0x20(%rdi)
retq
I was expecting something like this:
vmovaps %ymm1, %ymm0
vmovaps %ymm2, %ymm1
retq
Is there any way to force gcc to return the whole struct Mat4 in just YMM0 and YMM1?