Channel: Active questions tagged gcc - Stack Overflow

GCC not using SSE intrinsics in compiled code [duplicate]


I'm testing to find the fastest way to compute the dot product of two vectors, and whether anything beats the plain a.x * b.x + a.y * b.y + a.z * b.z. After reading a number of posts on here, I decided to try one of the functions from this answer.
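As a minimal sketch of the scalar baseline mentioned above: the `vec3` struct and the `dot3_scalar` name are hypothetical, introduced only for illustration, since the question does not show the actual vector type.

```c
/* Hypothetical 3-component vector type; name and layout are assumptions. */
typedef struct { float x, y, z; } vec3;

/* Scalar baseline: three multiplies and two adds. */
static float dot3_scalar(vec3 a, vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

Any SIMD variant would need to beat this on the target machine to be worth the extra complexity.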

I have the following function in my C file:

float hsum_sse1(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf        = _mm_movehl_ps(shuf, sums); // high half -> low half
    sums        = _mm_add_ss(sums, shuf);
    return        _mm_cvtss_f32(sums);
}

and I compiled it with gcc -std=c11 -march=native main.c, but when I disassembled the result with objdump, I got a function that doesn't look like it uses the intrinsics that I specified:

00000000004005bd <hsum_sse1>:
  4005bd:   55                      push   %rbp
  4005be:   48 89 e5                mov    %rsp,%rbp
  4005c1:   48 83 ec 3c             sub    $0x3c,%rsp
  4005c5:   c5 f8 29 85 50 ff ff    vmovaps %xmm0,-0xb0(%rbp)
  4005cc:   ff
  4005cd:   c5 f8 28 85 50 ff ff    vmovaps -0xb0(%rbp),%xmm0
  4005d4:   ff
  4005d5:   c5 f8 29 45 d0          vmovaps %xmm0,-0x30(%rbp)
  4005da:   c5 fa 16 45 d0          vmovshdup -0x30(%rbp),%xmm0
  4005df:   c5 f8 29 45 f0          vmovaps %xmm0,-0x10(%rbp)
  4005e4:   c5 f8 28 85 50 ff ff    vmovaps -0xb0(%rbp),%xmm0
  4005eb:   ff
  4005ec:   c5 f8 29 45 c0          vmovaps %xmm0,-0x40(%rbp)
  4005f1:   c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  4005f6:   c5 f8 29 45 b0          vmovaps %xmm0,-0x50(%rbp)
  4005fb:   c5 f8 28 45 b0          vmovaps -0x50(%rbp),%xmm0
  400600:   c5 f8 28 4d c0          vmovaps -0x40(%rbp),%xmm1
  400605:   c5 f0 58 c0             vaddps %xmm0,%xmm1,%xmm0
  400609:   c5 f8 29 45 e0          vmovaps %xmm0,-0x20(%rbp)
  40060e:   c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  400613:   c5 f8 29 45 a0          vmovaps %xmm0,-0x60(%rbp)
  400618:   c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  40061d:   c5 f8 29 45 90          vmovaps %xmm0,-0x70(%rbp)
  400622:   c5 f8 28 45 90          vmovaps -0x70(%rbp),%xmm0
  400627:   c5 f8 28 4d a0          vmovaps -0x60(%rbp),%xmm1
  40062c:   c5 f0 12 c0             vmovhlps %xmm0,%xmm1,%xmm0
  400630:   c5 f8 29 45 f0          vmovaps %xmm0,-0x10(%rbp)
  400635:   c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  40063a:   c5 f8 29 45 80          vmovaps %xmm0,-0x80(%rbp)
  40063f:   c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  400644:   c5 f8 29 85 70 ff ff    vmovaps %xmm0,-0x90(%rbp)
  40064b:   ff
  40064c:   c5 f8 28 45 80          vmovaps -0x80(%rbp),%xmm0
  400651:   c5 fa 58 85 70 ff ff    vaddss -0x90(%rbp),%xmm0,%xmm0
  400658:   ff
  400659:   c5 f8 29 45 e0          vmovaps %xmm0,-0x20(%rbp)
  40065e:   c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  400663:   c5 f8 29 85 60 ff ff    vmovaps %xmm0,-0xa0(%rbp)
  40066a:   ff
  40066b:   c5 f8 28 85 60 ff ff    vmovaps -0xa0(%rbp),%xmm0
  400672:   ff
  400673:   c5 f8 28 c0             vmovaps %xmm0,%xmm0
  400677:   c5 fa 11 85 4c ff ff    vmovss %xmm0,-0xb4(%rbp)
  40067e:   ff
  40067f:   8b 85 4c ff ff ff       mov    -0xb4(%rbp),%eax
  400685:   89 85 4c ff ff ff       mov    %eax,-0xb4(%rbp)
  40068b:   c5 fa 10 85 4c ff ff    vmovss -0xb4(%rbp),%xmm0
  400692:   ff
  400693:   c9                      leaveq
  400694:   c3                      retq

I don't know if it makes a difference, but I'm compiling this code on a CentOS VM running on Windows. Just to make sure, I downloaded Coreinfo from here and got the following output:

FPU             *       Implements i387 floating point instructions
MMX             *       Supports MMX instruction set
MMXEXT          -       Implements AMD MMX extensions
3DNOW           -       Supports 3DNow! instructions
3DNOWEXT        -       Supports 3DNow! extension instructions
SSE             *       Supports Streaming SIMD Extensions
SSE2            *       Supports Streaming SIMD Extensions 2
SSE3            *       Supports Streaming SIMD Extensions 3
SSSE3           *       Supports Supplemental SIMD Extensions 3
SSE4a           -       Supports Streaming SIMDR Extensions 4a
SSE4.1          *       Supports Streaming SIMD Extensions 4.1
SSE4.2          *       Supports Streaming SIMD Extensions 4.2

so it seems like my CPU should be able to use the SSE instructions that I wrote in the C file. I also checked my GCC version (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)), which seems compatible too. How can I get a more efficient compiled function?
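For context, here is a rough sketch of how a horizontal sum like the one above could slot into a 3-component dot product. It is not the code from the question. It swaps in an SSE1-only shuffle (`_mm_shuffle_ps`) in place of the SSE3 `_mm_movehdup_ps`, so it builds on x86-64 without extra `-m` flags. The names `hsum_ps_sse1` and `dot3_sse`, and the use of plain `float[3]` arrays, are assumptions made for illustration.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics; assumes an x86 or x86-64 target */

/* Horizontal sum of four floats using only SSE1 (swapped-in variant). */
static float hsum_ps_sse1(__m128 v) {
    /* Swap neighbours within each pair: [v1, v0, v3, v2] */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Pairwise sums: [v0+v1, v0+v1, v2+v3, v2+v3] */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Move the high half of sums into the low half of shuf */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Element 0 becomes (v0+v1) + (v2+v3) */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Hypothetical 3-component dot product: pad the 4th lane with zero,
   multiply lane-wise, then reduce with the horizontal sum. */
static float dot3_sse(const float a[3], const float b[3]) {
    __m128 va = _mm_set_ps(0.0f, a[2], a[1], a[0]);
    __m128 vb = _mm_set_ps(0.0f, b[2], b[1], b[0]);
    return hsum_ps_sse1(_mm_mul_ps(va, vb));
}
```

When comparing the generated assembly for something like this, it may be worth building with an optimization level such as `-O2`. With no `-O` option GCC defaults to `-O0`, which keeps every intermediate value on the stack and so produces many extra load/store instructions around the actual SIMD operations.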

