I'm doing some testing to find the fastest way of computing the dot product of two vectors on my machine, and whether I can beat the straightforward a.x * b.x + a.y * b.y + a.z * b.z. I've been looking at a lot of posts on here, and I decided to try one of the functions from this answer.
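For reference, the scalar version I'm benchmarking against looks like this (vec3 is my own struct, not from any library):

```c
/* My own three-component vector type. */
typedef struct { float x, y, z; } vec3;

/* The plain scalar dot product I'm trying to beat. */
static float dot_scalar(vec3 a, vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```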
I have the following function in my C file:
```c
#include <immintrin.h>

float hsum_sse1(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);   // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf = _mm_movehl_ps(shuf, sums);   // high half -> low half
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```
I compiled it with gcc -std=c11 -march=native main.c, but when I ran objdump to look at the generated assembly, I got a function that is far longer than the handful of intrinsics I specified, with every intermediate value spilled to the stack:
```
00000000004005bd <hsum_sse1>:
  4005bd: 55                      push   %rbp
  4005be: 48 89 e5                mov    %rsp,%rbp
  4005c1: 48 83 ec 3c             sub    $0x3c,%rsp
  4005c5: c5 f8 29 85 50 ff ff    vmovaps %xmm0,-0xb0(%rbp)
  4005cc: ff
  4005cd: c5 f8 28 85 50 ff ff    vmovaps -0xb0(%rbp),%xmm0
  4005d4: ff
  4005d5: c5 f8 29 45 d0          vmovaps %xmm0,-0x30(%rbp)
  4005da: c5 fa 16 45 d0          vmovshdup -0x30(%rbp),%xmm0
  4005df: c5 f8 29 45 f0          vmovaps %xmm0,-0x10(%rbp)
  4005e4: c5 f8 28 85 50 ff ff    vmovaps -0xb0(%rbp),%xmm0
  4005eb: ff
  4005ec: c5 f8 29 45 c0          vmovaps %xmm0,-0x40(%rbp)
  4005f1: c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  4005f6: c5 f8 29 45 b0          vmovaps %xmm0,-0x50(%rbp)
  4005fb: c5 f8 28 45 b0          vmovaps -0x50(%rbp),%xmm0
  400600: c5 f8 28 4d c0          vmovaps -0x40(%rbp),%xmm1
  400605: c5 f0 58 c0             vaddps %xmm0,%xmm1,%xmm0
  400609: c5 f8 29 45 e0          vmovaps %xmm0,-0x20(%rbp)
  40060e: c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  400613: c5 f8 29 45 a0          vmovaps %xmm0,-0x60(%rbp)
  400618: c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  40061d: c5 f8 29 45 90          vmovaps %xmm0,-0x70(%rbp)
  400622: c5 f8 28 45 90          vmovaps -0x70(%rbp),%xmm0
  400627: c5 f8 28 4d a0          vmovaps -0x60(%rbp),%xmm1
  40062c: c5 f0 12 c0             vmovhlps %xmm0,%xmm1,%xmm0
  400630: c5 f8 29 45 f0          vmovaps %xmm0,-0x10(%rbp)
  400635: c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  40063a: c5 f8 29 45 80          vmovaps %xmm0,-0x80(%rbp)
  40063f: c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  400644: c5 f8 29 85 70 ff ff    vmovaps %xmm0,-0x90(%rbp)
  40064b: ff
  40064c: c5 f8 28 45 80          vmovaps -0x80(%rbp),%xmm0
  400651: c5 fa 58 85 70 ff ff    vaddss -0x90(%rbp),%xmm0,%xmm0
  400658: ff
  400659: c5 f8 29 45 e0          vmovaps %xmm0,-0x20(%rbp)
  40065e: c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  400663: c5 f8 29 85 60 ff ff    vmovaps %xmm0,-0xa0(%rbp)
  40066a: ff
  40066b: c5 f8 28 85 60 ff ff    vmovaps -0xa0(%rbp),%xmm0
  400672: ff
  400673: c5 f8 28 c0             vmovaps %xmm0,%xmm0
  400677: c5 fa 11 85 4c ff ff    vmovss %xmm0,-0xb4(%rbp)
  40067e: ff
  40067f: 8b 85 4c ff ff ff       mov    -0xb4(%rbp),%eax
  400685: 89 85 4c ff ff ff       mov    %eax,-0xb4(%rbp)
  40068b: c5 fa 10 85 4c ff ff    vmovss -0xb4(%rbp),%xmm0
  400692: ff
  400693: c9                      leaveq
  400694: c3                      retq
```
I don't know if it makes a difference, but I'm compiling this code on a CentOS VM running on Windows. Just to make sure my CPU supports these instructions, I downloaded Coreinfo from here and got the following output:
```
FPU       *  Implements i387 floating point instructions
MMX       *  Supports MMX instruction set
MMXEXT    -  Implements AMD MMX extensions
3DNOW     -  Supports 3DNow! instructions
3DNOWEXT  -  Supports 3DNow! extension instructions
SSE       *  Supports Streaming SIMD Extensions
SSE2      *  Supports Streaming SIMD Extensions 2
SSE3      *  Supports Streaming SIMD Extensions 3
SSSE3     *  Supports Supplemental SIMD Extensions 3
SSE4a     -  Supports Streaming SIMDR Extensions 4a
SSE4.1    *  Supports Streaming SIMD Extensions 4.1
SSE4.2    *  Supports Streaming SIMD Extensions 4.2
```
so it seems like my CPU should be able to run the SSE instructions I used in the C file. I also checked my GCC version (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)), which seems compatible too. How can I get a more efficient compiled function?