I'm doing some testing to find the fastest way of computing the dot product of two vectors on my machine, and whether I can beat the straightforward a.x * b.x + a.y * b.y + a.z * b.z. I've been looking at a lot of posts on here, and I decided to try one of the functions from this answer.
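For reference, the scalar version I'm benchmarking against looks like this (vec3 is my own struct, not from any library):

```c
/* My own three-component vector type. */
typedef struct { float x, y, z; } vec3;

/* The plain scalar dot product I'm trying to beat. */
static float dot_scalar(vec3 a, vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```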
I have the following function in my C file:
```c
#include <immintrin.h>

float hsum_sse1(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);   // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf = _mm_movehl_ps(shuf, sums);   // high half -> low half
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```
I compiled it with gcc -std=c11 -march=native main.c, but when I ran objdump to look at the generated assembly, I got a function that is far longer than the handful of intrinsics I specified, with every intermediate value spilled to the stack:
```
00000000004005bd <hsum_sse1>:
  4005bd: 55                      push   %rbp
  4005be: 48 89 e5                mov    %rsp,%rbp
  4005c1: 48 83 ec 3c             sub    $0x3c,%rsp
  4005c5: c5 f8 29 85 50 ff ff    vmovaps %xmm0,-0xb0(%rbp)
  4005cc: ff
  4005cd: c5 f8 28 85 50 ff ff    vmovaps -0xb0(%rbp),%xmm0
  4005d4: ff
  4005d5: c5 f8 29 45 d0          vmovaps %xmm0,-0x30(%rbp)
  4005da: c5 fa 16 45 d0          vmovshdup -0x30(%rbp),%xmm0
  4005df: c5 f8 29 45 f0          vmovaps %xmm0,-0x10(%rbp)
  4005e4: c5 f8 28 85 50 ff ff    vmovaps -0xb0(%rbp),%xmm0
  4005eb: ff
  4005ec: c5 f8 29 45 c0          vmovaps %xmm0,-0x40(%rbp)
  4005f1: c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  4005f6: c5 f8 29 45 b0          vmovaps %xmm0,-0x50(%rbp)
  4005fb: c5 f8 28 45 b0          vmovaps -0x50(%rbp),%xmm0
  400600: c5 f8 28 4d c0          vmovaps -0x40(%rbp),%xmm1
  400605: c5 f0 58 c0             vaddps %xmm0,%xmm1,%xmm0
  400609: c5 f8 29 45 e0          vmovaps %xmm0,-0x20(%rbp)
  40060e: c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  400613: c5 f8 29 45 a0          vmovaps %xmm0,-0x60(%rbp)
  400618: c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  40061d: c5 f8 29 45 90          vmovaps %xmm0,-0x70(%rbp)
  400622: c5 f8 28 45 90          vmovaps -0x70(%rbp),%xmm0
  400627: c5 f8 28 4d a0          vmovaps -0x60(%rbp),%xmm1
  40062c: c5 f0 12 c0             vmovhlps %xmm0,%xmm1,%xmm0
  400630: c5 f8 29 45 f0          vmovaps %xmm0,-0x10(%rbp)
  400635: c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  40063a: c5 f8 29 45 80          vmovaps %xmm0,-0x80(%rbp)
  40063f: c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  400644: c5 f8 29 85 70 ff ff    vmovaps %xmm0,-0x90(%rbp)
  40064b: ff
  40064c: c5 f8 28 45 80          vmovaps -0x80(%rbp),%xmm0
  400651: c5 fa 58 85 70 ff ff    vaddss -0x90(%rbp),%xmm0,%xmm0
  400658: ff
  400659: c5 f8 29 45 e0          vmovaps %xmm0,-0x20(%rbp)
  40065e: c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  400663: c5 f8 29 85 60 ff ff    vmovaps %xmm0,-0xa0(%rbp)
  40066a: ff
  40066b: c5 f8 28 85 60 ff ff    vmovaps -0xa0(%rbp),%xmm0
  400672: ff
  400673: c5 f8 28 c0             vmovaps %xmm0,%xmm0
  400677: c5 fa 11 85 4c ff ff    vmovss %xmm0,-0xb4(%rbp)
  40067e: ff
  40067f: 8b 85 4c ff ff ff       mov    -0xb4(%rbp),%eax
  400685: 89 85 4c ff ff ff       mov    %eax,-0xb4(%rbp)
  40068b: c5 fa 10 85 4c ff ff    vmovss -0xb4(%rbp),%xmm0
  400692: ff
  400693: c9                      leaveq
  400694: c3                      retq
```
I don't know if it makes a difference, but I'm compiling this code on a CentOS VM running on Windows. Just to make sure my CPU supports these instructions, I downloaded Coreinfo from here and got the following output:
```
FPU       *  Implements i387 floating point instructions
MMX       *  Supports MMX instruction set
MMXEXT    -  Implements AMD MMX extensions
3DNOW     -  Supports 3DNow! instructions
3DNOWEXT  -  Supports 3DNow! extension instructions
SSE       *  Supports Streaming SIMD Extensions
SSE2      *  Supports Streaming SIMD Extensions 2
SSE3      *  Supports Streaming SIMD Extensions 3
SSSE3     *  Supports Supplemental SIMD Extensions 3
SSE4a     -  Supports Streaming SIMDR Extensions 4a
SSE4.1    *  Supports Streaming SIMD Extensions 4.1
SSE4.2    *  Supports Streaming SIMD Extensions 4.2
```
so it seems like my CPU should be able to run the SSE instructions I used in the C file. I also checked my GCC version (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)), which seems compatible too. How can I get a more efficient compiled function?