g++ is generating extra move instructions and I'm not sure why (Godbolt link).
This is happening around the _mm512_dpbusds_epi32 intrinsic. The instruction computes dot products of groups of four unsigned 8-bit integers with signed 8-bit integers, then adds the results to a packed 32-bit accumulator (in this case with a saturating add). The instruction is a bit unusual in that it both reads from and writes to the accumulator.
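For reference, here is a minimal scalar sketch of what each 32-bit lane of the instruction does (a hypothetical helper for illustration, not part of the test program below): four unsigned-by-signed byte products are summed and added to the accumulator with signed saturation.

#include <algorithm>
#include <cstdint>

// Hypothetical scalar model of one 32-bit lane of vpdpbusds:
// acc = saturate_s32(acc + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3])
int32_t DpbusdsLane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    int64_t sum = acc;
    for (int k = 0; k < 4; ++k) {
        sum += int64_t(a[k]) * int64_t(b[k]);  // unsigned byte times signed byte
    }
    // Signed saturation to the 32-bit range (the trailing "s" in dpbusds).
    sum = std::min<int64_t>(std::max<int64_t>(sum, INT32_MIN), INT32_MAX);
    return static_cast<int32_t>(sum);
}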
When compiled with gcc, the compiler is emitting extra move instructions (vmovdqa64) on the accumulator.
Here's a test program that accumulates some dot products:
#include <immintrin.h>
#include <cstddef>

__m512i Slow(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t count) {
    __m512i c0 = _mm512_setzero_epi32();
    __m512i c1 = _mm512_setzero_epi32();
    for (std::size_t i = 0; i < count; ++i) {
        c0 = _mm512_dpbusds_epi32(c0, a[i], b0);
        c1 = _mm512_dpbusds_epi32(c1, a[i], b1);
    }
    // Do not optimize away
    return _mm512_sub_epi32(c0, c1);
}
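A small driver along these lines can exercise Slow; the sizes and fill values are arbitrary and this main is not from the original test, but it shows the one real requirement: a must be 64-byte aligned, because the loop loads it with vmovdqa64. It assumes C++17 (for std::aligned_alloc) and is linked against the file containing Slow.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

__m512i Slow(const __m512i *a, __m512i b0, __m512i b1, std::size_t count);

int main() {
    const std::size_t count = 1024;
    // vmovdqa64 faults on unaligned addresses, so request 64-byte alignment.
    __m512i *a = static_cast<__m512i *>(std::aligned_alloc(64, count * sizeof(__m512i)));
    for (std::size_t i = 0; i < count; ++i) a[i] = _mm512_set1_epi8(1);
    __m512i result = Slow(a, _mm512_set1_epi8(2), _mm512_set1_epi8(3), count);
    alignas(64) int32_t lanes[16];
    _mm512_store_si512(reinterpret_cast<__m512i *>(lanes), result);
    std::printf("%d\n", lanes[0]);  // expect -4096: (4*1*2 - 4*1*3) * 1024 per lane
    std::free(a);
    return 0;
}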
When compiled with g++ -O3 -mavx512vnni example.cc -S, this is the main loop:
.L3:
        vmovdqa64 (%rdi), %zmm6
        vmovdqa64 %zmm3, %zmm0
        vmovdqa64 %zmm4, %zmm2
        addq $64, %rdi
        vpdpbusds %zmm5, %zmm6, %zmm0
        vpdpbusds %zmm1, %zmm6, %zmm2
        vmovdqa64 %zmm0, %zmm3
        vmovdqa64 %zmm2, %zmm4
        cmpq %rdi, %rax
        jne .L3
The above assembly is copying an accumulator from zmm3 to zmm0, updating zmm0, and copying it back to zmm3. This is unnecessary; it should just use one of zmm0 or zmm3 as an accumulator.
The problem is the same on g++ (Gentoo 9.2.0-r2 p3) 9.2.0 and g++ (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0.
clang++ 9.0.1 avoids the unnecessary copying (it also unrolled the loop, but here is the tightest version):
.LBB0_6:                              # =>This Inner Loop Header: Depth=1
        vmovaps (%rdi), %zmm4
        vpdpbusds %zmm0, %zmm4, %zmm3
        vpdpbusds %zmm1, %zmm4, %zmm2
        addq $64, %rdi
        addq $-1, %rax
        jne .LBB0_6
I was able to work around the problem in g++ by using inline asm.
#include <immintrin.h>
#include <cstddef>

__m512i Fast(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t count) {
    __m512i c0 = _mm512_setzero_epi32();
    __m512i c1 = _mm512_setzero_epi32();
    for (std::size_t i = 0; i < count; ++i) {
        asm ("vpdpbusds %2, %1, %0" : "+x"(c0) : "x"(a[i]), "mx"(b0));
        asm ("vpdpbusds %2, %1, %0" : "+x"(c1) : "x"(a[i]), "mx"(b1));
    }
    // Do not optimize away
    return _mm512_sub_epi32(c0, c1);
}
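The key is the "+x" constraint, which declares each accumulator as a single read-write register operand, so GCC has to update c0 and c1 in place instead of allocating separate input and output registers; "mx" allows the corresponding source to come from either a register or memory.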
The loop g++ generates for Fast is much better:
.L3:
#APP
# 7 "asm.cc" 1
        vpdpbusds (%rdi), %zmm3, %zmm0
# 0 "" 2
# 8 "asm.cc" 1
        vpdpbusds (%rdi), %zmm1, %zmm2
# 0 "" 2
#NO_APP
        addq $64, %rdi
        cmpq %rax, %rdi
        jne .L3