Channel: Active questions tagged gcc - Stack Overflow

Why does gcc add extra vmovdqa64 instructions with _mm512_dpbusds_epi32?

g++ is generating extra move instructions and I'm not sure why. Godbolt link

This is happening around the _mm512_dpbusds_epi32 intrinsic. The instruction multiplies unsigned 8-bit elements by signed 8-bit elements, sums groups of four products into each packed 32-bit lane, and adds the result to the accumulator with signed saturation. The instruction is a bit unusual in that it both reads from and writes to the accumulator.
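For clarity, here is a scalar sketch of what one 32-bit lane of the instruction computes (my own model based on the description above, not code from the question):

```cpp
#include <cstdint>
#include <limits>

// Models one 32-bit lane of vpdpbusds: four unsigned8 x signed8 products
// are summed and added to the accumulator with signed 32-bit saturation.
std::int32_t dpbusds_lane(std::int32_t acc,
                          const std::uint8_t a[4],
                          const std::int8_t b[4]) {
    std::int64_t sum = acc;  // widen so the intermediate cannot overflow
    for (int j = 0; j < 4; ++j)
        sum += static_cast<std::int64_t>(a[j]) * b[j];
    if (sum > std::numeric_limits<std::int32_t>::max())
        return std::numeric_limits<std::int32_t>::max();
    if (sum < std::numeric_limits<std::int32_t>::min())
        return std::numeric_limits<std::int32_t>::min();
    return static_cast<std::int32_t>(sum);
}
```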

When compiled with gcc, the compiler is emitting extra move instructions (vmovdqa64) on the accumulator.

Here's a test program that accumulates some dot products:

#include <immintrin.h>
#include <cstddef>

__m512i Slow(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
    c0 = _mm512_dpbusds_epi32(c0, a[i], b0);
    c1 = _mm512_dpbusds_epi32(c1, a[i], b1);
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}

When compiled with g++ -O3 -mavx512vnni example.cc -S, this is the main loop:

.L3:
  vmovdqa64 (%rdi), %zmm6
  vmovdqa64 %zmm3, %zmm0
  vmovdqa64 %zmm4, %zmm2
  addq  $64, %rdi
  vpdpbusds %zmm5, %zmm6, %zmm0
  vpdpbusds %zmm1, %zmm6, %zmm2
  vmovdqa64 %zmm0, %zmm3
  vmovdqa64 %zmm2, %zmm4
  cmpq  %rdi, %rax
  jne .L3

The above assembly is copying an accumulator from zmm3 to zmm0, updating zmm0, and copying it back to zmm3. This is unnecessary; it should just use one of zmm0 or zmm3 as an accumulator.

The problem is the same on g++ (Gentoo 9.2.0-r2 p3) 9.2.0 and g++ (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0.

clang++ 9.0.1 avoids the unnecessary copies (it also unrolls the loop; shown here is the tightest version):

.LBB0_6:                                # =>This Inner Loop Header: Depth=1
  vmovaps (%rdi), %zmm4
  vpdpbusds %zmm0, %zmm4, %zmm3
  vpdpbusds %zmm1, %zmm4, %zmm2
  addq  $64, %rdi
  addq  $-1, %rax
  jne .LBB0_6

I was able to work around the problem in g++ by using inline asm.

#include <immintrin.h>
#include <cstddef>

__m512i Fast(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
    asm ("vpdpbusds %2, %1, %0" : "+x"(c0) : "x"(a[i]), "mx"(b0));
    asm ("vpdpbusds %2, %1, %0" : "+x"(c1) : "x"(a[i]), "mx"(b1));
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}
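The key is the "+x" constraint, which tells GCC the accumulator is a single read-write operand tied to one register, while "mx" lets the b operand come from memory or a register. A minimal scalar analogue of the read-write constraint (my own illustration, x86-64 only, using "+r" instead of "+x"):

```cpp
// Sketch of a read-modify-write inline asm operand: "+r" marks acc as
// both input and output, so GCC allocates one register for it rather
// than a separate input and output pair, just as "+x" does for c0/c1.
long rmw_add(long acc, long addend) {
    asm("addq %1, %0" : "+r"(acc) : "r"(addend));
    return acc;
}
```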

The loop g++ generates for Fast is much better:

.L3:
#APP
# 7 "asm.cc" 1
  vpdpbusds (%rdi), %zmm3, %zmm0
# 0 "" 2
# 8 "asm.cc" 1
  vpdpbusds (%rdi), %zmm1, %zmm2
# 0 "" 2
#NO_APP
  addq  $64, %rdi
  cmpq  %rax, %rdi
  jne .L3
