g++ is generating extra move instructions and I'm not sure why (Godbolt link).
This is happening around the _mm512_dpbusds_epi32 intrinsic. The instruction computes dot products of groups of four unsigned 8-bit integers with signed 8-bit integers, then adds the results to a packed 32-bit accumulator (in this case with a saturating add). The instruction is a bit unusual in that it both reads from and writes to the accumulator.
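For reference, here is a minimal scalar sketch of what each 32-bit lane of the instruction does (a hypothetical helper for illustration, not part of the test program below): four unsigned-by-signed byte products are summed and added to the accumulator with signed saturation.

#include <algorithm>
#include <cstdint>

// Hypothetical scalar model of one 32-bit lane of vpdpbusds:
// acc = saturate_s32(acc + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3])
int32_t DpbusdsLane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    int64_t sum = acc;
    for (int k = 0; k < 4; ++k) {
        sum += int64_t(a[k]) * int64_t(b[k]);  // unsigned byte times signed byte
    }
    // Signed saturation to the 32-bit range (the trailing "s" in dpbusds).
    sum = std::min<int64_t>(std::max<int64_t>(sum, INT32_MIN), INT32_MAX);
    return static_cast<int32_t>(sum);
}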
When compiled with gcc, the compiler is emitting extra move instructions (vmovdqa64) on the accumulator.
Here's a test program that accumulates some dot products:
#include <immintrin.h>
#include <cstddef>

__m512i Slow(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t count) {
    __m512i c0 = _mm512_setzero_epi32();
    __m512i c1 = _mm512_setzero_epi32();
    for (std::size_t i = 0; i < count; ++i) {
        c0 = _mm512_dpbusds_epi32(c0, a[i], b0);
        c1 = _mm512_dpbusds_epi32(c1, a[i], b1);
    }
    // Do not optimize away
    return _mm512_sub_epi32(c0, c1);
}
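A small driver along these lines can exercise Slow; the sizes and fill values are arbitrary and this main is not from the original test, but it shows the one real requirement: a must be 64-byte aligned, because the loop loads it with vmovdqa64. It assumes C++17 (for std::aligned_alloc) and is linked against the file containing Slow.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

__m512i Slow(const __m512i *a, __m512i b0, __m512i b1, std::size_t count);

int main() {
    const std::size_t count = 1024;
    // vmovdqa64 faults on unaligned addresses, so request 64-byte alignment.
    __m512i *a = static_cast<__m512i *>(std::aligned_alloc(64, count * sizeof(__m512i)));
    for (std::size_t i = 0; i < count; ++i) a[i] = _mm512_set1_epi8(1);
    __m512i result = Slow(a, _mm512_set1_epi8(2), _mm512_set1_epi8(3), count);
    alignas(64) int32_t lanes[16];
    _mm512_store_si512(reinterpret_cast<__m512i *>(lanes), result);
    std::printf("%d\n", lanes[0]);  // expect -4096: (4*1*2 - 4*1*3) * 1024 per lane
    std::free(a);
    return 0;
}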
When compiled with g++ -O3 -mavx512vnni example.cc -S, this is the main loop:
.L3:
        vmovdqa64 (%rdi), %zmm6
        vmovdqa64 %zmm3, %zmm0
        vmovdqa64 %zmm4, %zmm2
        addq $64, %rdi
        vpdpbusds %zmm5, %zmm6, %zmm0
        vpdpbusds %zmm1, %zmm6, %zmm2
        vmovdqa64 %zmm0, %zmm3
        vmovdqa64 %zmm2, %zmm4
        cmpq %rdi, %rax
        jne .L3
The above assembly is copying an accumulator from zmm3 to zmm0, updating zmm0, and copying it back to zmm3. This is unnecessary; it should just use one of zmm0 or zmm3 as an accumulator.
The problem is the same on g++ (Gentoo 9.2.0-r2 p3) 9.2.0 and g++ (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0.
clang++ 9.0.1 avoids the unnecessary copying (it also unrolled the loop, but here is the tightest version):
.LBB0_6:                              # =>This Inner Loop Header: Depth=1
        vmovaps (%rdi), %zmm4
        vpdpbusds %zmm0, %zmm4, %zmm3
        vpdpbusds %zmm1, %zmm4, %zmm2
        addq $64, %rdi
        addq $-1, %rax
        jne .LBB0_6
I was able to work around the problem in g++ by using inline asm.
#include <immintrin.h>
#include <cstddef>

__m512i Fast(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t count) {
    __m512i c0 = _mm512_setzero_epi32();
    __m512i c1 = _mm512_setzero_epi32();
    for (std::size_t i = 0; i < count; ++i) {
        asm ("vpdpbusds %2, %1, %0" : "+x"(c0) : "x"(a[i]), "mx"(b0));
        asm ("vpdpbusds %2, %1, %0" : "+x"(c1) : "x"(a[i]), "mx"(b1));
    }
    // Do not optimize away
    return _mm512_sub_epi32(c0, c1);
}
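The key is the "+x" constraint, which declares each accumulator as a single read-write register operand, so GCC has to update c0 and c1 in place instead of allocating separate input and output registers; "mx" allows the corresponding source to come from either a register or memory.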
The loop g++ generates for Fast is much better:
.L3:
#APP
# 7 "asm.cc" 1
        vpdpbusds (%rdi), %zmm3, %zmm0
# 0 "" 2
# 8 "asm.cc" 1
        vpdpbusds (%rdi), %zmm1, %zmm2
# 0 "" 2
#NO_APP
        addq $64, %rdi
        cmpq %rax, %rdi
        jne .L3