Consider the following loop:
#include <immintrin.h>
#include <cstddef>

template <typename T>
void copytail(T* __restrict__ dest, const T* __restrict__ src, size_t count) {
    constexpr size_t chunk_size = 4 * 32;
    size_t byte_count = sizeof(T) * count;
    size_t chunks = byte_count / chunk_size;
    auto rest = byte_count - byte_count / chunk_size * chunk_size; // byte_count % chunk_size
    auto rest_vecs = (rest + 31) / 32; // tail size in 32-byte vectors, rounded up
    __m256i* dest256 = (__m256i*)((char *)dest + byte_count - rest_vecs * 32);
    __m256i* src256 = (__m256i*)((char *)src + byte_count - rest_vecs * 32);
    for (size_t j = 0; j < rest_vecs; j++) {
        _mm256_storeu_si256(dest256 + j, _mm256_loadu_si256(src256 + j));
    }
}

void tail_copy(char* d, const char* s, size_t overshoot) {
    copytail(d, s, overshoot);
}
Don't think too hard about what it does; it is a reduced test case based on a more complete function. Basically, it copies up to 4 AVX2 vectors from src to dest, aligned to the end of the regions.
For whatever reason¹, gcc 8.1 at -O3 produces this odd assembly:
tail_copy(char*, char const*, unsigned long):
mov rax, rdx
and eax, 127
add rax, 31
mov rcx, rax
and rcx, -32
sub rdx, rcx
shr rax, 5
je .L30
sal rax, 5
mov r8d, eax
add rdi, rdx
add rsi, rdx
test dil, 1
jne .L32
.L3:
test dil, 2
jne .L33
.L4:
test dil, 4
jne .L34
.L5:
mov ecx, r8d
shr ecx, 3
rep movsq # oh please no
xor eax, eax
test r8b, 4
jne .L35
test r8b, 2
jne .L36
# many more tail-handling cases follow
Basically, it uses a rep movsq to invoke microcode for the main copy, and then a bunch of tail-handling code for the odd bytes (most of it not shown; the full assembly can be seen on godbolt). This is an order of magnitude slower than vmovdqu loads/stores in my case.
And even if it were going to use rep movs, the CPU has ERMSB, so rep movsb could probably copy the exact number of bytes with no extra cleanup needed, about as efficiently as rep movsq. But the CPU does not have the "fast short rep" feature (introduced in Ice Lake), so the rep movs startup overhead is a big problem.
I'd like gcc to emit my copy loop more or less as written: at the very least, the 32-byte AVX2 loads and stores should appear as they do in the source. Importantly, I want the fix to be local to this function, i.e., without changing the compiler arguments.
¹ Probably it's memcpy recognition followed by memcpy inlining.