Consider this C code and the assembly that GCC generates for it:
+ cat x.c
1 struct S
2 {
3 int a, b, c, d, e;
4 };
5 void foo(struct S *s)
6 {
7 s->a = 1;
8 s->b = 2;
9 s->c = 3;
10 s->d = 4;
11 s->e = 5;
12 }
Built with: gcc -O3 -S x.c (output trimmed of some assembler directives).
+ cat x.s
18 foo:
21 movdqa .LC0(%rip), %xmm0
22 movl $5, 16(%rdi)
23 movups %xmm0, (%rdi)
24 ret
28 .section .rodata.cst16,"aM",@progbits,16
29 .align 16
30 .LC0:
31 .long 1
32 .long 2
33 .long 3
34 .long 4
35 .ident "GCC: (GNU) 9.2.1 20190827 (Red Hat 9.2.1-1)"
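At line 21, a single instruction loads the 16 bytes containing the values for fields a through d from the constant at .LC0, that is, as data from .rodata rather than as immediates. Line 23 then writes those 16 bytes to the struct with one store, and line 22 stores e as an immediate.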
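To make the comparison concrete, here is a rough C sketch of the two strategies. The names lc0, foo_vector_like and foo_scalar are made up for illustration: lc0 stands in for the .LC0 constant, and the memcpy only approximates the movdqa/movups pair. A compiler is free to lower either function differently, so this shows the intent, not guaranteed code generation.

```c
#include <string.h>

struct S
{
    int a, b, c, d, e;
};

/* Hypothetical stand-in for the 16-byte .LC0 constant in .rodata. */
static const int lc0[4] = {1, 2, 3, 4};

/* Approximates the generated code: one 16-byte copy from constant
   data covering a..d, plus one scalar store of an immediate for e. */
void foo_vector_like(struct S *s)
{
    memcpy(s, lc0, sizeof lc0); /* like movdqa .LC0 + movups to (%rdi) */
    s->e = 5;                   /* like movl $5, 16(%rdi) */
}

/* The alternative asked about: every field stored as an immediate. */
void foo_scalar(struct S *s)
{
    s->a = 1;
    s->b = 2;
    s->c = 3;
    s->d = 4;
    s->e = 5;
}
```

Both functions leave the struct in the same state; the only difference is whether the values for a through d come from memory or from the instruction stream.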
It seems counterintuitive to me that this would typically perform better than four stores of immediate values. Wouldn't the load from .LC0 make a stall on a (data) cache miss more likely, given that immediates travel with the instruction stream?
(I believe that clang/LLVM also optimizes for x86 in this manner.)