Channel: Active questions tagged gcc - Stack Overflow

Can it really be more optimal to fetch 16 bytes as data with a single instruction than to do four immediate stores of 4 bytes each?

Consider this C code and the generated (by GCC) assembler code for it:

 + cat x.c
    struct S
    {
      int a, b, c, d, e;
    };

    void foo(struct S *s)
    {
      s->a = 1;
      s->b = 2;
      s->c = 3;
      s->d = 4;
      s->e = 5;
    }

Built with gcc -O3 -S x.c (output trimmed of some assembler directives):

+ cat x.s
18  foo:
21      movdqa  .LC0(%rip), %xmm0
22      movl    $5, 16(%rdi)
23      movups  %xmm0, (%rdi)
24      ret

28      .section    .rodata.cst16,"aM",@progbits,16
29      .align 16
30  .LC0:
31      .long   1
32      .long   2
33      .long   3
34      .long   4
35      .ident  "GCC: (GNU) 9.2.1 20190827 (Red Hat 9.2.1-1)"

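At line 21 of x.s, a single movdqa instruction loads the 16 bytes holding the values for fields a through d as data from the constant .LC0 in the read-only data section. Line 23 then writes all four fields with one 16-byte movups store, and only field e (line 22) is written with an immediate store.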
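To make the instruction sequence concrete, here is a rough C rendering of it using SSE2 intrinsics. This is only a sketch: the names foo_sse2 and lc0 are made up for illustration, the code is x86-only, and the compiler is free to pick different instructions than the ones named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86 only) */

struct S
{
  int a, b, c, d, e;
};

/* Hypothetical stand-in for .LC0: a 16-byte aligned read-only constant. */
static const int lc0[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };

void foo_sse2(struct S *s)
{
  /* Aligned 16-byte load of the constant (movdqa in the listing). */
  __m128i v = _mm_load_si128((const __m128i *)lc0);

  /* The fifth field is still written with an immediate store. */
  s->e = 5;

  /* Unaligned 16-byte store covering fields a through d (movups in the listing). */
  _mm_storeu_si128((__m128i *)s, v);
}
```

The unaligned store is used because nothing guarantees that a struct S pointer is 16-byte aligned, while the constant itself can be aligned by the compiler.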

It seems counterintuitive to me that this would typically perform better than four immediate store instructions, which carry their values in the instruction stream and need no data load at all. Wouldn't the load of .LC0 be more likely to stall on a data-cache miss?
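For comparison, one way to get the immediate-store form out of the compiler is to make the destination volatile, since volatile accesses may not be combined. The sketch below assumes that approach; foo_separate and foo_matches are hypothetical names, and the exact assembly will depend on the compiler version and flags.

```c
#include <string.h>

struct S
{
  int a, b, c, d, e;
};

/* Original version: the compiler may merge the first four stores. */
void foo(struct S *s)
{
  s->a = 1;
  s->b = 2;
  s->c = 3;
  s->d = 4;
  s->e = 5;
}

/* Variant: each volatile access must be performed separately, so every
   field is expected to get its own 4-byte store. */
void foo_separate(volatile struct S *s)
{
  s->a = 1;
  s->b = 2;
  s->c = 3;
  s->d = 4;
  s->e = 5;
}

/* Returns 1 if both versions leave identical bytes in the struct.
   struct S holds five ints and has no padding, so memcmp is safe here. */
int foo_matches(void)
{
  struct S x, y;
  memset(&x, 0, sizeof x);
  memset(&y, 0, sizeof y);
  foo(&x);
  foo_separate(&y);
  return memcmp(&x, &y, sizeof x) == 0;
}
```

Compiling both functions with gcc -O3 -S and timing them in a loop would give a direct comparison of the two code shapes on a particular machine.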

(I believe that clang/LLVM also optimizes for x86 in this manner.)

