Both clang (10) and gcc (9.3) suffer from this. I have a few global variables. I iterate through data_in and write to both data_out and misc_data (the latter much less often). With the globals declared thread_local, my runtime goes from 70 ms to 300+ ms.
Diffing the asm, I see %fs: and data@TPOFF everywhere. What are they, how does TLS work, and why is this so slow? I assumed TLS was virtual memory mapped differently for each thread, so I guess I assumed wrong and there's more to it?
thread_local u8 data_in[1024*1024*100] __attribute__ ((aligned(16)));
thread_local u16 data_out[sizeof(data_in)] __attribute__ ((aligned(16)));
thread_local u8 misc_data[sizeof(data_in)] __attribute__ ((aligned(16)));
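For reference, here is a minimal sketch of the access pattern described above. It is not the real code. The array size `kSize`, the helper `fill_input`, the doubling, and the `0xFF` condition are all made up for illustration, and the arrays are far smaller than the 100 MiB ones. The second function takes plain pointers to the thread_local arrays once before the loop, which may be a useful variant to time against the first. Whether it changes anything will depend on the compiler and the TLS model in use.

```cpp
#include <cstddef>
#include <cstdint>

using u8  = std::uint8_t;
using u16 = std::uint16_t;

// Hypothetical size, much smaller than the arrays in the question.
constexpr std::size_t kSize = 1024;

thread_local u8  data_in[kSize]   __attribute__((aligned(16)));
thread_local u16 data_out[kSize]  __attribute__((aligned(16)));
thread_local u8  misc_data[kSize] __attribute__((aligned(16)));

// Hypothetical helper: fill the input with a repeating 0..255 pattern.
void fill_input() {
    for (std::size_t i = 0; i < kSize; ++i) {
        data_in[i] = static_cast<u8>(i & 0xFF);
    }
}

// Variant 1: every access names the thread_local arrays directly.
// Returns how many bytes were written to misc_data.
std::size_t process_direct() {
    std::size_t misc_count = 0;
    for (std::size_t i = 0; i < kSize; ++i) {
        data_out[i] = static_cast<u16>(data_in[i] * 2);  // placeholder work
        if (data_in[i] == 0xFF) {                        // infrequent side write
            misc_data[misc_count++] = data_in[i];
        }
    }
    return misc_count;
}

// Variant 2: resolve the thread_local addresses once, then loop over
// ordinary pointers. Same result as process_direct().
std::size_t process_hoisted() {
    const u8* in   = data_in;
    u16*      out  = data_out;
    u8*       misc = misc_data;

    std::size_t misc_count = 0;
    for (std::size_t i = 0; i < kSize; ++i) {
        out[i] = static_cast<u16>(in[i] * 2);
        if (in[i] == 0xFF) {
            misc[misc_count++] = in[i];
        }
    }
    return misc_count;
}
```

Compiling both functions with the same flags and diffing the asm for each loop body should show whether the %fs: and @TPOFF references stay inside the loop or are computed once outside it.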