Channel: Active questions tagged gcc - Stack Overflow

solution to rdtsc out of order execution?


I am trying to replace clock_gettime(CLOCK_REALTIME, &ts) with rdtsc to benchmark code execution time in CPU cycles rather than wall-clock time. The execution time of the benchmarked code is critical for the software. I tried running the code on a 3.20 GHz x86_64 Ubuntu machine, pinned to an isolated core, and got the following numbers:

Case 1: clock_gettime: 24 ns

    void gettime(Timespec &ts) {
        clock_gettime(CLOCK_REALTIME, &ts);
    }

Case 2: rdtsc (without mfence or a compiler barrier): 10 ns

    void rdtsc(uint64_t &tsc) {
        unsigned int lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        tsc = ((uint64_t)hi << 32) | lo;
    }

Case 3: rdtsc (with mfence and a compiler barrier): 30 ns

    void rdtsc(uint64_t &tsc) {
        unsigned int lo, hi;
        __asm__ __volatile__ ("mfence;rdtsc" : "=a" (lo), "=d" (hi) :: "memory");
        tsc = ((uint64_t)hi << 32) | lo;
    }

The issue: I am aware that rdtsc is a non-serializing instruction, so the CPU can reorder it with respect to the code being measured. An alternative is rdtscp, which waits for all prior instructions to complete, but instructions after the rdtscp can still be reordered to execute before it. Using a memory barrier increases the measured execution time.
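Not an authoritative answer, but the fencing pattern usually suggested for this is lfence + rdtsc at the start and rdtscp + lfence at the end: on Intel CPUs lfence waits for all prior instructions to complete without draining the store buffer, so it is typically cheaper than mfence. A minimal sketch using the GCC intrinsics (the helper names are mine, not from any library):

```cpp
#include <cstdint>
#include <x86intrin.h>  // __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86)

// Start of the timed region: the first lfence keeps earlier work from
// leaking into the measurement; the second keeps the timed code from
// being hoisted above the rdtsc.
static inline uint64_t tsc_start() {
    _mm_lfence();
    uint64_t tsc = __rdtsc();
    _mm_lfence();
    return tsc;
}

// End of the timed region: rdtscp waits for all prior instructions to
// retire, and the trailing lfence stops later instructions from being
// reordered before it.
static inline uint64_t tsc_stop() {
    unsigned int aux;               // receives IA32_TSC_AUX (core id)
    uint64_t tsc = __rdtscp(&aux);
    _mm_lfence();
    return tsc;
}
```

Usage would be `uint64_t t0 = tsc_start(); /* code under test */ uint64_t t1 = tsc_stop();` and the elapsed cycle count is `t1 - t0`. The asm statements behind these intrinsics also act as compiler barriers for the outputs, but note this assumes an invariant-TSC Intel part; on AMD, lfence is only serializing when the appropriate MSR bit is set.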

  • What is the most optimized and reliable way to benchmark latency-sensitive code?
  • Is there any way to optimize the cases mentioned above?

