I would like to implement a two-thread model in which one thread counts (it increments a value forever) and the other thread reads the counter, does some work, reads the counter again, and measures the time elapsed between the two reads.
Here is what I have done so far:
// global counter
register unsigned long counter asm("r13");
// unsigned long counter;

void* counter_thread(){
    // affinity is set to some isolated CPU so the noise will be minimal
    while(1){
        //counter++; // Line 1*
        asm volatile("add $1, %0" : "+r"(counter) : ); // Line 2*
    }
}

void* measurement_thread(){
    // affinity is set somewhere over here
    unsigned long meas = 0;
    unsigned long a = 5;
    unsigned long r1,r2;
    sleep(1);
    while(1){
        mfence();
        r1 = counter;
        a *=3; // dummy operation that I want to measure
        r2 = counter;
        mfence();
        meas = r2-r1;
        printf("counter:%lu \n", counter); // %lu because counter is unsigned long
        break;
    }
}
Let me explain what I have done so far:
Since I want the counter to be accurate, I set its affinity to an isolated CPU. Also, if I use the plain increment in Line 1*, the disassembled function is:
d4c: 4c 89 e8 mov %r13,%rax
d4f: 48 83 c0 01 add $0x1,%rax
d53: 49 89 c5 mov %rax,%r13
d56: eb f4 jmp d4c <counter_thread+0x37>
This is not a single-cycle operation, which is why I used inline assembly to get rid of the two mov instructions. With the inline assembly (Line 2*), the loop becomes:
d4c: 49 83 c5 01 add $0x1,%r13
d50: eb fa jmp d4c <counter_thread+0x37>
The problem is that neither implementation works: the other thread never sees the counter being updated. If I make the global counter an ordinary variable instead of a register variable, it works, but I want the measurement to be precise. With the global declared as unsigned long counter, the disassembled code of the counter thread is:
d4c: 48 8b 05 ed 12 20 00 mov 0x2012ed(%rip),%rax # 202040 <counter>
d53: 48 83 c0 01 add $0x1,%rax
d57: 48 89 05 e2 12 20 00 mov %rax,0x2012e2(%rip) # 202040 <counter>
d5e: eb ec jmp d4c <counter_thread+0x37>
It works but it doesn't give me the granularity that I want.
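For reference, a minimal self-contained sketch of the memory-based variant, assuming C11 atomics and pthreads. The names `stop_counting` and `run_once` are hypothetical additions so that the sketch can terminate, `atomic_thread_fence` stands in for `mfence()`, and the counter thread keeps its count in a local variable and only publishes it to memory. This is an illustration of the setup, not the exact code I ran.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static _Atomic unsigned long counter;   /* shared counter, lives in memory */
static atomic_bool stop_counting;       /* hypothetical: lets the sketch terminate */

static void *counter_thread(void *arg)
{
    (void)arg;
    unsigned long local = 0;
    /* Single writer: count locally and publish each value with a relaxed store. */
    while (!atomic_load_explicit(&stop_counting, memory_order_relaxed)) {
        local++;
        atomic_store_explicit(&counter, local, memory_order_relaxed);
    }
    return NULL;
}

/* Reads the counter before and after the dummy operation; returns the difference. */
static unsigned long measure_once(void)
{
    volatile unsigned long a = 5;   /* volatile so the multiply is not optimized away */
    atomic_thread_fence(memory_order_seq_cst);   /* full fence, standing in for mfence() */
    unsigned long r1 = atomic_load_explicit(&counter, memory_order_relaxed);
    a *= 3;                         /* dummy operation being measured */
    unsigned long r2 = atomic_load_explicit(&counter, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return r2 - r1;
}

/* Starts the counter thread, takes one measurement, then stops and joins the thread.
   Returns 0 on success or the pthread error code. */
static int run_once(unsigned long *delta)
{
    pthread_t tid;
    atomic_store_explicit(&stop_counting, false, memory_order_relaxed);
    atomic_store_explicit(&counter, 0, memory_order_relaxed);
    int rc = pthread_create(&tid, NULL, counter_thread, NULL);
    if (rc != 0)
        return rc;
    while (atomic_load_explicit(&counter, memory_order_relaxed) == 0)
        ;   /* wait until the counter thread is actually running */
    *delta = measure_once();
    atomic_store_explicit(&stop_counting, true, memory_order_relaxed);
    return pthread_join(tid, NULL);
}
```

Calling `run_once(&delta)` from `main` and printing `delta` gives a single sample; it needs to be built with `-pthread`.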
EDIT:
My environment:
- CPU: AMD Ryzen 3600
- kernel: 5.0.0-32-generic
- OS: Ubuntu 18.04
EDIT2: I have isolated 2 neighboring CPU cores (i.e. cores 10 and 11) and I am running the experiment on those cores. The counter runs on one of the cores and the measurement on the other. The isolation is done by adding an isolcpus entry to the /etc/default/grub file.
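The pinning itself can be sketched roughly as follows, assuming Linux with glibc. The helper name `pin_current_thread_to_cpu` is hypothetical; it only wraps `sched_setaffinity`, which applies to the calling thread when the pid argument is 0.

```c
#define _GNU_SOURCE   /* must precede the includes: exposes CPU_SET and sched_setaffinity */
#include <errno.h>
#include <sched.h>

/* Pins the calling thread to a single CPU.
   Returns 0 on success or an errno value on failure. */
static int pin_current_thread_to_cpu(int cpu)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return EINVAL;              /* reject CPU numbers the mask cannot hold */

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* pid 0 means "the calling thread" on Linux. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return errno;
    return 0;
}
```

With the cores mentioned above, the counter thread would start with `pin_current_thread_to_cpu(10)` and the measurement thread with `pin_current_thread_to_cpu(11)`.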
EDIT3: I know that one measurement is not enough. I have run the experiment 10 million times and looked at the results.
Experiment1:
Setup:
unsigned long counter =0;//global counter

void* counter_thread(){
    mfence();
    while(1)
        counter++;
}

void* measurement_thread(){
    unsigned long i=0, r1=0,r2=0;
    unsigned int a=0;
    sleep(1);
    while(1){
        mfence();
        r1 = counter;
        a +=3;
        r2 = counter;
        mfence();
        measurements[r2-r1]++;
        i++;
        if(i == MILLION_ITER)
            break;
    }
}
Results1: In 99.99% of the iterations I got 0, which I expected, because either the first thread is not running or the OS or other interrupts disturb the measurement. After discarding the 0s and the very high values, the measurement averages 20 cycles. (I was expecting 3-4, because I only do an integer addition.)
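The filtering step can be sketched like this, assuming the histogram layout from the setup above, where `measurements[d]` counts how often the difference `r2-r1` was `d`. The function name `filtered_mean` and the `cutoff` parameter are hypothetical; the post does not state the exact threshold used for "very high" values.

```c
#include <stddef.h>

/* Mean of the recorded differences, ignoring bin 0 and every bin above `cutoff`.
   hist[d] holds how many times the difference r2 - r1 was equal to d. */
static double filtered_mean(const unsigned long *hist, size_t nbins, size_t cutoff)
{
    unsigned long long weighted = 0;   /* sum of d * hist[d] over the kept bins */
    unsigned long long samples = 0;    /* number of kept samples */

    for (size_t d = 1; d < nbins && d <= cutoff; d++) {
        weighted += (unsigned long long)d * hist[d];
        samples += hist[d];
    }
    /* With no samples left after filtering, report 0 instead of dividing by zero. */
    return samples ? (double)weighted / (double)samples : 0.0;
}
```

For example, `filtered_mean(measurements, nbins, 100)` would average only the differences between 1 and 100, where `nbins` is the size of the histogram array.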
Experiment2:
Setup: Identical to the above, with one difference: instead of a global counter in memory, I declare the counter as a register variable:
register unsigned long counter asm("r13");
Results2: The measurement thread always reads 0. In the disassembled code I can see that both threads use the R13 register (counter); however, I believe that it is somehow not shared between them.
Experiment3:
Setup: Identical to Setup 2, except that in the counter thread, instead of counter++, I use inline assembly to make sure the increment is a single-cycle operation. The disassembled loop looks like this:
cd1: 49 83 c5 01 add $0x1,%r13
cd5: eb fa jmp cd1 <counter_thread+0x37>
Results3: The measurement thread reads 0, as above.