I'm trying to execute a non-temporal load using the VMOVNTDQA instruction on a data array allocated with posix_memalign() (assume the memory is write-combining, achieved by changing the library), with the alignment set to 16B. However, I keep getting segfaults. uint64 and uint128 are typedefs of long long and __int128, respectively. Here's a code snippet:
uint64* arr;
posix_memalign((void**) &arr, 16, arr_size * sizeof(uint64));
uint128 b;
//index is a uint64 type and calculated earlier
asm volatile ("vmovntdqa %1, %0" : "=x" (b) : "m" (arr[index]));
//additional code working on b here, result stored back to arr[index]
The VMOVNTDQA spec says that the instruction has the form VMOVNTDQA xmm1, m128, and that addresses must be 128-bit (16-byte) aligned. Now the above code aligns the addresses to 16B. It works fine and does not give any segfaults if arr is of type uint128. However, I should be able to load a 128-bit value from an array of 64-bit elements as long as the address is aligned.
My question is: does the segfault occur because m128 only accepts __int128 type elements? Or is it an alignment issue? Or is there a problem with the asm syntax above?
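To help narrow down the alignment possibility, here is a minimal sketch of a check that could be run on the address of arr[index] right before the load. The helper names is_aligned_16 and report_alignment are hypothetical and not part of my original code. Since posix_memalign() only guarantees the alignment of the base pointer and each uint64 element is 8 bytes, I would expect this to report that only even indices are 16B-aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef long long uint64;  /* same typedef as in the snippet above */

/* Hypothetical helper: returns 1 if p is a multiple of 16 bytes, else 0. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: prints whether &arr[i] is 16B-aligned for i in [0, n). */
static void report_alignment(const uint64 *arr, size_t n)
{
    assert(arr != NULL);
    for (size_t i = 0; i < n; i++) {
        printf("index %zu: %s\n", i,
               is_aligned_16(&arr[i]) ? "16B-aligned" : "not 16B-aligned");
    }
}
```

Calling report_alignment(arr, arr_size) after the posix_memalign() call, or checking is_aligned_16(&arr[index]) just before the asm statement, would show whether the failing index lands on an 8-byte offset from the 16B boundary.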
Thanks