I'm trying to multiply vectors of uint32_t in gcc, producing the full 64-bit results in a uint64_t vector. I expect gcc to emit a single VPMULUDQ instruction. Instead, gcc shuffles the individual uint32_t elements of the source vectors around and then performs a full 64*64=64 multiplication. Here is what I've tried:
#include <stdint.h>
typedef uint32_t v8lu __attribute__ ((vector_size (32)));
typedef uint64_t v4llu __attribute__ ((vector_size (32)));
v4llu mul(v8lu x, v8lu y) {
    x[1] = 0; x[3] = 0; x[5] = 0; x[7] = 0;
    y[1] = 0; y[3] = 0; y[5] = 0; y[7] = 0;
    return (v4llu)x * (v4llu)y;
}
The first attempt masks out the unwanted parts of the uint32_t vectors, in the hope that gcc would optimize away the unneeded parts of the 64*64=64 multiplication and then see that the masking is pointless as well. No such luck.
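The same masking idea can also be written as a single vector AND instead of element-by-element stores. This is only a sketch: `mul_mask` and `lo32` are names made up for illustration, it assumes a little-endian target such as x86 (so the even uint32_t elements are the low halves of the 64-bit lanes), and whether gcc then picks VPMULUDQ will depend on the compiler version and flags.

```c
#include <stdint.h>

typedef uint32_t v8lu __attribute__ ((vector_size (32)));
typedef uint64_t v4llu __attribute__ ((vector_size (32)));

/* Hypothetical variant (not one of the original attempts):
   clear the high 32 bits of every 64-bit lane with one vector AND,
   then multiply the 64-bit lanes. */
v4llu mul_mask(v8lu x, v8lu y) {
    const v4llu lo32 = {0xffffffffu, 0xffffffffu, 0xffffffffu, 0xffffffffu};
    return ((v4llu)x & lo32) * ((v4llu)y & lo32);
}
```

The result is the same as in `mul` above; only the way the zero upper halves are expressed differs.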
v4llu mul2(v8lu x, v8lu y) {
    v4llu tx = {x[0], x[2], x[4], x[6]};
    v4llu ty = {y[0], y[2], y[4], y[6]};
    return tx * ty;
}
Here I try to build the uint64_t vectors from scratch with only the used parts set. Again, gcc should see that the top 32 bits of each uint64_t are 0 and not do a full 64*64=64 multiply. Instead, it extracts the values, puts them back, and still does a 64*64=64 multiply.
v4llu mul3(v8lu x, v8lu y) {
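    /* widen first: uint32_t * uint32_t alone would truncate to 32 bits */
    v4llu t = {(uint64_t)x[0] * y[0], (uint64_t)x[2] * y[2],
               (uint64_t)x[4] * y[4], (uint64_t)x[6] * y[6]};
    return t;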
}
Let's build the result vector by multiplying the parts. Maybe gcc sees that it can use VPMULUDQ to achieve exactly that. No luck, it falls back to 4 IMUL opcodes.
Is there a way to tell gcc what I want it to do (32*32=64 multiplication with everything perfectly placed)?
Note: Inline asm or the intrinsic isn't the answer. Writing the opcode by hand obviously works. I want gcc to understand the problem and produce the right solution.
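To pin down what "perfectly placed" means, here is a scalar sketch of the operation I'm after: each 64-bit result lane is the full product of the even-indexed uint32_t elements of the two inputs, which is what VPMULUDQ computes per lane. The helper name `ref_mul_even` and the plain-array interface are assumptions made for illustration only.

```c
#include <stdint.h>

/* Scalar model of the wanted operation; ref_mul_even is a hypothetical
   helper name. Inputs hold 8 uint32_t each, the output holds 4 uint64_t. */
void ref_mul_even(const uint32_t x[8], const uint32_t y[8], uint64_t out[4]) {
    for (int i = 0; i < 4; i++) {
        /* widen before multiplying so the product keeps all 64 bits */
        out[i] = (uint64_t)x[2 * i] * y[2 * i];
    }
}
```

Any vector version should produce the same four values as this loop, with the odd-indexed input elements ignored.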