I am working on a project that relies on automatic vectorization of large loops, and compiling with GCC is mandatory. A minimal case that reproduces the problem is the following:
#define VLEN 4
#define NTHREADS 4
#define AVX512_ALIGNMENT 64
#define NUM_INTERNAL_ITERS 5
#define real double

typedef struct private_data {
    /*
     * Alloc enough space for private data and MEM_BLOCK_SIZE bytes of padding.
     * Private data must be allocated all at once to squeeze cache performance by only
     * padding once per CPU.
     */
    real *contiguous_data;
    /*
     * Pointers to corresponding index in contiguous_data.
     */
    real *array_1;
    real *array_2;
} private_data_t;

private_data_t private_data[NTHREADS];
int num_iter;

void minimum_case(const int thread) {
    // Reference to thread private data.
    real *restrict array_1 = __builtin_assume_aligned(private_data[thread].array_1, AVX512_ALIGNMENT);
    real *restrict array_2 = __builtin_assume_aligned(private_data[thread].array_2, AVX512_ALIGNMENT);

    for (int i = 0; i < num_iter; i++) {
        for (int k = 0; k < NUM_INTERNAL_ITERS; ++k) {
            int array_1_entry = (k * (NUM_INTERNAL_ITERS) * VLEN) + i * NUM_INTERNAL_ITERS * NUM_INTERNAL_ITERS * VLEN;
            int array_2_entry = (k * (NUM_INTERNAL_ITERS) * VLEN) + i * NUM_INTERNAL_ITERS * VLEN;

#pragma GCC unroll 1
#pragma GCC ivdep
            for (int j = 0; j < VLEN; j++) {
                real pivot;

                int a_idx = array_1_entry + VLEN * 0 + j;
                int b_idx = array_1_entry + VLEN * 1 + j;
                int c_idx = array_1_entry + VLEN * 2 + j;
                int d_idx = array_1_entry + VLEN * 3 + j;
                int S_idx = array_2_entry + VLEN * 0 + j;

                if (k == 0) {
                    pivot = array_1[a_idx];
                    // b = b / a
                    array_1[b_idx] /= pivot;
                    // c = c / a
                    array_1[c_idx] /= pivot;
                    // d = d / a
                    array_1[d_idx] /= pivot;
                    // S = S / a
                    array_2[S_idx] /= pivot;
                }

                int e_idx = array_1_entry + VLEN * 4 + j;
                int f_idx = array_1_entry + VLEN * 5 + j;
                int g_idx = array_1_entry + VLEN * 6 + j;
                int k_idx = array_1_entry + VLEN * 7 + j;
                int T_idx = array_2_entry + VLEN * 1 + j;

                pivot = array_1[e_idx];
                // f = f - (e * b)
                array_1[f_idx] -= array_1[b_idx] * pivot;
                // g = g - (e * c)
                array_1[g_idx] -= array_1[c_idx] * pivot;
                // k = k - (e * d)
                array_1[k_idx] -= array_1[d_idx] * pivot;
                // T = T - (e * S)
                array_2[T_idx] -= array_2[S_idx] * pivot;
            }
        }
    }
}
For this specific case, GCC uses 16-byte vectors instead of 32-byte ones when vectorizing automatically. It is fairly easy to see that the control flow depends on a condition that can be checked outside the innermost loop, yet GCC does not perform any loop unswitching.
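By loop unswitching I mean hoisting a loop-invariant test out of the loop and emitting one specialized loop per branch. A toy example of the transformation (the function and variable names here are made up purely for illustration, not part of my code):

/* Before unswitching: the loop-invariant test sits inside the loop. */
void scale_or_copy(double *dst, const double *src, int n, int flag) {
    for (int j = 0; j < n; j++) {
        if (flag)
            dst[j] = 2.0 * src[j];
        else
            dst[j] = src[j];
    }
}

/* After unswitching: the test is hoisted and each specialized loop has a
 * branch-free body that the vectorizer can handle with full-width vectors. */
void scale_or_copy_unswitched(double *dst, const double *src, int n, int flag) {
    if (flag) {
        for (int j = 0; j < n; j++)
            dst[j] = 2.0 * src[j];
    } else {
        for (int j = 0; j < n; j++)
            dst[j] = src[j];
    }
}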
The unswitching can be done manually, but note that this is a minimal case of the problem: the real loop has hundreds of lines, and unswitching it by hand would result in a lot of code redundancy. I am trying to find a way to force GCC to create different loops for the different conditions that can be checked outside the innermost loop; a sketch of the shape I am after is below.
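To make that concrete, this is roughly the structure I would like GCC to generate on its own. It is written here with an always_inline helper so the body only appears once in the source and reuses the defines and globals from the snippet above; inner_body, do_pivot_division and minimum_case_unswitched are just illustrative names, and I have not verified that this formulation actually makes GCC choose wider vectors (for the real code the body would still have to be factored out by hand).

/* The inner-loop body is written once; the constant last argument is meant to
 * let GCC specialize each inlined copy after constant propagation. */
static inline __attribute__((always_inline))
void inner_body(real *restrict array_1, real *restrict array_2,
                const int array_1_entry, const int array_2_entry,
                const int do_pivot_division) {
#pragma GCC unroll 1
#pragma GCC ivdep
    for (int j = 0; j < VLEN; j++) {
        real pivot;

        int b_idx = array_1_entry + VLEN * 1 + j;
        int c_idx = array_1_entry + VLEN * 2 + j;
        int d_idx = array_1_entry + VLEN * 3 + j;
        int S_idx = array_2_entry + VLEN * 0 + j;

        if (do_pivot_division) {
            int a_idx = array_1_entry + VLEN * 0 + j;
            pivot = array_1[a_idx];
            array_1[b_idx] /= pivot;   // b = b / a
            array_1[c_idx] /= pivot;   // c = c / a
            array_1[d_idx] /= pivot;   // d = d / a
            array_2[S_idx] /= pivot;   // S = S / a
        }

        int e_idx = array_1_entry + VLEN * 4 + j;
        int f_idx = array_1_entry + VLEN * 5 + j;
        int g_idx = array_1_entry + VLEN * 6 + j;
        int k_idx = array_1_entry + VLEN * 7 + j;
        int T_idx = array_2_entry + VLEN * 1 + j;

        pivot = array_1[e_idx];
        array_1[f_idx] -= array_1[b_idx] * pivot;   // f = f - (e * b)
        array_1[g_idx] -= array_1[c_idx] * pivot;   // g = g - (e * c)
        array_1[k_idx] -= array_1[d_idx] * pivot;   // k = k - (e * d)
        array_2[T_idx] -= array_2[S_idx] * pivot;   // T = T - (e * S)
    }
}

void minimum_case_unswitched(const int thread) {
    real *restrict array_1 = __builtin_assume_aligned(private_data[thread].array_1, AVX512_ALIGNMENT);
    real *restrict array_2 = __builtin_assume_aligned(private_data[thread].array_2, AVX512_ALIGNMENT);

    for (int i = 0; i < num_iter; i++) {
        for (int k = 0; k < NUM_INTERNAL_ITERS; ++k) {
            int array_1_entry = (k * (NUM_INTERNAL_ITERS) * VLEN) + i * NUM_INTERNAL_ITERS * NUM_INTERNAL_ITERS * VLEN;
            int array_2_entry = (k * (NUM_INTERNAL_ITERS) * VLEN) + i * NUM_INTERNAL_ITERS * VLEN;

            /* The invariant test is now evaluated once per k iteration, and each
             * call site passes a compile-time constant flag. */
            if (k == 0)
                inner_body(array_1, array_2, array_1_entry, array_2_entry, 1);
            else
                inner_body(array_1, array_2, array_1_entry, array_2_entry, 0);
        }
    }
}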
Currently I am using GCC 9.2 with the following flags:

-Ofast -march=native -std=c11 -fopenmp -ftree-vectorize -ffast-math -mavx -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -fopt-info-vec-optimized
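For completeness, the full compile invocation looks like this (the source file name is just a placeholder for the minimal case above):

gcc -Ofast -march=native -std=c11 -fopenmp -ftree-vectorize -ffast-math -mavx \
    -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store \
    -fopt-info-vec-optimized -c minimum_case.c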