Channel: Active questions tagged gcc - Stack Overflow

Explicitly telling GCC 9.2 to unswitch loop to allow auto-vectorization

I am working on a project that requires automatic vectorization of large loops, and compiling with GCC is mandatory. A minimal example of the problem is the following:

#define VLEN 4
#define NTHREADS 4
#define AVX512_ALIGNMENT 64
#define NUM_INTERNAL_ITERS 5
#define real double

typedef struct private_data {
    /*
     * Alloc enough space for private data and MEM_BLOCK_SIZE bytes of padding.
     * Private data must be allocated all at once to squeeze cache performance by only
     * padding once per CPU.
     */
    real *contiguous_data;

    /*
     * Pointers to corresponding index in contiguous_data.
     */
    real *array_1;
    real *array_2;
} private_data_t;

private_data_t private_data[NTHREADS];
int num_iter;

void minimum_case(const int thread) {
    // Reference to thread private data.
    real *restrict array_1 =
        __builtin_assume_aligned(private_data[thread].array_1, AVX512_ALIGNMENT);
    real *restrict array_2 =
        __builtin_assume_aligned(private_data[thread].array_2, AVX512_ALIGNMENT);

    for (int i = 0; i < num_iter; i++) {
        for (int k = 0; k < NUM_INTERNAL_ITERS; ++k) {
            int array_1_entry =
                (k * (NUM_INTERNAL_ITERS) * VLEN) +
                i * NUM_INTERNAL_ITERS * NUM_INTERNAL_ITERS * VLEN;
            int array_2_entry =
                (k * (NUM_INTERNAL_ITERS) * VLEN) +
                i * NUM_INTERNAL_ITERS * VLEN;

#pragma GCC unroll 1
#pragma GCC ivdep
            for (int j = 0; j < VLEN; j++) {
                real pivot;
                int a_idx = array_1_entry + VLEN * 0 + j;
                int b_idx = array_1_entry + VLEN * 1 + j;
                int c_idx = array_1_entry + VLEN * 2 + j;
                int d_idx = array_1_entry + VLEN * 3 + j;
                int S_idx = array_2_entry + VLEN * 0 + j;

                if (k == 0) {
                    pivot = array_1[a_idx];
                    // b = b / a
                    array_1[b_idx] /= pivot;
                    // c = c / a
                    array_1[c_idx] /= pivot;
                    // d = d / a
                    array_1[d_idx] /= pivot;
                    // S = S / a
                    array_2[S_idx] /= pivot;
                }

                int e_idx = array_1_entry + VLEN * 4 + j;
                int f_idx = array_1_entry + VLEN * 5 + j;
                int g_idx = array_1_entry + VLEN * 6 + j;
                int k_idx = array_1_entry + VLEN * 7 + j;
                int T_idx = array_2_entry + VLEN * 1 + j;

                pivot = array_1[e_idx];
                // f = f - (e * b)
                array_1[f_idx] -= array_1[b_idx] * pivot;
                // g = g - (e * c)
                array_1[g_idx] -= array_1[c_idx] * pivot;
                // k = k - (e * d)
                array_1[k_idx] -= array_1[d_idx] * pivot;
                // T = T - (e * S)
                array_2[T_idx] -= array_2[S_idx] * pivot;
            }
        }
    }
}

For this specific case, GCC auto-vectorizes with 16-byte vectors instead of 32-byte ones. It is fairly easy to see that the control flow depends on a condition (k == 0) that can be checked outside the inner loop, but GCC is not performing any loop unswitching.
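
To make the term concrete, here is a minimal sketch of loop unswitching on a much smaller loop. The function names and the scale_first flag are hypothetical and are not part of the project code. They only illustrate hoisting a loop-invariant test out of the loop so that each branch gets its own branch-free loop.

```c
#include <stddef.h>

/* Original form: the loop-invariant flag is tested on every iteration. */
void scale_then_sub(double *restrict dst, const double *restrict src,
                    size_t n, int scale_first)
{
    for (size_t j = 0; j < n; j++) {
        if (scale_first)          /* does not depend on j */
            dst[j] /= src[j];
        dst[j] -= src[j];
    }
}

/* Unswitched form: the test is hoisted, and each branch has its own loop. */
void scale_then_sub_unswitched(double *restrict dst, const double *restrict src,
                               size_t n, int scale_first)
{
    if (scale_first) {
        for (size_t j = 0; j < n; j++) {
            dst[j] /= src[j];
            dst[j] -= src[j];
        }
    } else {
        for (size_t j = 0; j < n; j++)
            dst[j] -= src[j];
    }
}
```

Both functions compute the same result. The second one is simply the shape that an unswitching pass would be expected to produce.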

The loop unswitching can be done manually, but please note that this is a minimal example of the problem: the real loop has hundreds of lines, and unswitching it by hand would result in a lot of code redundancy. I am trying to find a way to force GCC to create separate loops for the different conditions that can be checked outside the inner loop.
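
One possible way to unswitch by hand without writing the body twice is sketched below, on a reduced version of the loop. It is an assumption-laden illustration rather than a confirmed fix. The helper inner_body, the wrapper process_entry, and the first parameter are hypothetical names. The idea is to keep the body in a single force-inlined function and call it with a literal flag from each branch, so the compiler can fold the test away in each inlined copy. Whether GCC 9.2 then selects 32-byte vectors has not been verified here.

```c
#define VLEN 4

/* Hypothetical helper: the loop body is written once. It is force-inlined
 * and `first` is a literal at every call site, so the branch on `first`
 * can be resolved at compile time in each inlined copy. */
static inline __attribute__((always_inline)) void
inner_body(double *restrict a1, double *restrict a2,
           int entry_1, int entry_2, const int first)
{
#pragma GCC ivdep
    for (int j = 0; j < VLEN; j++) {
        int a_idx = entry_1 + VLEN * 0 + j;
        int b_idx = entry_1 + VLEN * 1 + j;
        int S_idx = entry_2 + VLEN * 0 + j;
        int e_idx = entry_1 + VLEN * 4 + j;
        int f_idx = entry_1 + VLEN * 5 + j;
        int T_idx = entry_2 + VLEN * 1 + j;

        if (first) {
            double pivot = a1[a_idx];
            a1[b_idx] /= pivot;   /* b = b / a */
            a2[S_idx] /= pivot;   /* S = S / a */
        }

        double pivot = a1[e_idx];
        a1[f_idx] -= a1[b_idx] * pivot;   /* f = f - (e * b) */
        a2[T_idx] -= a2[S_idx] * pivot;   /* T = T - (e * S) */
    }
}

/* Hypothetical wrapper: the k == 0 test is hoisted out of the j loop. */
void process_entry(double *restrict a1, double *restrict a2,
                   int entry_1, int entry_2, int k)
{
    if (k == 0)
        inner_body(a1, a2, entry_1, entry_2, 1);
    else
        inner_body(a1, a2, entry_1, entry_2, 0);
}
```

The trade-off is that the generated code is still duplicated, but the source is not, which is the redundancy that matters for a body of several hundred lines.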

Currently I am using GCC 9.2 with the following flags: -Ofast -march=native -std=c11 -fopenmp -ftree-vectorize -ffast-math -mavx -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -fopt-info-vec-optimized

