Intel's compiler has a pragma that can be used to generate non-temporal stores. For example, I can write
    void square(const double* x, double* y, int n) {
    #pragma vector nontemporal
      for (int i=0; i<n; ++i) {
        y[i] = x[i] * x[i];
      }
    }
and ICC will generate instructions like this (compiler-explorer)
    ...
    vmovntpd  %ymm1, (%rsi,%r9,8)    #4.5
    ...
Do gcc and clang have anything similar (other than intrinsics)?
The non-temporal store makes the code much faster. Using this benchmark
    #include <random>
    #include <memory>
    #include <benchmark/benchmark.h>

    static void generate_random_numbers(double* x, int n) {
      std::mt19937 rng{0};
      std::uniform_real_distribution<double> dist{-1, 1};
      for (int i=0; i<n; ++i) {
        x[i] = dist(rng);
      }
    }

    static void square(const double* x, double* y, int n) {
    #ifdef __INTEL_COMPILER
    #pragma vector nontemporal
    #endif
      for (int i=0; i<n; ++i) {
        y[i] = x[i] * x[i];
      }
    }

    static void BM_Square(benchmark::State& state) {
      const int n = state.range(0);
      std::unique_ptr<double[]> xptr{new double[n]};
      generate_random_numbers(xptr.get(), n);
      for (auto _ : state) {
        std::unique_ptr<double[]> yptr{new double[n]};
        square(xptr.get(), yptr.get(), n);
        benchmark::DoNotOptimize(yptr);
      }
    }
    BENCHMARK(BM_Square)->Arg(1000000);
    BENCHMARK_MAIN();
the non-temporal code runs almost twice as fast on my machine (about 1.8x faster than the clang build and about 1.6x faster than the gcc build). Here are the full results:
icc:
    > icc -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
    > ./a.out
    ------------------------------------------------------------
    Benchmark                  Time             CPU   Iterations
    ------------------------------------------------------------
    BM_Square/1000000     430889 ns       430889 ns         1372
clang:
    > clang++ -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
    > ./a.out
    ------------------------------------------------------------
    Benchmark                  Time             CPU   Iterations
    ------------------------------------------------------------
    BM_Square/1000000     781672 ns       781470 ns          820
gcc:
    > g++-mp-10 -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
    > ./a.out
    ------------------------------------------------------------
    Benchmark                  Time             CPU   Iterations
    ------------------------------------------------------------
    BM_Square/1000000     681684 ns       681533 ns          782
Note: clang has `__builtin_nontemporal_store`, but when I try it, it doesn't generate non-temporal instructions (compiler-explorer).