Channel: Active questions tagged gcc - Stack Overflow

How to generate non-temporal instructions?

Intel's compiler has a pragma that can be used to generate non-temporal stores. For example, I can write

    void square(const double* x, double* y, int n) {
    #pragma vector nontemporal
      for (int i=0; i<n; ++i) {
        y[i] = x[i] * x[i];
      }
    }

and ICC will generate instructions like this (compiler-explorer)

    ...
      vmovntpd %ymm1, (%rsi,%r9,8) #4.5
    ...

Do gcc and clang have anything similar? (other than intrinsics)
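For reference, the intrinsics route I would like to avoid looks roughly like the sketch below. It assumes an x86 target with SSE2 (it falls back to a plain loop otherwise), and `square_stream` is just a name I made up for illustration. `_mm_stream_pd` needs a 16-byte-aligned destination, so the loop peels scalar iterations until `y + i` is aligned.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2: _mm_loadu_pd, _mm_mul_pd, _mm_stream_pd, _mm_sfence
#endif

// Hypothetical helper (not part of any library): writes x[i]^2 into y[i],
// using streaming stores where SSE2 is available.
void square_stream(const double* x, double* y, int n) {
  int i = 0;
#if defined(__SSE2__)
  // Peel scalar iterations until the destination is 16-byte aligned.
  while (i < n && (reinterpret_cast<std::uintptr_t>(y + i) % 16) != 0) {
    y[i] = x[i] * x[i];
    ++i;
  }
  for (; i + 2 <= n; i += 2) {
    __m128d v = _mm_loadu_pd(x + i);         // unaligned load of two doubles
    _mm_stream_pd(y + i, _mm_mul_pd(v, v));  // non-temporal store of two doubles
  }
  _mm_sfence();  // order the streaming stores before any later stores
#endif
  for (; i < n; ++i) {  // remainder, or the whole loop when SSE2 is unavailable
    y[i] = x[i] * x[i];
  }
}
```

This works, but it ties the code to one vector width and needs the alignment handling by hand, which is why I am hoping for a pragma or attribute instead.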

The non-temporal store makes the code much faster. Using this benchmark

    #include <random>
    #include <memory>
    #include <benchmark/benchmark.h>

    static void generate_random_numbers(double* x, int n) {
      std::mt19937 rng{0};
      std::uniform_real_distribution<double> dist{-1, 1};
      for (int i=0; i<n; ++i) {
        x[i] = dist(rng);
      }
    }

    static void square(const double* x, double* y, int n) {
    #ifdef __INTEL_COMPILER
    #pragma vector nontemporal
    #endif
      for (int i=0; i<n; ++i) {
        y[i] = x[i] * x[i];
      }
    }

    static void BM_Square(benchmark::State& state) {
      const int n = state.range(0);
      std::unique_ptr<double[]> xptr{new double[n]};
      generate_random_numbers(xptr.get(), n);
      for (auto _ : state) {
        std::unique_ptr<double[]> yptr{new double[n]};
        square(xptr.get(), yptr.get(), n);
        benchmark::DoNotOptimize(yptr);
      }
    }

    BENCHMARK(BM_Square)->Arg(1000000);
    BENCHMARK_MAIN();

the non-temporal build from icc takes about 0.43 ms per iteration on my machine, versus about 0.68 ms for gcc and 0.78 ms for clang. Here are the full results:

icc:

    > icc -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
    > ./a.out
    ------------------------------------------------------------
    Benchmark                  Time             CPU   Iterations
    ------------------------------------------------------------
    BM_Square/1000000     430889 ns       430889 ns         1372

clang:

    > clang++ -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
    > ./a.out
    ------------------------------------------------------------
    Benchmark                  Time             CPU   Iterations
    ------------------------------------------------------------
    BM_Square/1000000     781672 ns       781470 ns          820

gcc:

    > g++-mp-10 -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
    > ./a.out
    ------------------------------------------------------------
    Benchmark                  Time             CPU   Iterations
    ------------------------------------------------------------
    BM_Square/1000000     681684 ns       681533 ns          782

Note: clang has __builtin_nontemporal_store, but when I try it, it does not generate non-temporal instructions (compiler-explorer)
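For concreteness, this is roughly how the builtin would be used in the loop. It is only a sketch: `square_nt` and the `SQUARE_HAS_NT_STORE` macro are names I made up, and the `__has_builtin` guard is there so the same file still compiles (with a plain store) on compilers that lack the builtin. Whether the store is actually lowered to a non-temporal instruction depends on the compiler version and flags, and I have not seen it happen.

```cpp
// Detect the Clang builtin; __has_builtin itself is missing on some older compilers.
#if defined(__has_builtin)
#  if __has_builtin(__builtin_nontemporal_store)
#    define SQUARE_HAS_NT_STORE 1  // hypothetical macro name, for illustration
#  endif
#endif

// Hypothetical variant of square() that requests non-temporal stores
// through the Clang builtin when it is available.
void square_nt(const double* x, double* y, int n) {
  for (int i = 0; i < n; ++i) {
#ifdef SQUARE_HAS_NT_STORE
    __builtin_nontemporal_store(x[i] * x[i], &y[i]);  // arguments: (value, address)
#else
    y[i] = x[i] * x[i];  // ordinary store on compilers without the builtin
#endif
  }
}
```

The result is the same as the plain loop either way; the only intended difference is the cache behaviour of the store.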

