How to generate non-temporal instructions?

Intel's compiler has a pragma that can be used to generate non-temporal stores. For example, I can write

    void square(const double* x, double* y, int n) {
    #pragma vector nontemporal
      for (int i=0; i<n; ++i) {
        y[i] = x[i] * x[i];
      }
    }

and ICC will generate instructions like this (compiler-explorer):

    ...
      vmovntpd %ymm1, (%rsi,%r9,8) #4.5
    ...

Do gcc and clang have anything similar (other than intrinsics)?

The non-temporal store makes the code much faster. Using this benchmark

    #include <random>
    #include <memory>
    #include <benchmark/benchmark.h>

    static void generate_random_numbers(double* x, int n) {
      std::mt19937 rng{0};
      std::uniform_real_distribution<double> dist{-1, 1};
      for (int i=0; i<n; ++i) {
        x[i] = dist(rng);
      }
    }

    static void square(const double* x, double* y, int n) {
    #ifdef __INTEL_COMPILER
    #pragma vector nontemporal
    #endif
      for (int i=0; i<n; ++i) {
        y[i] = x[i] * x[i];
      }
    }

    static void BM_Square(benchmark::State& state) {
      const int n = state.range(0);
      std::unique_ptr<double[]> xptr{new double[n]};
      generate_random_numbers(xptr.get(), n);
      for (auto _ : state) {
        std::unique_ptr<double[]> yptr{new double[n]};
        square(xptr.get(), yptr.get(), n);
        benchmark::DoNotOptimize(yptr);
      }
    }

    BENCHMARK(BM_Square)->Arg(1000000);
    BENCHMARK_MAIN();

the non-temporal code runs almost twice as fast on my machine (roughly 1.8x faster than the clang build and 1.6x faster than the gcc build). Here are the full results:

icc:

    > icc -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
    > ./a.out
    ------------------------------------------------------------
    Benchmark                  Time             CPU   Iterations
    ------------------------------------------------------------
    BM_Square/1000000     430889 ns       430889 ns         1372

clang:

    > clang++ -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
    > ./a.out
    ------------------------------------------------------------
    Benchmark                  Time             CPU   Iterations
    ------------------------------------------------------------
    BM_Square/1000000     781672 ns       781470 ns          820

gcc:

    > g++-mp-10 -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
    > ./a.out
    ------------------------------------------------------------
    Benchmark                  Time             CPU   Iterations
    ------------------------------------------------------------
    BM_Square/1000000     681684 ns       681533 ns          782

Note: clang has __builtin_nontemporal_store, but when I try it, it doesn't generate non-temporal instructions (compiler-explorer).
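For reference, a minimal sketch of how the clang builtin can be called in this loop. The function name square_nt is made up for illustration, and the __has_builtin guard is an assumption added so the same file still builds with compilers that lack the builtin (they fall back to an ordinary store). The builtin only attaches a non-temporal hint to the store; whether the backend then emits something like vmovntpd should be checked in the generated assembly rather than assumed.

```cpp
// Compilers without __has_builtin: treat every builtin as unavailable.
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Hypothetical variant of square() that requests non-temporal stores
// through clang's builtin when it is available.
static void square_nt(const double* x, double* y, int n) {
  for (int i = 0; i < n; ++i) {
#if __has_builtin(__builtin_nontemporal_store)
    // Arguments are (value, pointer): store x[i]*x[i] to y[i] with a
    // non-temporal hint. Instruction selection is up to the backend.
    __builtin_nontemporal_store(x[i] * x[i], &y[i]);
#else
    // Fallback: ordinary store, same result, no hint.
    y[i] = x[i] * x[i];
#endif
  }
}
```

Either branch writes the same values to y, so the function can be dropped into the benchmark above in place of square() to compare the generated code and timings.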
