Single multiplication makes code ~10x slower in fastmod implementation

Here is a fastmod wrapper and two operators I've implemented for using it:

class fastmod
{
public:
  [[using gnu: cold]] fastmod(uint64_t denominator) : denominator_(denominator) 
  { 
    M_ = static_cast<__uint128_t>(-1) / denominator + 1; 
  }

private:
  friend uint64_t operator/(uint64_t, fastmod const&);
  friend uint64_t operator%(uint64_t, fastmod const&);

private:
  uint64_t denominator_;
  __uint128_t M_;
};

[[using gnu: hot, always_inline]] inline uint64_t operator/(uint64_t numerator, fastmod const& divisor)
{
  return ((divisor.M_ & 0xFFFFFFFFFFFFFFFFULL) * numerator >> 64ULL) + ((divisor.M_ >> 64ULL) * numerator >> 64ULL);
}

[[using gnu: hot, always_inline]] inline uint64_t operator%(uint64_t numerator, fastmod const& divisor)
{
  __uint128_t magic = divisor.M_ * numerator;
  return ((magic & 0xFFFFFFFFFFFFFFFFULL) * divisor.denominator_ >> 64ULL) + ((magic >> 64ULL) * divisor.denominator_ >> 64ULL);
}
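
For comparison, here is a minimal reference sketch of the same idea that can be checked against the built-in `/` and `%`. The names `fastmod_ref`, `mul128_hi`, `ref_div` and `ref_mod` are hypothetical and not part of the code above. It assumes a compiler that provides `__uint128_t` (GCC or Clang) and a divisor of at least 2. Note that it adds the two partial products before the final shift, whereas the operators above shift each partial product separately, so the two versions are not equivalent.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference type (not the class from the question):
// precomputes M = ceil(2^128 / d) for a fixed divisor d.
struct fastmod_ref {
  std::uint64_t d;
  __uint128_t M;

  explicit fastmod_ref(std::uint64_t divisor) : d(divisor) {
    assert(divisor >= 2);  // for d == 1 the magic constant would wrap to 0
    M = ~static_cast<__uint128_t>(0) / divisor + 1;
  }
};

// floor(x * y / 2^128): the top 64 bits of the 192-bit product.
inline std::uint64_t mul128_hi(__uint128_t x, std::uint64_t y) {
  __uint128_t low = (x & 0xFFFFFFFFFFFFFFFFULL) * y;  // low limb times y
  __uint128_t high = (x >> 64) * y;                   // high limb times y
  // Carry the upper half of the low product into the high product,
  // and only then take the top 64 bits of the sum.
  return static_cast<std::uint64_t>(((low >> 64) + high) >> 64);
}

// Quotient n / d via one 128x64 high multiply.
inline std::uint64_t ref_div(std::uint64_t n, const fastmod_ref& f) {
  return mul128_hi(f.M, n);
}

// Remainder n % d: the product M * n wraps modulo 2^128 on purpose,
// and the wrapped value is then scaled by d.
inline std::uint64_t ref_mod(std::uint64_t n, const fastmod_ref& f) {
  __uint128_t lowbits = f.M * n;
  return mul128_hi(lowbits, f.d);
}
```

A check such as `ref_mod(n, f) == n % d` over a range of inputs is a cheap way to confirm that a variant being benchmarked still computes the intended result.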

Benchmarking this against the built-in integer division and modulo, I see the following results (compiled with GCC 9.1, -O3 -std=c++17 -c):

-----------------------------------------------------------------
Benchmark              Time             CPU   Iterations
-----------------------------------------------------------------
div_bench           6.19 ns         6.17 ns    113415439
mod_bench           6.18 ns         6.17 ns    113532739
fastdiv_bench       2.06 ns         2.06 ns    340581374
fastmod_bench       25.8 ns         25.7 ns     27230411

With the benchmarking code being

std::mt19937_64 mt;
fastmod denominator = mt() % (1ULL << 27);

for(auto _ : state)
{
  auto numerator = mt();
  benchmark::DoNotOptimize(numerator % denominator);
}

for fastmod_bench and

std::mt19937_64 mt;
fastmod denominator = mt() % (1ULL << 27);

for(auto _ : state)
{
  auto numerator = mt();
  benchmark::DoNotOptimize(numerator / denominator);
}

for fastdiv_bench.

The fastmod_bench result is not what I expected. I strongly suspect that this is due to the line

__uint128_t magic = divisor.M_ * numerator;

since if I remove this and replace magic with just numerator, the new result becomes

fastmod_bench       1.70 ns         1.70 ns    412247047

which is over 10x faster than with the multiplication.

Why does adding this one multiplication make the code run ~4x slower than even regular integer division? I expected fastmod_bench to land in the 2 ns range, alongside fastdiv_bench.
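
To make the cost of the suspected line concrete, the sketch below writes out the 128-bit by 64-bit product, truncated to 128 bits, in 64-bit limbs. The helper name `mul_128x64_low` is hypothetical. Whether GCC emits exactly this sequence for `divisor.M_ * numerator` is an assumption; the generated assembly is the authority. Arithmetically, though, the product needs only one widening multiply, one ordinary 64-bit multiply and an add.

```cpp
#include <cstdint>

// Low 128 bits of m * n, spelled out in 64-bit limbs (hypothetical helper).
inline __uint128_t mul_128x64_low(__uint128_t m, std::uint64_t n) {
  const std::uint64_t m_lo = static_cast<std::uint64_t>(m);
  const std::uint64_t m_hi = static_cast<std::uint64_t>(m >> 64);
  // Widening 64x64 -> 128 multiply for the low limb.
  const __uint128_t lo_full = static_cast<__uint128_t>(m_lo) * n;
  // Only the low 64 bits of m_hi * n survive truncation to 128 bits.
  const std::uint64_t hi_part = m_hi * n;
  // The sum wraps modulo 2^128, matching the native 128-bit multiply.
  return lo_full + (static_cast<__uint128_t>(hi_part) << 64);
}
```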

I have this loaded into godbolt here: https://godbolt.org/z/4fcAQA, but I don't see anything in the output that readily explains why the multiplication makes performance collapse.


Update: it looks like this was a problem with my benchmark; the random number generation appears to interact with the modulus computation somehow. Here's a new benchmark, using the same compiler and options:

std::mt19937_64 mt;
fastmod denominator = mt() % (1ULL << 27);
uint64_t numerator{10001107};
constexpr uint64_t transform{10005731};

for(auto _ : state)
{
  numerator *= transform;
  benchmark::DoNotOptimize(numerator % denominator);
}

(Similarly for division.)

I get the following results now:

-----------------------------------------------------------------
Benchmark              Time             CPU   Iterations
-----------------------------------------------------------------
div_bench           5.22 ns         5.20 ns    134608758
mod_bench           5.22 ns         5.20 ns    134627064
fastmod_bench       1.30 ns         1.30 ns    540373271
fastdiv_bench       1.30 ns         1.30 ns    540333224
