Single multiplication makes code ~10x slower in fastmod implementation

Here is a fastmod wrapper and two operators I've implemented for using it:

class fastmod
{
public:
  [[using gnu: cold]] fastmod(uint64_t denominator) : denominator_(denominator) 
  { 
    M_ = static_cast<__uint128_t>(-1) / denominator + 1; 
  }

private:
  friend uint64_t operator/(uint64_t, fastmod const&);
  friend uint64_t operator%(uint64_t, fastmod const&);

private:
  uint64_t denominator_;
  __uint128_t M_;
};

[[using gnu: hot, always_inline]] inline uint64_t operator/(uint64_t numerator, fastmod const& divisor)
{
  return ((divisor.M_ & 0xFFFFFFFFFFFFFFFFULL) * numerator >> 64ULL) + ((divisor.M_ >> 64ULL) * numerator >> 64ULL);
}

[[using gnu: hot, always_inline]] inline uint64_t operator%(uint64_t numerator, fastmod const& divisor)
{
  __uint128_t magic = divisor.M_ * numerator;
  return ((magic & 0xFFFFFFFFFFFFFFFFULL) * divisor.denominator_ >> 64ULL) + ((magic >> 64ULL) * divisor.denominator_ >> 64ULL);
}
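For comparison, here is a minimal sketch of the usual 128-bit "multiply and keep the upper 64 bits" step that this kind of fastmod scheme relies on. The names `mul128_u64`, `compute_M`, `fast_div` and `fast_mod` are hypothetical helpers, not part of the class above, and the sketch assumes a compiler with `__uint128_t` (gcc or clang) and a denominator greater than 1. Note that it adds the carry of the low partial product to the high partial product and shifts once, whereas the operators above shift each partial product separately before adding, so the two can return different values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in the original post): returns
// floor(lowbits * d / 2^128), the upper 64 bits of the 192-bit product.
inline uint64_t mul128_u64(__uint128_t lowbits, uint64_t d)
{
  __uint128_t bottom = (lowbits & 0xFFFFFFFFFFFFFFFFULL) * d; // low 64 bits times d
  bottom >>= 64;                                               // keep only the carry into the upper part
  __uint128_t top = (lowbits >> 64) * d;                       // high 64 bits times d
  return static_cast<uint64_t>((bottom + top) >> 64);          // add first, then shift once
}

// Hypothetical helper: same magic constant as the constructor above.
// Assumes d > 1, because for d == 1 the constant wraps around to 0.
inline __uint128_t compute_M(uint64_t d)
{
  assert(d > 1);
  return static_cast<__uint128_t>(-1) / d + 1;
}

// Quotient: upper 64 bits of M * n.
inline uint64_t fast_div(uint64_t n, __uint128_t M)
{
  return mul128_u64(M, n);
}

// Remainder: take M * n modulo 2^128 (the wrap-around is intentional),
// then scale the result back up by d and keep the upper 64 bits.
inline uint64_t fast_mod(uint64_t n, __uint128_t M, uint64_t d)
{
  __uint128_t lowbits = M * n;
  return mul128_u64(lowbits, d);
}
```

A quick sanity check is to compare `fast_div(n, compute_M(d))` with `n / d` and `fast_mod(n, compute_M(d), d)` with `n % d` for a handful of values before benchmarking anything.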

Benchmarking this against normal integer division, I see the following results (compiled with gcc 9.1, -O3 -std=c++17 -c):

-----------------------------------------------------------------
Benchmark              Time             CPU   Iterations
-----------------------------------------------------------------
div_bench           6.19 ns         6.17 ns    113415439
mod_bench           6.18 ns         6.17 ns    113532739
fastdiv_bench       2.06 ns         2.06 ns    340581374
fastmod_bench       25.8 ns         25.7 ns     27230411

With the benchmarking code being

std::mt19937_64 mt;
fastmod denominator = mt() % (1ULL << 27);

for(auto _ : state)
{
  auto numerator = mt();
  benchmark::DoNotOptimize(numerator % denominator);
}

for fastmod_bench and

std::mt19937_64 mt;
fastmod denominator = mt() % (1ULL << 27);

for(auto _ : state)
{
  auto numerator = mt();
  benchmark::DoNotOptimize(numerator / denominator);
}

for fastdiv_bench.

The fastmod_bench result is not what I expected. I strongly suspect that this is due to the line

__uint128_t magic = divisor.M_ * numerator;

since if I remove this and replace magic with just numerator, the new result becomes

fastmod_bench       1.70 ns         1.70 ns    412247047

which is over 10x faster than with the multiplication.

I'm wondering why introducing this multiplication makes the code run ~4x slower than even regular integer division. I expected fastmod_bench to land in the 2 ns range, alongside fastdiv_bench.

I have this loaded into godbolt here: https://godbolt.org/z/4fcAQA, but I don't see anything that readily explains why performance tanks because of the multiplication.

Unfortunately, it looks like this was a problem with my benchmark; the number generation must be interacting with the modulus somehow. Here's a new benchmark, same compiler/options:

std::mt19937_64 mt;
fastmod denominator = mt() % (1ULL << 27);
uint64_t numerator{10001107};
constexpr uint64_t transform{10005731};

for(auto _ : state)
{
  numerator *= transform;
  benchmark::DoNotOptimize(numerator % denominator);
}

(Similarly for division.)
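As a cross-check that does not depend on the benchmark library, the same RNG-free numerator sequence can be timed with `std::chrono`. This is only a rough sketch: `mod_checksum` and `ns_per_iteration` are hypothetical helpers, the loop uses the built-in `%` as a baseline, and a plain wall-clock loop is far less careful than a real benchmark harness.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs the numerator sequence from the benchmark above
// and folds every remainder into a checksum, so the loop has an observable
// result and cannot be discarded by the optimizer.
inline uint64_t mod_checksum(uint64_t denominator, uint64_t iterations)
{
  uint64_t numerator = 10001107;
  const uint64_t transform = 10005731;
  uint64_t checksum = 0;
  for (uint64_t i = 0; i < iterations; ++i)
  {
    numerator *= transform;              // cheap numerator update, no RNG involved
    checksum += numerator % denominator; // operation under test (built-in % here)
  }
  return checksum;
}

// Hypothetical helper: average wall-clock nanoseconds per loop iteration.
// The checksum is handed back through checksum_out so the caller can use it.
inline double ns_per_iteration(uint64_t denominator, uint64_t iterations, uint64_t& checksum_out)
{
  const auto start = std::chrono::steady_clock::now();
  checksum_out = mod_checksum(denominator, iterations);
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

Swapping the `%` inside `mod_checksum` for the fastmod operator and comparing both the timings and the checksums gives a second opinion on the numbers, and the checksum comparison also confirms that both variants compute the same remainders.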

I get the following results now:

-----------------------------------------------------------------
Benchmark              Time             CPU   Iterations
-----------------------------------------------------------------
div_bench           5.22 ns         5.20 ns    134608758
mod_bench           5.22 ns         5.20 ns    134627064
fastmod_bench       1.30 ns         1.30 ns    540373271
fastdiv_bench       1.30 ns         1.30 ns    540333224
