Here is a fastmod wrapper and the two operators I've implemented for it:
#include <cstdint>

class fastmod
{
public:
    // M_ = floor((2^128 - 1) / d) + 1, i.e. ceil(2^128 / d).
    // Requires d != 0; for d == 1 the value wraps to 0, so division needs d > 1.
    [[using gnu: cold]] fastmod(uint64_t denominator) : denominator_(denominator)
    {
        M_ = static_cast<__uint128_t>(-1) / denominator + 1;
    }
private:
    friend uint64_t operator/(uint64_t, fastmod const&);
    friend uint64_t operator%(uint64_t, fastmod const&);
    uint64_t denominator_;
    __uint128_t M_;
};
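[[using gnu: hot, always_inline]] inline uint64_t operator/(uint64_t numerator, fastmod const& divisor)
{
    // High 64 bits of the 192-bit product M_ * numerator. The top half of the low
    // partial product has to be added to the high partial product before the final shift.
    __uint128_t low = (divisor.M_ & 0xFFFFFFFFFFFFFFFFULL) * numerator >> 64;
    __uint128_t high = (divisor.M_ >> 64) * numerator;
    return static_cast<uint64_t>((low + high) >> 64);
}
[[using gnu: hot, always_inline]] inline uint64_t operator%(uint64_t numerator, fastmod const& divisor)
{
    // Fractional part of numerator / denominator_, scaled by 2^128 (wraps modulo 2^128).
    __uint128_t magic = divisor.M_ * numerator;
    __uint128_t low = (magic & 0xFFFFFFFFFFFFFFFFULL) * divisor.denominator_ >> 64;
    __uint128_t high = (magic >> 64) * divisor.denominator_;
    return static_cast<uint64_t>((low + high) >> 64);
}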
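A benchmark of these operators only means something if they return the same values as the native / and %, so here is a minimal standalone sketch for cross-checking that. It assumes a compiler that provides __uint128_t (GCC or Clang on a 64-bit target) and a denominator greater than 1. The names compute_magic, mul_high_128x64, fast_div, fast_mod and matches_native are hypothetical and are not part of the wrapper above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper names; a sketch under the assumptions stated above.

// M = floor((2^128 - 1) / d) + 1 = ceil(2^128 / d). Wraps to 0 for d == 1.
inline __uint128_t compute_magic(uint64_t d)
{
    assert(d > 1);
    return ~static_cast<__uint128_t>(0) / d + 1;
}

// floor(x * y / 2^128): the high 64 bits of the 192-bit product x * y.
inline uint64_t mul_high_128x64(__uint128_t x, uint64_t y)
{
    __uint128_t low = ((x & 0xFFFFFFFFFFFFFFFFULL) * y) >> 64;  // < 2^64
    __uint128_t high = (x >> 64) * y;                           // <= (2^64 - 1)^2
    return static_cast<uint64_t>((low + high) >> 64);           // the sum fits in 128 bits
}

inline uint64_t fast_div(uint64_t n, __uint128_t M)
{
    return mul_high_128x64(M, n);
}

inline uint64_t fast_mod(uint64_t n, __uint128_t M, uint64_t d)
{
    __uint128_t fraction = M * n;  // wraps modulo 2^128 on purpose
    return mul_high_128x64(fraction, d);
}

// Returns true if both helpers agree with the native operators for `count`
// numerators drawn from a simple multiplicative sequence.
inline bool matches_native(uint64_t d, uint64_t count)
{
    const __uint128_t M = compute_magic(d);
    uint64_t n = 10001107;
    for (uint64_t i = 0; i < count; ++i)
    {
        n *= 10005731;
        if (fast_div(n, M) != n / d || fast_mod(n, M, d) != n % d)
            return false;
    }
    return true;
}
```

Calling matches_native(d, count) for a handful of denominators before timing anything is a cheap way to confirm that the fast path and the native operators are doing the same work.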
Benchmarking this against normal integer division, I see the following results (compiled with gcc 9.1 -O3 -std=c++17 -c):
-----------------------------------------------------------------
Benchmark                Time             CPU          Iterations
-----------------------------------------------------------------
div_bench             6.19 ns         6.17 ns           113415439
mod_bench             6.18 ns         6.17 ns           113532739
fastdiv_bench         2.06 ns         2.06 ns           340581374
fastmod_bench         25.8 ns         25.7 ns            27230411
The benchmarking code is
std::mt19937_64 mt;
fastmod denominator = mt() % (1ULL << 27);
for(auto _ : state)
{
    auto numerator = mt();
    benchmark::DoNotOptimize(numerator % denominator);
}
for fastmod_bench, and
std::mt19937_64 mt;
fastmod denominator = mt() % (1ULL << 27);
for(auto _ : state)
{
    auto numerator = mt();
    benchmark::DoNotOptimize(numerator / denominator);
}
for fastdiv_bench.
The fastmod_bench result is not what I expected. I strongly suspect that this is due to the line
__uint128_t magic = divisor.M_ * numerator;
since if I remove this and replace magic with just numerator, the new result becomes
fastmod_bench 1.70 ns 1.70 ns 412247047
which is over 10x faster than with the multiplication.
I'm wondering why introducing this multiplication makes the code run ~4x slower than even regular integer division. I expected this benchmark to land in the 2 ns range, alongside fastdiv_bench.
I have this loaded into godbolt here: https://godbolt.org/z/4fcAQA, but I don't see anything that readily explains why performance tanks because of the multiplication.
Unfortunately, it looks like this was a problem with my benchmark; the number generation must be interacting with the modulus somehow. Here's a new benchmark, same compiler/options:
std::mt19937_64 mt;
fastmod denominator = mt() % (1ULL << 27);
uint64_t numerator{10001107};
constexpr uint64_t transform{10005731};
for(auto _ : state)
{
    numerator *= transform;
    benchmark::DoNotOptimize(numerator % denominator);
}
(Similarly for division.)
I get the following results now:
-----------------------------------------------------------------
Benchmark                Time             CPU          Iterations
-----------------------------------------------------------------
div_bench             5.22 ns         5.20 ns           134608758
mod_bench             5.22 ns         5.20 ns           134627064
fastmod_bench         1.30 ns         1.30 ns           540373271
fastdiv_bench         1.30 ns         1.30 ns           540333224