We need to use __float128 (provided by GCC on Linux) in C++ for some high-performance code involving matrices with extremely bad condition numbers, so that we can still keep 16 non-noise digits in the solution after 16 digits have been lost in the algebraic solve.
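For a sense of scale, here is a tiny illustration of my own (assuming GCC with the quadmath.h header available; not part of the actual code) that prints the significant decimal digits of each type:

// Illustration only: decimal precision of double vs __float128.
// Assumed build: g++ digits.cpp (no libquadmath function is called,
// only the FLT128_DIG macro from quadmath.h).
#include <cfloat>
#include <cstdio>
#include <quadmath.h>

int main() {
    std::printf("double:     %d significant decimal digits\n", DBL_DIG);    // 15
    std::printf("__float128: %d significant decimal digits\n", FLT128_DIG); // 33
    // ~33 digits leave roughly 16-17 usable digits even after ~16 digits
    // are lost to the ill-conditioned solve.
}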
We would like to further speed up certain slow sections of the code, so here is a code snippet that captures the slowness; it takes 1.2 seconds to run on my laptop:
__float128 val = 1.00001q;
__float128 finalval = 0;
for (int i = 0; i < 729 * 2100 * 64; i++)
    finalval += val;
std::cout << "> " << (double)finalval << std::endl;  // cast needed: std::ostream has no operator<< for __float128
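For reference, a self-contained version of the benchmark (a sketch; the build flags and the quadmath_snprintf printing are assumptions for illustration, not necessarily what produced the 1.2 s figure):

// Standalone version of the loop above.
// Assumed build: g++ -O3 bench.cpp -o bench -lquadmath
// (-lquadmath is only needed for quadmath_snprintf; the __float128
// arithmetic itself comes from libgcc's soft-float routines.)
// Timed with: time ./bench
#include <iostream>
#include <quadmath.h>

int main() {
    __float128 val = 1.00001q;
    __float128 finalval = 0;
    for (int i = 0; i < 729 * 2100 * 64; i++)
        finalval += val;

    char buf[128];
    quadmath_snprintf(buf, sizeof buf, "%.33Qg", finalval);  // print full quad precision
    std::cout << "> " << buf << std::endl;
}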
There is no RAM access here whatsoever, just CPU registers. Yes, __float128 arithmetic is done in software since the hardware (an Intel i7) doesn't support it, but still: this is 80x slower than the same loop in double, even though the dominant operation is an addition, which according to some benchmarks I have read should only be about 10x slower than double precision.
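For reference, the double baseline I have in mind for that 80x figure looks roughly like this (a sketch; how fast it runs depends on flags, since the compiler may vectorize or otherwise optimize the loop):

// Double-precision version of the same reduction, for comparison.
// Without -ffast-math the floating-point reduction is not reordered,
// so the loop stays a plain chain of scalar additions.
#include <iostream>

int main() {
    double val = 1.00001;
    double finalval = 0;
    for (int i = 0; i < 729 * 2100 * 64; i++)
        finalval += val;
    std::cout << "> " << finalval << std::endl;
}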
Does anyone have an idea of how to speed this up (SIMD, compiler options, ...)?