Update: Blazingly Fast-er SIMD Checksums

Saturday, December 24, 2022

This is a quick update to yesterday’s post on using std::experimental::simd to speed up APFS Fletcher-64 calculations. It turns out that there were still some low-hanging optimizations that could be used to improve my code. I got better performance from my code by using a simple loop unrolling technique.

Here’s the new version of the function. Notice that the only difference is that I’m now calculating more data per iteration of the loop. I’m using a lambda here to avoid code de-duplication, but the compiler will gladly inline the code.

static uint64_t fletcher64_simd(std::span<const uint32_t, 1024> words) {
  vu64 sum1{};
  sum1[0] = -(static_cast<uint64_t>(words[0]) + words[1]);

  vu64 sum2{};
  sum2[0] = words[1];

  const auto calc = [&](size_t n) {
    sum2 += vu32::size() * sum1;

    const vu64 all{reinterpret_cast<const uint64_t*>(std::addressof(words[n])),
                  stdx::vector_aligned};

    const vu64 evens = all & max32;
    const vu64 odds = all >> 32;

    sum1 += evens + odds;
    sum2 += evens * even_m + odds * odd_m;
  };

  for (size_t n = 0; n < words.size(); n += vu32::size()) {
    calc(n);
    calc(n += vu32::size());
    calc(n += vu32::size());
    calc(n += vu32::size());
    calc(n += vu32::size());
    calc(n += vu32::size());
    calc(n += vu32::size());
    calc(n += vu32::size());
  }

  // Fold the 64-bit overflow back into the 32-bit value
  const auto fold = [&](uint64_t x) {
    x = (x & max32) + (x >> 32);
    return (x == max32) ? 0 : x;
  };

  const uint64_t low = fold(stdx::reduce(sum1));
  const uint64_t high = fold(stdx::reduce(sum2));

  const uint64_t ck_low = max32 - ((low + high) % max32);
  const uint64_t ck_high = max32 - ((low + ck_low) % max32);

  return ck_low | ck_high << 32;
}

Updated Results

Here are the updated relative performance statistics with the updated code running on the same hardware as yesterday’s tests. Amazing!

Target Architecture	Time per Checksum	Throughput	Speedup
SSE	217ns	17.5543 GiB/s	3.4x
AVX2	105ns	36.2421 GiB/s	7x
AVX-512	75ns	50.7305 GiB/s	9.7x
NEON	171ns	22.273 GiB/s	2.7x

Find an issue or technical inaccuracy in this post? Please file an issue so that it may be corrected.