Using ARM cryptographic extensions with GHC

ppad-sha256 and ppad-sha512 are two libraries implementing the corresponding well-known SHA2 hashing algorithms and HMACs, defined in RFC6234 and RFC2104 respectively. They contain highly-optimised pure Haskell implementations, based originally on the venerable SHA package from Hackage and tuned significantly for performance and throughput.

These pure implementations are no slouches. When compiled with optimisations and GHC’s LLVM backend, the two libraries yield the following Criterion benchmark figures for producing a digest and HMAC, based on a 32B input, for example:

(ppad-sha256)

  benchmarking ppad-sha256/SHA256 (32B input)/hash
  time                 150.2 ns   (150.0 ns .. 150.4 ns)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 150.0 ns   (149.7 ns .. 150.2 ns)
  std dev              918.9 ps   (700.4 ps .. 1.190 ns)

  benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac
  time                 573.7 ns   (572.7 ns .. 574.4 ns)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 571.9 ns   (571.1 ns .. 572.7 ns)
  std dev              2.796 ns   (2.516 ns .. 3.214 ns)

(ppad-sha512)

  benchmarking ppad-sha512/SHA512 (32B input)/hash
  time                 201.6 ns   (201.1 ns .. 202.1 ns)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 200.9 ns   (200.5 ns .. 201.3 ns)
  std dev              1.409 ns   (1.114 ns .. 1.784 ns)

  benchmarking ppad-sha512/HMAC-SHA512 (32B input)/hmac
  time                 867.1 ns   (866.0 ns .. 868.3 ns)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 866.4 ns   (865.8 ns .. 867.3 ns)
  std dev              2.440 ns   (2.056 ns .. 2.956 ns)

Whilst the algorithms’ runtimes necessarily vary with input length, allocation is held constant:

(ppad-sha256)

  hash

    Case                 Allocated  GCs
    hash (32B  input)          192    0
    hash (64B  input)          192    0
    hash (128B input)          192    0
    hash (12288B input)        192    0

  hmac

    Case                 Allocated  GCs
    hmac (32B  input)          560    0
    hmac (64B  input)          560    0
    hmac (128B input)          560    0
    hmac (12288B input)        560    0

(ppad-sha512)

  hash

    Case                 Allocated  GCs
    hash (32B  input)          256    0
    hash (64B  input)          256    0
    hash (128B input)          256    0
    hash (12288B input)        256    0

  hmac

    Case                 Allocated  GCs
    hmac (32B  input)          560    0
    hmac (64B  input)          560    0
    hmac (128B input)          560    0
    hmac (12288B input)        560    0

So they are quite performant as-is; likely near, if not at, the limit of what can be achieved in pure Haskell. But modern processors, both in the ARM and x86 architecture families, support primitive hardware instructions for SHA2 computations, so a substantial performance boost can be achieved by taking advantage of those, where available.

To be precise, on aarch64-darwin one can access machine instructions for SHA256 via the +sha2 feature flag, and, somewhat counterintuitively, for SHA512 via the +sha3 flag (or for both and more, e.g. for AES, with +crypto, for the full suite of cryptographic extensions). The instructions enabled by +sha2, for example, are:

SHA256H, for the initial part of the SHA256 hash update,
SHA256H2, for the second part,
SHA256SU0, for the initial part of the SHA256 schedule update, and
SHA256SU1 for the second part.

A full SHA256 block computation involves 16 sets of each of those instructions applied in turn, constituting 64 updates in total. A series of such instructions in ARM assembly will literally look like, for some registers:

  sha256h.4s  q17, q2, v7
  sha256h2.4s q2, q16, v7
  sha256su0.4s  v5, v4
  sha256su1.4s  v5, v3, v6

To access these instructions from Haskell, one needs to supply a C shim and then invoke it via the foreign function interface. One can then get at the instructions via wrappers supplied by the ARM NEON headers:

  #if defined(__aarch64__) && defined(__ARM_FEATURE_SHA2)

  #include <arm_neon.h>

  [..]

  void sha256_block_arm(uint32_t *state, const uint8_t *block) {

  [..]

    /* Rounds 0-3 */
    tmp = vaddq_u32(m0, vld1q_u32(&K[0]));
    tmp2 = abcd;
    abcd = vsha256hq_u32(abcd, efgh, tmp);
    efgh = vsha256h2q_u32(efgh, tmp2, tmp);
    m0 = vsha256su1q_u32(vsha256su0q_u32(m0, m1), m2, m3);

    /* Rounds 4-7 */
    tmp = vaddq_u32(m1, vld1q_u32(&K[4]));
    tmp2 = abcd;
    abcd = vsha256hq_u32(abcd, efgh, tmp);
    efgh = vsha256h2q_u32(efgh, tmp2, tmp);
    m1 = vsha256su1q_u32(vsha256su0q_u32(m1, m2), m3, m0);

  [..]

And the FFI import is then a simple:

  foreign import ccall unsafe "sha256_block_arm"
    c_sha256_block :: Ptr Word32 -> Ptr Word8 -> IO ()

One can then use 'c_sha256_block' to perform the expensive SHA256 block computation in a Haskell implementation, availing of direct hardware acceleration when compiling with the '-march=armv8-a+sha2' option. Doing so yields the following speedups:

(ppad-sha256)

  benchmarking ppad-sha256/SHA256 (32B input)/hash
  time                 47.94 ns   (47.85 ns .. 48.01 ns)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 47.87 ns   (47.80 ns .. 47.94 ns)
  std dev              223.9 ps   (163.2 ps .. 357.3 ps)

  benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac
  time                 192.2 ns   (192.1 ns .. 192.3 ns)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 192.6 ns   (192.4 ns .. 192.9 ns)
  std dev              720.2 ps   (468.6 ps .. 1.249 ns)

(ppad-sha512)

  benchmarking ppad-sha512/SHA512 (32B input)/hash
  time                 105.6 ns   (105.3 ns .. 105.9 ns)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 105.2 ns   (105.1 ns .. 105.4 ns)
  std dev              550.2 ps   (430.7 ps .. 732.9 ps)

  benchmarking ppad-sha512/HMAC-SHA512 (32B input)/hmac
  time                 457.9 ns   (457.7 ns .. 458.2 ns)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 458.0 ns   (457.8 ns .. 458.3 ns)
  std dev              792.6 ps   (563.5 ps .. 1.169 ns)

Or, for a direct comparison:

  | Benchmark   | Before   | After    | Speedup |
  |-------------|----------|----------|---------|
  | SHA256 hash | 150.2 ns | 47.94 ns | 3.1x    |
  | HMAC-SHA256 | 573.7 ns | 192.2 ns | 3.0x    |
  | SHA512 hash | 201.6 ns | 105.6 ns | 1.9x    |
  | HMAC-SHA512 | 867.1 ns | 457.9 ns | 1.9x    |

Very fast! By using these instructions, computing a full SHA256 digest is only on the order of 10x the cost of a simple 256-bit word addition, for example. But it’s also a testament to how optimised the existing pure implementations in these libraries are: even by way of hardware acceleration, they only reap a small integer multiple gain in performance.

The latest versions of ppad-sha256 and ppad-sha512 (v0.2.5 and v0.1.5, respectively) use these native instructions by default, if they’re accessible, and fall back to the pure implementations otherwise.

(/)

(/about)

(/articles)

(/email)

(/git)

(/docs)