ppad-sha256 and ppad-sha512 are two libraries implementing the corresponding well-known SHA2 hashing algorithms and HMACs, defined in RFC6234 and RFC2104 respectively. They contain highly-optimised pure Haskell implementations, based originally on the venerable SHA package from Hackage and tuned significantly for performance and throughput.
These pure implementations are no slouches. When compiled with optimisations and GHC’s LLVM backend, the two libraries yield the following Criterion benchmark figures for producing a digest and HMAC, based on a 32B input, for example:
(ppad-sha256)
benchmarking ppad-sha256/SHA256 (32B input)/hash
time 150.2 ns (150.0 ns .. 150.4 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 150.0 ns (149.7 ns .. 150.2 ns)
std dev 918.9 ps (700.4 ps .. 1.190 ns)
benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac
time 573.7 ns (572.7 ns .. 574.4 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 571.9 ns (571.1 ns .. 572.7 ns)
std dev 2.796 ns (2.516 ns .. 3.214 ns)
(ppad-sha512)
benchmarking ppad-sha512/SHA512 (32B input)/hash
time 201.6 ns (201.1 ns .. 202.1 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 200.9 ns (200.5 ns .. 201.3 ns)
std dev 1.409 ns (1.114 ns .. 1.784 ns)
benchmarking ppad-sha512/HMAC-SHA512 (32B input)/hmac
time 867.1 ns (866.0 ns .. 868.3 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 866.4 ns (865.8 ns .. 867.3 ns)
std dev 2.440 ns (2.056 ns .. 2.956 ns)
Whilst the algorithms’ runtimes necessarily vary with input length, allocation is held constant:
(ppad-sha256)
hash
Case Allocated GCs
hash (32B input) 192 0
hash (64B input) 192 0
hash (128B input) 192 0
hash (12288B input) 192 0
hmac
Case Allocated GCs
hmac (32B input) 560 0
hmac (64B input) 560 0
hmac (128B input) 560 0
hmac (12288B input) 560 0
(ppad-sha512)
hash
Case Allocated GCs
hash (32B input) 256 0
hash (64B input) 256 0
hash (128B input) 256 0
hash (12288B input) 256 0
hmac
Case Allocated GCs
hmac (32B input) 560 0
hmac (64B input) 560 0
hmac (128B input) 560 0
hmac (12288B input) 560 0
So they are quite performant as-is; likely near, if not at, the limit of what can be achieved in pure Haskell. But modern processors, both in the ARM and x86 architecture families, support primitive hardware instructions for SHA2 computations, so a substantial performance boost can be achieved by taking advantage of those, where available.
To be precise, on aarch64-darwin one can access machine instructions for SHA256 via the +sha2 feature flag, and, somewhat counterintuitively, for SHA512 via the +sha3 flag (or for both and more, e.g. for AES, with +crypto, for the full suite of cryptographic extensions). The instructions enabled by +sha2, for example, are:
A full SHA256 block computation involves 16 sets of each of those instructions applied in turn, constituting 64 updates in total. A series of such instructions in ARM assembly will literally look like, for some registers:
sha256h.4s q17, q2, v7
sha256h2.4s q2, q16, v7
sha256su0.4s v5, v4
sha256su1.4s v5, v3, v6
To access these instructions from Haskell, one needs to supply a C shim and then invoke it via the foreign function interface. One can then get at the instructions via wrappers supplied by the ARM NEON headers:
#if defined(__aarch64__) && defined(__ARM_FEATURE_SHA2)
#include <arm_neon.h>
[..]
void sha256_block_arm(uint32_t *state, const uint8_t *block) {
[..]
/* Rounds 0-3 */
tmp = vaddq_u32(m0, vld1q_u32(&K[0]));
tmp2 = abcd;
abcd = vsha256hq_u32(abcd, efgh, tmp);
efgh = vsha256h2q_u32(efgh, tmp2, tmp);
m0 = vsha256su1q_u32(vsha256su0q_u32(m0, m1), m2, m3);
/* Rounds 4-7 */
tmp = vaddq_u32(m1, vld1q_u32(&K[4]));
tmp2 = abcd;
abcd = vsha256hq_u32(abcd, efgh, tmp);
efgh = vsha256h2q_u32(efgh, tmp2, tmp);
m1 = vsha256su1q_u32(vsha256su0q_u32(m1, m2), m3, m0);
[..]
And the FFI import is then a simple:
foreign import ccall unsafe "sha256_block_arm"
c_sha256_block :: Ptr Word32 -> Ptr Word8 -> IO ()
One can then use 'c_sha256_block' to perform the
expensive SHA256 block computation in a Haskell implementation, availing
of direct hardware acceleration when compiling with the
'-march=armv8-a+sha2' option. Doing so yields the following
speedups:
(ppad-sha256)
benchmarking ppad-sha256/SHA256 (32B input)/hash
time 47.94 ns (47.85 ns .. 48.01 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 47.87 ns (47.80 ns .. 47.94 ns)
std dev 223.9 ps (163.2 ps .. 357.3 ps)
benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac
time 192.2 ns (192.1 ns .. 192.3 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 192.6 ns (192.4 ns .. 192.9 ns)
std dev 720.2 ps (468.6 ps .. 1.249 ns)
(ppad-sha512)
benchmarking ppad-sha512/SHA512 (32B input)/hash
time 105.6 ns (105.3 ns .. 105.9 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 105.2 ns (105.1 ns .. 105.4 ns)
std dev 550.2 ps (430.7 ps .. 732.9 ps)
benchmarking ppad-sha512/HMAC-SHA512 (32B input)/hmac
time 457.9 ns (457.7 ns .. 458.2 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 458.0 ns (457.8 ns .. 458.3 ns)
std dev 792.6 ps (563.5 ps .. 1.169 ns)
Or, for a direct comparison:
| Benchmark | Before | After | Speedup |
|-------------|----------|----------|---------|
| SHA256 hash | 150.2 ns | 47.94 ns | 3.1x |
| HMAC-SHA256 | 573.7 ns | 192.2 ns | 3.0x |
| SHA512 hash | 201.6 ns | 105.6 ns | 1.9x |
| HMAC-SHA512 | 867.1 ns | 457.9 ns | 1.9x |
Very fast! By using these instructions, computing a full SHA256 digest is only on the order of 10x the cost of a simple 256-bit word addition, for example. But it’s also a testament to how optimised the existing pure implementations in these libraries are: even by way of hardware acceleration, they only reap a small integer multiple gain in performance.
The latest versions of ppad-sha256 and ppad-sha512 (v0.2.5 and v0.1.5, respectively) use these native instructions by default, if they’re accessible, and fall back to the pure implementations otherwise.