Security Analysis, ppad-fixed.

Abstract

A formalization and analysis of the security model established by the ppad-fixed library.

Overview

ppad-fixed is a high-performance fixed-width unsigned word library supporting constant-time arithmetic & other operations, including on two Montgomery domains related to the elliptic curve secp256k1. It is intended to be used in security-critical cryptographic applications, e.g. for the implementation of elliptic curve arithmetic or message authentication code construction that must be performed in constant time.

Threat Model

This analysis assumes an adversary capable of:

As a purely software implementation, a trusted execution environment at the physical layer is assumed; considered out-of-scope are adversaries who can perform attacks over power or electromagnetic emanation-related side channels, speculative execution attacks, and so on.

Security Property

The security property to be demonstrated is that library functions “run in constant time,” unless marked otherwise (e.g. a number of functions suffixed with the tag ‘vartime’, which are intended to run in variable time).

Put more formally: the observable execution of any library function, when called with differing secret inputs, should be indistinguishable to an attacker with the capabilities described above.

Consider the observable execution trace t(f, x) procured by an attacker when evaluating the library function f on inputs x, where the trace consists of the sequence of instructions executed, memory addresses accessed, and the size and count of allocations made. Then for any secret inputs s0 and s1, and any public input p, the claim is, up to renaming of base pointers and stack addresses, that:

  t(f, (s0, p)) == t(f, (s1, p)).

Analysis

We defend the claimed constant-time security property by means of a number of evidence layers. First, by examination of the data representation and algorithm choice, then by analysis of its execution semantics at the compiler level, and then by analysis of the produced assembly. Finally, we’ll demonstrate that empirical wall clock benchmarks and allocation measurements fail to refute the claim.

The analysis performed here is specifically for ppad-fixed v0.1.3, compiled with optimizations using the LLVM backend on GHC 9.10.3 and LLVM 19.1.7, targeting aarch64-darwin (also sometimes referred to as darwin-arm64).

Data Representation

ppad-fixed defines two and four-limb word types (i.e., 128 and 256-bit words, on a 64-bit system), consisting of strict, unboxed (register-allocated) tuples inlined into a heap-allocated data constructor.

The ‘Wider’ four-limb word type, for example, is defined as:

  type Limb4 = (# Limb, Limb, Limb, Limb #)

  data Wider = Wider !Limb4

A ‘Limb’ here is a newtype over an unboxed word, i.e. ‘Word#’, which, again, is register-allocated. Such a ‘Wider’ allocates precisely 40 bytes on a 64-bit system: 8 bytes for each limb, and another 8 bytes for the ‘Wider’ data constructor itself. Library functions that return a ‘Wider’ value should allocate 40 bytes exactly, no matter what intermediate calculations they perform.

Operations on individual ‘Limbs’ use available primitives from GHC.Exts, e.g. the following carrying addition function, and otherwise are written essentially from scratch using a minimal set of standard library dependencies:

  add_c#
    :: Limb  -- ^ augend
    -> Limb  -- ^ addend
    -> Limb  -- ^ carry
    -> Limb2 -- ^ (# sum, new carry #)
  add_c# (Limb a) (Limb b) (Limb c) =
    let !(# c0, s0 #) = Exts.plusWord2# a b
        !(# c1,  s #) = Exts.plusWord2# s0 c
    in  (# Limb s, Limb (Exts.or# c0 c1) #)
  {-# INLINE add_c# #-}

The library provides two API’s for each word type; each function has an “unboxed variant” that works with unboxed tuples, avoiding all heap allocation whatsoever, and a “boxed variant” that allocates data constructors only at the end of the computation, simply to box up the result. Internal computations are always performed exclusively in terms of unboxed tuples.

For example, building on the above carrying addition function on single limbs, there is the following overflowing addition function on four-limb tuples:

  add_o#
    :: Limb4             -- ^ augend
    -> Limb4             -- ^ addend
    -> (# Limb4, Limb #) -- ^ (# sum, carry bit #)
  add_o# (# a0, a1, a2, a3 #) (# b0, b1, b2, b3 #) =
    let !(# s0, c0 #) = L.add_o# a0 b0
        !(# s1, c1 #) = L.add_c# a1 b1 c0
        !(# s2, c2 #) = L.add_c# a2 b2 c1
        !(# s3, c3 #) = L.add_c# a3 b3 c2
    in  (# (# s0, s1, s2, s3 #), c3 #)
  {-# INLINE add_o# #-}

and then the boxed variant, which uses the above function internally:

  add_o
    :: Wider
    -> Wider
    -> (Wider, Word)
  add_o (Wider a) (Wider b) =
    let !(# s, Limb c #) = add_o# a b
    in  (Wider s, W# c)

The data representation ensures that there should be absolutely no input-dependent variability in allocation when producing values. Using the “unboxed API” internally means that allocation should occur solely when boxing results, and nowhere else.

Algorithmic Semantics

All functions, unless marked as executing in variable time, make exclusive use of “straight-line” code that avoids any branching or indexing on inputs. The Data.Choice module implements a number of constant-time primitives for selection and logical operations, and these are used throughout the rest of the library.

Loops are often unrolled where feasible to do so. The implementation of modular inverse, for example, contains an unrolled loop of Montgomery squares and multiplies that proceeds according to a fixed schedule:

  -- generated by etc/generate_inv.sh
  inv#
    :: Limb4
    -> Limb4
  inv# a =
    let -- montgomery 'one'
        !t0 = (# Limb 0x1000003D1##, Limb 0##, Limb 0##, Limb 0## #)
        !t1 = sqr# t0
        !t2 = mul# a t1
        !t3 = sqr# t2
        !t4 = mul# a t3
        --
        -- many intermediate steps omitted..
        --
        !t502 = mul# a t501
        !t503 = sqr# t502
        !t504 = sqr# t503
        !t505 = mul# a t504
        !r = t505
    in  r
  {-# INLINE inv# #-}

Since the schedule of multiplies and squares is fixed, then, by construction, the whole sequence runs in constant time, assuming that ‘sqr#’ and ‘mul#’ also do.

As another example, the Montgomery exponentiation function is implemented as follows:

  exp#
    :: Limb4
    -> Limb4
    -> Limb4
  exp# b e =
    let !o = (# Limb 0x1000003D1##, Limb 0##, Limb 0##, Limb 0## #)
        loop !r !m !ex n = case n of
          0 -> r
          _ ->
            let !(# ne, bit #) = Wider.shr1_c# ex
                !candidate = mul# r m
                !nr = select# r candidate bit
                !nm = sqr# m
            in  loop nr nm ne (n - 1)
    in  loop o b e (256 :: Word)
  {-# INLINE exp# #-}

This one has no fixed schedule, and so it can’t be known at compile time how many ‘mul#’ and ‘sqr#’ calls will be needed; a ‘mul#’ call is introduced when, looping through the bits of the exponent, the least-significant bit is set. Here, then, instead of computing one or the other branch depending on the LSB, which would introduce argument-dependent variability to the function’s runtime, both branches are computed, and a constant-time selection is performed using ‘select#’ to continue the computation with the appropriate value.

Another important algorithmic point of note is that memory access patterns are always unconditional and input-independent, with no early exit. Observe the overflowing addition function again, for example:

  add_o#
    :: Limb4             -- ^ augend
    -> Limb4             -- ^ addend
    -> (# Limb4, Limb #) -- ^ (# sum, carry bit #)
  add_o# (# a0, a1, a2, a3 #) (# b0, b1, b2, b3 #) =
    let !(# s0, c0 #) = L.add_o# a0 b0
        !(# s1, c1 #) = L.add_c# a1 b1 c0
        !(# s2, c2 #) = L.add_c# a2 b2 c1
        !(# s3, c3 #) = L.add_c# a3 b3 c2
    in  (# (# s0, s1, s2, s3 #), c3 #)
  {-# INLINE add_o# #-}

Note here that the (# a0, a1, a2, a3 #) match extracts all limbs directly, and this pattern is used throughout the codebase.

Evaluation Semantics (STG)

The library’s evaluation semantics, after partial compilation and optimization by GHC, can be analysed by way of a STG dump (where STG, for “Spineless Tagless G-machine,” is GHC’s intermediate representation that makes operational semantics explicit).

In STG, we have that:

Consider the overflowing addition function on ‘Wider’ values shown previously, for example. First, the unboxed variant, as STG:

  add_o#
    :: (# Limb, Limb, Limb, Limb #)
       -> (# Limb, Limb, Limb, Limb #)
       -> (# (# Limb, Limb, Limb, Limb #), Limb #) =
      \r [us us us us us us us us]
          case plusWord2# [us us] of {
          (#,#) c s ->
          case plusWord2# [us us] of {
          (#,#) c0 s0 ->
          case plusWord2# [s0 c] of {
          (#,#) c1 s1 ->
          case plusWord2# [us us] of {
          (#,#) c2 s2 ->
          case or# [c0 c1] of sat {
          __DEFAULT ->
          case plusWord2# [s2 sat] of {
          (#,#) c3 s3 ->
          case plusWord2# [us us] of {
          (#,#) c4 s4 ->
          case or# [c2 c3] of sat {
          __DEFAULT ->
          case plusWord2# [s4 sat] of {
          (#,#) c5 s5 ->
          case or# [c4 c5] of sat {
          __DEFAULT -> (#,,,,#) [s s1 s3 s5 sat];
          [..]
          };

Notice it contains no let bindings, and thus doesn’t allocate (recall raw limbs are stored in registers), and also that each case expression has only a single branch, such that each represents straightforward evaluation sequencing.

For the boxed wrapper, in which the above unboxed function has already been inlined:

  add_o :: Wider -> Wider -> (Wider, Word) =
      \r [ds ds1]
          case ds of {
          Wider us us us us ->
          case ds1 of {
          Wider us us us us ->
          case plusWord2# [us us] of {
          (#,#) c s ->
          case plusWord2# [us us] of {
          (#,#) c0 s0 ->
          case plusWord2# [s0 c] of {
          (#,#) c1 s1 ->
          case plusWord2# [us us] of {
          (#,#) c2 s2 ->
          case or# [c0 c1] of sat {
          __DEFAULT ->
          case plusWord2# [s2 sat] of {
          (#,#) c3 s3 ->
          case plusWord2# [us us] of {
          (#,#) c4 s4 ->
          case or# [c2 c3] of sat {
          __DEFAULT ->
          case plusWord2# [s4 sat] of {
          (#,#) c5 s5 ->
          case or# [c4 c5] of sat {
          __DEFAULT ->
          let { sat :: Word = W#! [sat]; } in
          let { sat :: Wider = Wider! [s s1 s3 s5]; } in  (,) [sat sat];
          [..]
          };

This function returns a boxed value, so it allocates, as can be seen by the let bindings at the end. This function returns a value of type (Wider, Word), so it needs to allocate precisely one ‘Wider’ data constructor, one ‘Word’ data constructor, and one tuple constructor, all of which is constant with respect to the function’s inputs; there is no additional allocation to be found anywhere.

As another example, the STG for the loop in the exponentiation function shown previously is as follows:

  $wloop
    :: (# Limb, Limb, Limb, Limb #)
       -> (# Limb, Limb, Limb, Limb #)
       -> (# Limb, Limb, Limb, Limb #)
       -> Word#
       -> (# Limb, Limb, Limb, Limb #) =
      \r [us us us us us us us us us us us us ww]
          case ww<TagProper> of wild2 {
            __DEFAULT ->
                case sqr# us us us us of nm {
                (#,,,#) _ _ _ _ ->
                case mul# us us us us us us us us of {
                (#,,,#) ipv4 ipv5 ipv6 ipv7 ->
                case uncheckedShiftL# [us 63#] of sat {
                __DEFAULT ->
                case uncheckedShiftRL# [sat 63#] of sat {
                __DEFAULT ->
                case not# [sat] of sat {
                __DEFAULT ->
                case plusWord# [sat 1##] of ipv8 {
                __DEFAULT ->
                case minusWord# [wild2 1##] of sat {
                __DEFAULT ->
                case xor# [us ipv7] of sat {
                __DEFAULT ->
                case and# [ipv8 sat] of sat {
                __DEFAULT ->
                case xor# [us sat] of sat {
                __DEFAULT ->
                [..]
                case or# [sat sat] of sat {
                __DEFAULT ->
                exp_$s$wloop1
                    sat sat sat sat ipv ipv1 ipv2 ipv3 sat sat sat sat sat;
                };
                [..]
                };
            0## -> (#,,,#) [us us us us];
          };

This contains no let bindings at all, but two branches are evident in the first case expression. These correspond to the decision of whether or not to continue or stop the loop (it stops when the looping index hits 0), which is deterministic and constant with respect to inputs. Note that the constant-time select operation used in the loop produces no such branching, but instead is captured by the salad of ‘xor#’ and ‘and#’ expressions in the body of the function (many omitted above).

For an example of something that does not run in constant time, consider the STG for the eq_vartime function, which compares for equality in variable time:

  eq_vartime :: Montgomery -> Montgomery -> Bool =
      \r [ds ds1]
          case ds of {
          Montgomery us us us us ->
          case ds1 of {
          Montgomery us us us us ->
          case eqWord# [us us] of {
            __DEFAULT -> False [];
            1# ->
                case eqWord# [us us] of {
                  __DEFAULT -> False [];
                  1# ->
                      case eqWord# [us us] of {
                        __DEFAULT -> False [];
                        1# -> eq_vartime# us us;
                      };
                };
          };
          };
          };

Here there are a number of branches, each of which depends on the inputs to the function, which indicates the expected failure to run in constant time.

Assembly

To inspect the aarch64-darwin assembly produced by GHC’s LLVM backend, we can dump the LLVM IR and then run it through ‘llc’ with optimizations.

The overflowing addition function mentioned previously compiles to the following assembly (entirety pasted here, for completeness):

  ; %bb.0:                                ; %naoh
    stp x25, x26, [sp, #40]
    stp x23, x24, [sp, #24]
    stp x20, x22, [sp, #8]
    ldp x8, x9, [x20]
    stp x9, x8, [sp, #176]
    ldr x10, [x20, #16]
    str x10, [sp, #168]
    adds  x8, x24, x8
    cset  w10, hs
    adds  x11, x23, x27
    cset  w12, hs
    stp x11, x12, [sp, #152]
    stp x8, x10, [sp, #136]
    adcs  x8, x8, xzr
    cset  w11, hs
    stp x8, x11, [sp, #120]
    adds  x8, x25, x9
    cset  w9, hs
    stp x8, x9, [sp, #104]
    orr x10, x10, x11
    adds  x24, x8, x10
    cset  w8, hs
    stp x24, x8, [sp, #88]
    ldr x10, [sp, #48]
    ldr x11, [sp, #168]
    adds  x10, x10, x11
    cset  w11, hs
    stp x10, x11, [sp, #72]
    orr x8, x9, x8
    adds  x25, x10, x8
    cset  w8, hs
    stp x25, x8, [sp, #56]
    orr x26, x11, x8
    stp x25, x26, [sp, #40]
    ldr x23, [sp, #120]
    stp x23, x24, [sp, #24]
    ldr x22, [sp, #152]
    ldr x8, [sp, #8]
    add x20, x8, #24
    stp x20, x22, [sp, #8]
    ldr x8, [x8, #24]
    blr x8
    ret

The ‘str’ and ‘ldr’ instructions mean store register and load register, and ‘stp’ and ‘ldp’ mean store pair of registers and load pair of registers, respectively. They all indicate data being shuffled around between registers and the stack. Per an Arm Reference, each of these instructions runs in constant time:

This instruction is a data-independent-time instruction

Moreover, the base registers used by each of these store and load instructions are public values, not derived from secrets. Consider

    ldp x8, x9, [x20]

which loads from the memory address stored in x20, a public register used by GHC’s calling convention to hold an STG stack pointer, into the registers x8 and x9 (see the “ARMv8/AArch64 ABI register mapping” section for the correspondence). Access to this register is strictly handled by the runtime, and isn’t decided based on function inputs. Other load and store instructions have the form of e.g.:

    stp x25, x26, [sp, #40]

which stores the contents of the registers x25 and x26 at the address determined by ‘sp’, a stack pointer, plus the fixed, immediate offset 40, which is a compile-time constant. These register access patterns are unaffected by secrets, and thus are not exploitable by an adversary that can observe cache timing.

The branch with link to register (blr) and return from subroutine (ret) instructions at the end of the computation are “whole program” control-flow instructions that branch unconditionally, and not on secret data. BLR’s target register contains a GHC continuation pointer and RET indicates that this is a subroutine return.

The actual overflowing addition algorithm is better-captured by the following core sequence of instructions, with the above “administrative” instructions omitted:

    adds  x8, x24, x8
    cset  w10, hs
    adds  x11, x23, x27
    cset  w12, hs
    adcs  x8, x8, xzr
    cset  w11, hs
    adds  x8, x25, x9
    cset  w9, hs
    orr x10, x10, x11
    adds  x24, x8, x10
    cset  w8, hs
    adds  x10, x10, x11
    cset  w11, hs
    orr x8, x9, x8
    adds  x25, x10, x8
    cset  w8, hs
    orr x26, x11, x8
    add x20, x8, #24

The add (add), add, setting flags (adds), add with carry, setting flags (adcs), and bitwise or (orr) instructions are all “data-independent-time,” and “cset” is an alias for conditional select increment (csinc), which is a constant-time branchless select.

It’s also worth examining the assembly for the variable-time equality function (“administrative” instructions omitted):

  ; %bb.0:
    cmp x8, x9
    b.ne  LBB31_4
  ; %bb.1:
    cmp x8, x9
    b.ne  LBB31_4
  ; %bb.2:
    cmp x8, x9
    b.ne  LBB31_4
  ; %bb.3:
    add x20, x8, #40

The core variable-time instruction here is branch conditionally (b.cond), where the condition of interest is “not equal to” (thus ‘b.ne’). This returns the subroutine early if a limb inequality check (performed by ‘cmp’) succeeds. If the subroutine can be timed reliably with chosen inputs, then the timing differential will expose which limb differs first.

Empirical

ppad-fixed has a Criterion benchmark suite that produces statistical execution time benchmarks on varying-sized inputs, such that variation in wall-clock time should be readily attributable to noise.

Here’s an example excerpt of the output, for a Montgomery modular exponentiation function:

  benchmarking exp/curve:  M(2) ^ 2
  time                 5.217 μs   (5.206 μs .. 5.224 μs)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 5.208 μs   (5.198 μs .. 5.215 μs)
  std dev              27.24 ns   (20.87 ns .. 37.18 ns)

  benchmarking exp/curve:  M(2 ^ 255 - 19) ^ (2 ^ 255 - 19)
  time                 5.214 μs   (5.207 μs .. 5.220 μs)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 5.211 μs   (5.201 μs .. 5.217 μs)
  std dev              27.82 ns   (16.79 ns .. 51.12 ns)

With a three-nanosecond average differential, the function runs in the same time (up to noise) when called with either small and large inputs.

A ‘weigh’ benchmark suite similarly measures allocation performed by each function. The Montgomery modular exponentiation function measures as follows, for example:

  exp

    Case                            Allocated  GCs
    curve:   M(2) ^ 2                      40    0
    curve:   M(2) ^ (2 ^ 255 - 19)         40    0
    scalar:  M(2) ^ 2                      40    0
    scalar:  M(2) ^ (2 ^ 255 - 19)         40    0

Recall that 40 bytes is what a four-limb strict, unboxed tuple behind a single constructor should allocate on a 64-bit system.

Any additional allocation is accounted for in functions that return values at more complicated types. The variable-time modular square root function on one of the Montgomery domains, for example, allocates as follows:

  sqrt_vartime

    Case                                  Allocated  GCs
    curve:  sqrt_vartime M(2)                    56    0
    curve:  sqrt_vartime M(2 ^ 255 - 19)         56    0

Here the function returns a ‘Maybe Montgomery’ value, so the additional 16 bytes are for the ‘Just’ constructor and the pointer to its payload.

Conclusion

By our appraisal, all functions in this library will run in constant time on aarch64-darwin when compiled with optimizations with GHC 9.10.3 and LLVM 19.1.7, unless marked otherwise.