ED25519 on MSP430

Introduction

The ED25519 signature scheme is designed for high performance, and it is attractive for lower-end microcontrollers, e.g., the MSP430. I haven't tried implementing it on that platform, but I'd like to provide some advice based on related projects. These notes are based on results from a prestudy done for Yubico AB a few years ago, doing elliptic curves on the MSP430 but with curves other than ed25519, on the implementation of ECC signatures in GNU Nettle (originally funded by the .SE Internet fund), and on my experience with low-level arithmetic in GNU GMP and GNU Nettle.

I'm using the MSP430 as a concrete example, but much of this should be applicable to other platforms.

This document was written by Niels Möller, 2018, and is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Code excerpts are provided mainly for illustrative purposes, and most of the code is copyright Yubico AB. The examples may be copied and used under the terms of the GNU LGPL (Lesser General Public License), version 2.1 or later.

EDDSA

Following RFC 8032, let G denote the group generator, q the group order, H the hash function (SHA-512), a the secret scalar, and A = a G the corresponding public key.

Sign operation: for a message M, compute the nonce r = H(prefix, M) mod q (where prefix is derived from the secret key), the point R = r G, and s = (r + H(R, A, M) a) mod q. The signature is the pair (R, s).

Verify operation: given A, M and a signature (R, s), accept if s G = R + H(R, A, M) A.

Main references are High-speed high-security signatures by Daniel J. Bernstein et al and RFC 8032.

Edwards curve 25519

Same curve as for curve25519, but with a coordinate change to treat it as a twisted Edwards curve rather than a Montgomery curve. The underlying field is defined by the prime p = 2^255 - 19.

The curve equation is -x² + y² = 1 - d x² y², with d = 121665 / 121666 (mod p). There are several alternatives for coordinates and formulas.

MSP430

The MSP430 is a family of 16-bit microcontrollers, featuring 12 general-purpose registers, and with several options for memory-mapped multiplication hardware. See the User's Guide.

Base arithmetic

Operations

Let us represent bignums as arrays of 16-bit words (uint16_t in C). Functions will mostly follow GMP mpn-level conventions; in particular, we store bignums with the least significant word first. We will need a handful of primitives, summarized below.
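
The primitives used in the rest of these notes could be declared as follows. This is a sketch: the exact signatures are inferred from how the functions are used further down, not taken from the study's actual headers.

/* All functions operate on little-endian arrays of 16-bit words and
   return the carry or borrow word, following GMP mpn conventions. */
uint16_t mp_add_n (uint16_t *rp, const uint16_t *ap, const uint16_t *bp, size_t n);
uint16_t mp_sub_n (uint16_t *rp, const uint16_t *ap, const uint16_t *bp, size_t n);
/* Adds B only when cnd is non-zero, in a side-channel silent way. */
uint16_t mp_cnd_add_n (uint16_t cnd, uint16_t *rp, const uint16_t *ap, const uint16_t *bp, size_t n);
uint16_t mp_add_1 (uint16_t *rp, const uint16_t *ap, size_t n, uint16_t b);
uint16_t mp_sub_1 (uint16_t *rp, const uint16_t *ap, size_t n, uint16_t b);
/* R <-- A b, R <-- R + A b, R <-- R - A b, for a single-word b. */
uint16_t mp_mul_1 (uint16_t *rp, const uint16_t *ap, size_t n, uint16_t b);
uint16_t mp_addmul_1 (uint16_t *rp, const uint16_t *ap, size_t n, uint16_t b);
uint16_t mp_submul_1 (uint16_t *rp, const uint16_t *ap, size_t n, uint16_t b);
/* Double-word variants, useful with MPY32 hardware; these store the
   next-to-most significant word of the result at rp[n] and return the
   most significant one. */
uint16_t mp_mul_2 (uint16_t *rp, const uint16_t *ap, size_t n, const uint16_t *bp);
uint16_t mp_addmul_2 (uint16_t *rp, const uint16_t *ap, size_t n, const uint16_t *bp);
/* Full products. */
void mp_mul (uint16_t *rp, const uint16_t *ap, size_t an, const uint16_t *bp, size_t bn);
void mp_sqr (uint16_t *rp, const uint16_t *ap, size_t n);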

Implementation

The source examples here are excerpts from the LGPL-licensed code from the Yubico AB study. Note that the assembly code follows GMP and Nettle conventions, with C as the comment character, and that assembler files are preprocessed using the m4 macro processor.

mp_add_n is an easy assembler loop,

define(<RP>, <r15>) 
define(<AP>, <r14>)
define(<BP>, <r13>)
define(<N>, <r12>)
C mp_add_n (rp, ap, bp, n)
	
mp_add_n:
	push	r9
	push	r8
	mov	#0, r8
	jmp	.Lnext
.Loop:
	C low bit of r8 is carry in and out
	bit	#1, r8	C get carry back
	mov	@AP+, r9
	addc	@BP+, r9
	mov	r9, 0(RP)
	rlc	r8	C save carry
	add	#2, RP
.Lnext:
	C Uses carry, so need to save and restore
	sub	#1, N
	jc	.Loop	C !borrow == carry

	mov	r8, r15
	and	#1, r15
	pop	r8
	pop	r9
	ret
  

This loop spends quite a few instructions on moving the carry back and forth between the status register and register r8. Unrolling for sizes of interest, in particular n = 16 for ed25519 and curve25519, might give a significant local speedup (but on the other hand, additions will be a pretty small fraction of the total work). And mp_cnd_add_n will be the same loop with an additional mask ANDed with the words read from bp.

For mp_add_1, note that when side-channel silence is needed, in particular in the signature operation, which handles the secret key, mp_add_1 cannot return early, but must propagate the carry all the way.
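
For illustration, a C sketch of such a side-channel silent mp_add_1 (an illustration, not code from the study; on the MSP430 one would probably write a small assembler loop in the same style as above):

uint16_t
mp_add_1 (uint16_t *rp, const uint16_t *ap, size_t n, uint16_t b)
{
  size_t i;
  uint32_t t = b;

  for (i = 0; i < n; i++)
    {
      /* Add b into the least significant word, then keep propagating the
         carry through all n words; no data-dependent early exit. */
      t += ap[i];
      rp[i] = (uint16_t) t;
      t >>= 16;
    }
  return (uint16_t) t;   /* Carry out, 0 or 1. */
}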

The loop below, implementing mp_addmul_1, uses the MPY memory-mapped multiplier, which produces the 32-bit product of two 16-bit inputs.

define(<RP>, <r15>)
define(<AP>, <r14>)
define(<N>, <r13>)
define(<B>, <r12>)
define(<HI>, <r8>)
define(<S>, <r12>)	C Overlaps B

C mp_addmul_1 (rp, ap, n, b)

mp_addmul_1:
	push	r8

	mov	B, &__MPY
	mov	#0, HI
	jmp	.Lnext

.Loop:
	mov	@AP+, &__OP2
	mov	@RP+, S
	add	HI, S
	mov	#0, HI
	addc	#0, HI
	add	&__RESLO, S
	addc	&__RESHI, HI
	mov	S, -2(RP)
.Lnext:
	sub	#1, N
	jc	.Loop	C !borrow == carry

	mov	r8, r15
	pop	r8
	ret
  

For devices that also have MPY32 hardware, which supports multiplications of sizes up to 32×32 → 64, it's beneficial to also implement mp_addmul_2(uint16_t *rp, const uint16_t *ap, size_t n, const uint16_t *bp), computing R ← R + A b where b is a 32-bit number (two 16-bit words).

define(<RP>, <r15>)
define(<AP>, <r14>)
define(<N>, <r13>)
define(<BP>, <r12>)
define(<H0>, <r8>)
define(<H1>, <r9>)
define(<S0>, <r12>)	C Overlaps BP
define(<S1>, <r10>)

C mp_addmul_2 (rp, ap, n, bp)

mp_addmul_2:
	push	r8
	push	r9
	push	r10

	mov	@BP+, &__MPY32L
	mov	@BP, &__MPY32H
	mov	#0, H0
	mov	#0, H1
	bit	#1, N
	jz	.Lnext

	mov	@AP+, &__OP2
	mov	@RP+, S0
	add	&__RES0, S0
	addc	&__RES1, H0
	addc	&__RES2, H1
	mov	S0, -2(RP)
	sub	#1, N
	jmp	.Lnext

.Loop:
	mov	@AP+, &__OP2L
	mov	@AP+, &__OP2H
	mov	@RP+, S0
	mov	@RP+, S1
	add	H0, S0
	mov	#0, H0
	addc	H1, S1
	mov	#0, H1
	addc	#0, H0
	add	&__RES0, S0
	addc	&__RES1, S1
	addc	&__RES2, H0
	addc	&__RES3, H1
	mov	S0, -4(RP)
	mov	S1, -2(RP)
.Lnext:
	sub	#2, N
	jc	.Loop	C !borrow == carry

	mov	H0, 0(RP)
	mov	H1, r15

	pop	r10
	pop	r9
	pop	r8
	ret
  

On top of these assembly loops, mp_mul can then be done as

/* R <-- A B. rp needs room for an + bn words and must not overlap the inputs. */
void
mp_mul (uint16_t *rp, const uint16_t *ap, size_t an, const uint16_t *bp, size_t bn)
{
  size_t i;

  assert (an >= bn);
  assert (bn > 0);

  if (bn & 1)
    {
      rp[an] = mp_mul_1 (rp, ap, an, bp[0]);
      i = 1;
    }
  else
    {
      rp[an+1] = mp_mul_2 (rp, ap, an, bp);
      i = 2;
    }
  for (; i < bn; i+= 2)
    rp[an + i + 1] = mp_addmul_2 (rp + i, ap, an, bp + i);
}

Squaring is a little messier. If we view the word products of a multiplication as a square matrix, then for squaring this matrix is symmetric. For the off-diagonal elements, sum and accumulate only one of the triangles, shift that sum left by one bit (to account for the other, identical triangle), and add in the diagonal terms. This eliminates almost half of the work compared to a multiply, but not quite, due to the additional book-keeping needed. Unrolled code for the smallest corner of the triangle may make sense too, to limit loop overhead.
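
Here is a sketch of such an mp_sqr (illustrative, not code from the study), built only on the mp_mul_1, mp_addmul_1 and mp_add_n conventions above. It assumes n >= 2, that rp (2n words) doesn't overlap ap, and that mp_add_n works in place (the loop above does, since it reads both operand words before writing the result word).

void
mp_sqr (uint16_t *rp, const uint16_t *ap, size_t n)
{
  size_t i;
  uint16_t cy;
  uint32_t sq, s;

  /* Off-diagonal triangle: rp[1 .. 2n-2] = sum_{i < j} ap[i] ap[j] B^(i+j),
     with B = 2^16. */
  rp[n] = mp_mul_1 (rp + 1, ap + 1, n - 1, ap[0]);
  for (i = 1; i + 1 < n; i++)
    rp[n + i] = mp_addmul_1 (rp + 2*i + 1, ap + i + 1, n - 1 - i, ap[i]);

  /* Double the triangle; the bit shifted out goes into the top word. */
  rp[2*n - 1] = mp_add_n (rp + 1, rp + 1, rp + 1, 2*n - 2);

  /* Add in the diagonal terms ap[i]^2. */
  rp[0] = 0;
  cy = 0;
  for (i = 0; i < n; i++)
    {
      sq = (uint32_t) ap[i] * ap[i];
      s = (uint32_t) rp[2*i] + (uint16_t) sq + cy;
      rp[2*i] = (uint16_t) s;
      s = (uint32_t) rp[2*i + 1] + (sq >> 16) + (s >> 16);
      rp[2*i + 1] = (uint16_t) s;
      cy = (uint16_t) (s >> 16);
    }
}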

Mod p arithmetic

The ed25519 curve is defined over the base field of integers mod p, with p = 2^255 - 19. It's important that we don't need a canonical representation of the field elements. We represent them as unsigned 256-bit numbers, or 16 words of 16 bits each.

Addition

To fold carries, we can use that 2^255 = 19 (mod p) and 2^256 = 38 (mod p). It's desirable to keep the number of passes over the data to a minimum. E.g., for addition, it's possible to directly fold back the carry by adding 38, but since that addition can in turn produce a carry, one more pass is required. For addition, I would instead suggest the following:

void
ed25519_modp_add(uint16_t *rp, const uint16_t *ap, const uint16_t *bp)
{
  uint16_t cy = mp_add_n(rp, ap, bp, 16);
  uint16_t hi = rp[15];
  cy = (cy << 1) | (hi >> 15);
  hi &= 0x7fff;
  /* There can be no overflow. */
  rp[15] = hi + mp_add_1(rp, rp, 15, cy * 19);
}
  
It's not obvious what the corresponding trick for subtraction would be, but there ought to be one. The simpler two-pass method would be

void
ed25519_modp_sub(uint16_t *rp, const uint16_t *ap, const uint16_t *bp)
{
  uint16_t cy = mp_sub_n(rp, ap, bp, 16);
  cy = mp_sub_1 (rp, rp, 16, cy*38);
  cy = mp_sub_1 (rp, rp, 16, cy*38);
  assert (cy == 0);
}
  

Multiplication and reduction

For both multiplication and squaring, first compute the plain product, then reduce it modulo p. Inputs are 256-bit numbers (16 words), and the plain product is twice as large, 32 words. Since 2^256 = 38 (mod p), mp_addmul_1 does almost all of the work of the reduction.

/* Clobbers xp input */
void
ed25519_modp(uint16_t *rp, uint16_t *xp)
{
  uint16_t hi, cy;

  cy = mp_addmul_1 (xp, xp+16, 16, 38);
  assert (cy <= 38); /* Small */
  hi = xp[15];
  cy = (cy << 1) | (hi >> 15);
  hi &= 0x7fff;
  /* There can be no overflow. */
  rp[15] = hi + mp_add_1(rp, xp, 15, cy*19);
}

void
ed25519_modp_mul(uint16_t *rp, const uint16_t *ap, const uint16_t *bp)
{
  uint16_t pp[32];
  mp_mul (pp, ap, 16, bp, 16);
  ed25519_modp(rp, pp);
}

void
ed25519_modp_sqr(uint16_t *rp, const uint16_t *ap)
{
  uint16_t pp[32];
  mp_sqr (pp, ap, 16);
  ed25519_modp(rp, pp);
}

Inversion and square root

The use of extended or homogeneous coordinates is important because it eliminates most of the divisions, but we still need inversion modulo p to get back from extended to plain coordinates. For the "point decompression" in signature verification, we also need a square root modulo p. And it turns out they can share the most computationally expensive part.

First, let e = (p-5)/8 = 2^252 - 3, and implement a helper function a → a^e (mod p). This can be done with the addition chain

e = 1 + 4 (2^125+1)(1+2(2^62+1)(2^31+1)(7+8(2^14+1)(2^7+1)(1+2(2^3+1)×7)))
and another helper function (a, k) → a^(2^k+1) (mod p), i.e., k squarings followed by one multiplication.

For inversion, use Fermat's little theorem, a^-1 = a^(p-2) (mod p). So we need the exponent

p - 2 = 2^255 - 21 = 1 + 2 (1 + 4 e)
So we can compute the inverse a^-1 (mod p) from a^e (mod p) with an additional 3 squarings and 2 multiplies.
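
For illustration, here is a sketch of these two helpers and the inversion (hypothetical function names, not code from the study), written on top of the ed25519_modp_mul and ed25519_modp_sqr functions above, and assuming those allow the result to alias an input.

/* x <-- a^(2^k+1) (mod p): k squarings followed by a multiply by a.
   x and a must not overlap. */
static void
ed25519_modp_pow_2kp1 (uint16_t *x, const uint16_t *a, unsigned k)
{
  unsigned i;
  for (i = 0; i < 16; i++)
    x[i] = a[i];
  while (k-- > 0)
    ed25519_modp_sqr (x, x);
  ed25519_modp_mul (x, x, a);
}

/* x <-- a^e (mod p), e = (p-5)/8 = 2^252 - 3, following the addition
   chain above.  x and a must not overlap. */
static void
ed25519_modp_pow_e (uint16_t *x, const uint16_t *a)
{
  uint16_t a7[16], t[16];
  unsigned i;

  ed25519_modp_sqr (a7, a);            /* a^2 */
  ed25519_modp_mul (a7, a7, a);        /* a^3 */
  ed25519_modp_sqr (a7, a7);           /* a^6 */
  ed25519_modp_mul (a7, a7, a);        /* a^7 */

  ed25519_modp_pow_2kp1 (t, a7, 3);    /* a^63 = a^(7 (2^3+1)) */
  ed25519_modp_sqr (t, t);             /* a^126 */
  ed25519_modp_mul (t, t, a);          /* a^127 = a^(2^7-1) */

  ed25519_modp_pow_2kp1 (x, t, 7);     /* a^(2^14-1) */
  ed25519_modp_pow_2kp1 (t, x, 14);    /* a^(2^28-1) */

  for (i = 0; i < 3; i++)              /* a^(2^31-8) */
    ed25519_modp_sqr (t, t);
  ed25519_modp_mul (t, t, a7);         /* a^(2^31-1) */

  ed25519_modp_pow_2kp1 (x, t, 31);    /* a^(2^62-1) */
  ed25519_modp_pow_2kp1 (t, x, 62);    /* a^(2^124-1) */
  ed25519_modp_sqr (t, t);             /* a^(2^125-2) */
  ed25519_modp_mul (t, t, a);          /* a^(2^125-1) */

  ed25519_modp_pow_2kp1 (x, t, 125);   /* a^(2^250-1) */
  ed25519_modp_sqr (x, x);
  ed25519_modp_sqr (x, x);             /* a^(2^252-4) */
  ed25519_modp_mul (x, x, a);          /* a^(2^252-3) = a^e */
}

/* x <-- a^-1 (mod p) = a^(p-2) = a^(1 + 2 (1 + 4 e)). */
void
ed25519_modp_inv (uint16_t *x, const uint16_t *a)
{
  uint16_t t[16];

  ed25519_modp_pow_e (t, a);           /* a^e */
  ed25519_modp_sqr (t, t);
  ed25519_modp_sqr (t, t);             /* a^(4e) */
  ed25519_modp_mul (t, t, a);          /* a^(4e+1) */
  ed25519_modp_sqr (t, t);             /* a^(8e+2) */
  ed25519_modp_mul (x, t, a);          /* a^(8e+3) = a^(p-2) */
}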

For square root, we can use a special case of the Shanks-Tonelli algorithm, plus a special trick of Daniel J. Bernstein's.

The main computational work for Shanks-Tonelli is computing the candidate square root x = a^((p+3)/8). Now, for point decompression, the input is a fraction, a = u/v (mod p). Instead of inverting v first, use the identity

x = (u/v)^((p+3)/8) = u v^3 (u v^7)^e (mod p)
with the same exponent e as earlier (a trick suggested by Daniel J. Bernstein). If the square root exists, then v x² = ±u (mod p), and in the -u case, we need to adjust x by multiplying with a fixed, precomputed square root of -1.
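
Continuing the same sketch (again hypothetical names, not code from the study), the candidate root can be computed as follows, sharing ed25519_modp_pow_e from above; the caller then checks v x² = ±u and multiplies x by the precomputed square root of -1 in the -u case. It assumes x doesn't overlap u or v.

void
ed25519_modp_sqrt_candidate (uint16_t *x, const uint16_t *u, const uint16_t *v)
{
  uint16_t t[16], s[16];

  ed25519_modp_sqr (t, v);             /* v^2 */
  ed25519_modp_mul (t, t, v);          /* v^3 */
  ed25519_modp_mul (x, t, u);          /* u v^3 */
  ed25519_modp_sqr (s, t);             /* v^6 */
  ed25519_modp_mul (s, s, v);          /* v^7 */
  ed25519_modp_mul (s, s, u);          /* u v^7 */
  ed25519_modp_pow_e (t, s);           /* (u v^7)^e */
  ed25519_modp_mul (x, x, t);          /* x = u v^3 (u v^7)^e */
}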

Mod q arithmetic

We also need a few operations modulo q, where q = 2^252 + q0 and q0 = 27742317777372353535851937790883648493. Here q is the order of the elliptic curve (sub)group we use, and q0 is a 125-bit number. This isn't so performance critical, but it needs to be side-channel silent. Sketch (untested) below:

void
ed25519_modq(uint16_t *rp, size_t n)
{
  static const uint16_t q0[16] = {
    0xd3ed, 0x5cf5, 0x631a, 0x5812, 0x9cd6, 0xa2f7, 0xf9de, 0x14de,
    0, 0, 0, 0, 0, 0, 0, 0,
  };
  static const uint16_t q0_times_16[16] = {
    0x3ed0, 0xcf5d, 0x31a5, 0x8126, 0xcd65, 0x2f79, 0x9dea, 0x4def,
    0x0001, 0,0,0,0,0,0,0,
  };
  static const uint16_t q[16] = {
    0xd3ed, 0x5cf5, 0x631a, 0x5812, 0x9cd6, 0xa2f7, 0xf9de, 0x14de,
    0, 0, 0, 0, 0, 0, 0, 0x1000,
  };

  uint16_t hi, cy;
  while (n-- > 16)
    {
      cy = mp_submul_1 (rp + n - 16, q0_times_16, 16, rp[n]);
      assert(cy < 2);
      mp_cnd_add_n (cy, rp + n - 16, rp + n - 16, q, 16);
    }

  hi = rp[15];
  cy = mp_submul_1(rp, q0, 15, hi >> 12);
  hi &= 0xfff;
  rp[15] = hi - cy;
  mp_cnd_add_n(hi < cy, rp, rp, q, 16);
}
This function produces a canonical representative.
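
As a usage example (an assumption about the surrounding code, not from the study): in EdDSA, the scalars derived from SHA-512 output are 512-bit numbers, interpreted little-endian and reduced modulo q. With the sketch above, that could look like:

/* Convert a 64-byte little-endian hash digest to a canonical scalar mod q. */
void
ed25519_hash_to_scalar (uint16_t *rp, const uint8_t *digest)
{
  uint16_t t[32];
  size_t i;

  /* Little-endian bytes to 16-bit words. */
  for (i = 0; i < 32; i++)
    t[i] = (uint16_t) digest[2*i] | ((uint16_t) digest[2*i + 1] << 8);

  ed25519_modq (t, 32);   /* Reduces in place, result in t[0..15]. */

  for (i = 0; i < 16; i++)
    rp[i] = t[i];
}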

Point operations

We need functions to add two points, to double a point (i.e., add it to itself), and to efficiently add a point to itself many times. The latter operation is usually called scalar multiplication, multiplying a point by an integer. Unlike other standard ECC curves, operations on the ed25519 curve can be done with "strongly unified" formulas, which work for all possible points without any special cases.

Coordinate choice

A point (x,y) on the curve can be represented in homogeneous (aka projective) coordinates (X, Y, Z), with x = X/Z, y = Y/Z, or extended coordinates (X, Y, Z, T), where the extra T coordinate is defined by x y = T/Z.

Let M denote the cost of one multiplication mod p, and S the cost of one squaring mod p. With homogeneous coordinates, point doubling takes 3M + 4S, while point addition takes 10M + 1S. Using extended coordinates, doubling takes 4M + 4S (one multiplication more), but point addition is only 9M. In both cases, one saves one M if one of the points added has Z = 1.

The right trade-off is a bit tricky. For scalar multiplication with a fixed point, we can use Pippenger's algorithm, and we then need more additions than doublings, so using extended coordinates should be faster. But for scalar multiplication with a general point, we would use some variant of a window-based algorithm, which reduces the number of additions but not the number of doublings, and then fast doubling is more important.

To keep things reasonably simple, I recommend starting with homogeneous coordinates (this is also what is used in GNU Nettle). Extended coordinates could be explored if one needs the highest performance for creating signatures. Formulas for homogeneous coordinates are needed for doubling, general addition, and addition with Z2 = 1 (second point in affine form).
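
To illustrate, here is a sketch of point doubling in homogeneous coordinates (the point struct and function names are made up, not from the study), using the mod p helpers above and the standard projective doubling formulas specialized to a = -1; the cost is the 3M + 4S quoted earlier. It assumes the mod p functions allow the result to alias an operand.

typedef struct
{
  uint16_t x[16];
  uint16_t y[16];
  uint16_t z[16];
} ed25519_point;

/* r <-- 2 p; r == p is allowed. */
void
ed25519_point_dup (ed25519_point *r, const ed25519_point *p)
{
  uint16_t b[16], c[16], d[16], f[16], h[16], j[16], t[16];

  ed25519_modp_add (t, p->x, p->y);
  ed25519_modp_sqr (b, t);             /* B = (X1+Y1)^2 */
  ed25519_modp_sqr (c, p->x);          /* C = X1^2 */
  ed25519_modp_sqr (d, p->y);          /* D = Y1^2 */
  ed25519_modp_sqr (h, p->z);          /* H = Z1^2 */

  ed25519_modp_sub (f, d, c);          /* F = E + D = D - C, with E = aC = -C */
  ed25519_modp_add (h, h, h);          /* 2 H */
  ed25519_modp_sub (j, f, h);          /* J = F - 2 H */

  ed25519_modp_sub (t, b, c);
  ed25519_modp_sub (t, t, d);          /* B - C - D = 2 X1 Y1 */
  ed25519_modp_mul (r->x, t, j);       /* X3 = (B - C - D) J */

  ed25519_modp_sub (t, c, d);          /* C - D */
  ed25519_modp_add (c, c, d);          /* C + D */
  ed25519_modp_mul (r->y, t, c);       /* Y3 = F (E - D) = (C - D)(C + D) */

  ed25519_modp_mul (r->z, f, j);       /* Z3 = F J */
}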

Secure table lookup

To speed up scalar multiplication, we will tabulate selected points, and index the tables by certain bits of the scalar. When signing, the index bits are part of the secret key. On higher-end processors, a memory access at an address which depends on secret data introduces a side-channel leak, via the processor cache. I don't know if that is an issue on the MSP430; if so, each table lookup must read the entire contents of the table and extract the desired element using ALU operations, which also makes large tables impractical.
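
A sketch of such a scanning lookup (hypothetical names, not from the study): it selects the entry at index k by reading every entry and masking, with no secret-dependent branches or addresses.

/* Select entry k from a table of n_entries entries, each width words. */
void
sec_table_select (uint16_t *rp, const uint16_t *table,
                  size_t n_entries, size_t width, size_t k)
{
  size_t i, j;

  for (j = 0; j < width; j++)
    rp[j] = 0;

  for (i = 0; i < n_entries; i++, table += width)
    {
      /* mask is all ones when i == k, otherwise zero, computed without
         any comparison or branch. */
      uint16_t d = (uint16_t) (i ^ k);
      uint16_t mask = (uint16_t) (((uint32_t) d - 1) >> 16);
      for (j = 0; j < width; j++)
        rp[j] |= mask & table[j];
    }
}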

Scalar multiplication of a fixed point

Scalar multiplication by a fixed point, the ECC group's generator, is the main computation when creating a signature. To use Pippenger's algorithm, we precompute suitable tables. For concreteness of the description, we use fixed parameters below, which result in 16 KB of tables.

Let n denote the scalar or exponent, with bits n0 (least significant) up to n252 (most significant). n should already be reduced modulo q, so it fits in 253 bits. Denote the fixed generator by G (this is a point on the curve).

Set c = 6: this is the bit size of the indices into the tables (i.e., each table has 64 elements; we will have four such tables). Set k = 11: this is the stride with which we access the bits of the scalar (Nettle currently uses k = 14, which is suboptimal and should be fixed). For the first table T0, consider the 6-bit index j. Set

j = j0 + 2 j1 + 4 j2 + 8 j3 + 16 j4 + 32 j5.
Then precompute the table value T0 as
T0[j] = (j0 + 2^11 j1 + 2^22 j2 + 2^33 j3 + 2^44 j4 + 2^55 j5) G.
Also define three more tables, offset by 2^66 G, 2^132 G and 2^198 G:
T1[j] = T0[j] + 2^66 G, T2[j] = T0[j] + 2^132 G, T3[j] = T0[j] + 2^198 G.
To compute the scalar product P = n G, first extract bits of n starting at position 10, and group them 6 at a time. Let indexing with [10,21,32,43,54,65] be a shorthand for [n10 + 2 n21 + 4 n32 + 8 n43 + 16 n54 + 32 n65]. Then do the first 4 table lookups and 3 adds,
P = T0[10,21,32,43,54,65] + T1[76,87,98,109,120,131] + T2[142,153,164,175,186,197] + T3[208,219,230,241,252,263]
Next, double P and index using the next lower bits:
P ← 2P + T0[9,20,31,42,53,64] + T1[75,86,97,108,119,130] + T2[141,152,163,174,185,196] + T3[207,218,229,240,251,262]
Iterate 9 more times; the last update, which yields the final result, is
P ← 2P + T0[0,11,22,33,44,55] + T1[66,77,88,99,110,121] + T2[132,143,154,165,176,187] + T3[198,209,220,231,242,253]
Note that if n is reduced modulo q, the highest bit used in the indexing of T3 is always zero. So we only have 5 relevant index bits, and we need only the first half (32 elements) of that table. With this observation the table size can be reduced from 16 KB to 14 KB.
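
For illustration, extracting one such 6-bit index from the scalar could look like this (a sketch with made-up names; np is the scalar as 16 little-endian 16-bit words, and bit positions at or above 253 are treated as zero, which also covers the positions beyond 255 used by T3 in the first pass).

/* Index bits n[start], n[start+11], ..., n[start+55], least significant
   first. */
static unsigned
pippenger_index (const uint16_t *np, unsigned start)
{
  unsigned idx = 0;
  unsigned i, bit;

  for (i = 0; i < 6; i++)
    {
      bit = start + 11 * i;
      if (bit >= 253)      /* n < 2^253, so these bits are zero. */
        continue;
      idx |= ((np[bit >> 4] >> (bit & 15)) & 1) << i;
    }
  return idx;
}

The first pass would then use pippenger_index(np, 10) for T0, pippenger_index(np, 76) for T1, and so on, decreasing the start positions by one in each subsequent pass. Since start and i are loop counters, not secret data, the bounds check introduces no side channel.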

We get a total of 10 point doublings, 43 point additions, and 44 table lookups. With the formulas for homogeneous coordinates (points in the tables will always have Z = 1), we get a total of 417 multiplications and 83 squarings modulo p. With extended coordinates, this is reduced to 384 multiplies and 40 squarings. On the other hand, the tables would then need to include the extended coordinate T = XY, so they would grow from 14 KB to 21 KB.

If 14 KB of tables is too large, other reasonable choices of parameters are, e.g., c = 5, k = 13 (8 KB of tables, 63 point operations) or c = 4, k = 16 (4 KB of tables, 78 point operations).

Scalar multiplication with an arbitrary point

This operation is needed to verify a signature. There is no secret input, so it doesn't need to be side-channel silent. To compute n P, one can use the standard algorithm for sliding-window exponentiation, see Handbook of Applied Cryptography, algorithm 14.85. One then computes a small table of odd multiples of P at run time; I would suggest a table of 16 points, P, 3P, 5P, ..., 31P (1.5 KB; note that we need three coordinates per point, since it would be too expensive to require Z = 1, which also means that we need a different variant of the addition formulas). The main part of the algorithm, after the computation of this table, needs 252 point doublings and a smaller number of additions.
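
For illustration, the run-time table could be computed as follows (hypothetical names; ed25519_point_dup is the doubling sketched earlier, and ed25519_point_add stands for the general addition in homogeneous coordinates, which is not shown in these notes).

/* table[i] <-- (2 i + 1) P, for i = 0 .. 15. */
void
ed25519_table_odd_multiples (ed25519_point *table, const ed25519_point *p)
{
  ed25519_point p2;
  unsigned i;

  table[0] = *p;                       /* P */
  ed25519_point_dup (&p2, p);          /* 2 P */
  for (i = 1; i < 16; i++)
    ed25519_point_add (&table[i], &table[i - 1], &p2);  /* (2i+1) P */
}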

Simultaneous scalar multiplication

The scalar multiplication in the signature verification expression could be rearranged as h A + s G, where A is the public key and G is the generator. We can compute h A and s G separately, using sliding-window for the former and Pippenger's algorithm (much faster, thanks to precomputed tables) for the latter.

One can shave off the 10 point doublings for the latter multiplication, and instead add in its terms at the right time in the computation of h A (during its last ten doublings). But since we have a large number of doublings anyway, and s isn't secret, it might be better to use a precomputed table of odd multiples and use sliding window also for s G, possibly with a larger table size than for h A. One would then have a single series of 252 point doublings, and add in tabulated multiples of A and G at the right places, depending on the bits of the respective scalars.

There are also variations (covered in the Handbook of Applied Cryptography) using a combined table indexed by a combination of bits from s and h. I doubt that is going to bring any speedup in this case.