The ED25519 signature scheme is designed for high performance, and it is attractive for lower-end micro controllers, e.g, the MSP430. I haven't tried implementing it on that platform, but I'd like to provide some advice based on related projects. These notes are based on results from a prestudy done for Yubico AB a few years ago, doing elliptic curves on the MSP430, but with curves other than ed25519, and on the implementation of ECC signatures in GNU Nettle (originally funded by the .SE Internet fund), and my experience with low-level arithmetics in GNU GMP and GNU Nettle.

I'm using the MSP430 as a concrete example, but much of this should be applicable to other platforms.

This document is written by Niels Möller, 2018, and licenced under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Code excerpts are provided mainly for illustrative purposes, and most of the code is copyright Yubico AB. The examples may be copied and used under the terms of the GNU LGPL (Lesser general public license), version 2.1 or later.

Sign operation:

- Point multiplication r G, with fixed base point G and r is a nonce derived from the secret key and the message.
- Two SHA512 digests.
- A few modulo q mul and add operations.

- Decompress, including modulo p sqrt.
- One SHA512 digest.
- Two point multiplications, one with fixed base point.

Main references are High-speed high-security signatures by Daniel J. Bernstein et al and RFC 8032.

Same curve as for curve25519, but with a coordinate change to
treat it as a twisted Edwards curve rather than a Montgomery curve.
Underlying field is defined by the prime p = 2^{255} - 19.

The curve equation is -x² + y² = 1 - d x² y², with d = 121665 / 121666 (mod p). There are several alternatives for coordinates and formulas..

MSP430 is a family 16-bit microcontrollers, featuring 12 general purpose registers, and with several options for memory-mapped multiplication hardware. See User's Guide.

Let us represent bignums as arrays of 16-bit words
(`uint16_t` in C). Functions will mostly
follow GMP
mpn-level conventions, in particular, we store bignums with least
significant word first. We will need

- Bignum addition, adding two numbers of
`n`words, and returning carry out:`uint16_t mp_add_n(uint16_t *rp, const uint16_t *ap, const uint16_t *bp, size_t n);` - Adding a bignum and a single word number
`uint16_t mp_add_1(uint16_t *rp, const uint16_t *ap, size_t n, uint16_t b);` - Conditional addition: sets R ← A + B if
`cnd`is non-zero, otherwise sets R ← A. Implemented with masking, to be side-channel silent.`uint16_t mp_cnd_add_n(uint16_t cnd, uint16_t *rp, const uint16_t *ap, const uint16_t *bp, size_t n);` - Corresponding subtraction functions.
- Computing R ← R + A b, where R and A are bignums, and b is a
16-bit words, returning carry word.
`uint16_t mp_addmul_1(uint16_t *rp, const uint16_t *ap, size_t n, uint16_t c);` - Multiplying two numbers of an and bn words, respectively,
producing the an+bn words of the product. And separate funcion
for the special case of squaring.
`void mp_mul(uint16_t *rp, const uint16_t *ap, size_t an, const uint16_t *bp, size_t bn);``void mp_sqr(uint16_t *rp, const uint16_t *ap, size_t n);`

The source examples here are excerpts from the LGPL licensed code
from the Yubico AB study. Note that assembly code follows GMP and
Nettle conventions with `C` as the comment character, and
assembler files preprocessed using the `m4` macro processor.

`mp_add_n` is an easy assembler loop,

define(<RP>, <r15>) define(<AP>, <r14>) define(<BP>, <r13>) define(<N>, <r12>) C mp_add_n (rp, ap, bp, n) mp_add_n: push r9 push r8 mov #0, r8 jmp .Lnext .Loop: C low bit of r8 is carry in and out bit #1, r8 C get carry back mov @AP+, r9 addc @BP+, r9 mov r9, 0(RP) rlc r8 C save carry add #2, RP .Lnext: C Uses carry, so need to save and restore sub #1, N jc .Loop C !borrow == carry mov r8, r15 and #1, r15 pop r8 pop r9 ret

This loop spends quite a few instructions on moving the carry back
and forth between the status register and register r8. Unrolling for
sizes of interest, in particular n = 16 for ed25519 and curve25519,
might give a significant local speedup (but on the other hand,
additions will be a pretty small fraction of the total work).
And `mp_cnd_add_n` will be the same loop with an additional
mask anded with the words read from `bp`.

For `mp_add_1`, note that when side-channel silence is
needed, in particular, for the signature operation handling the
secret key, mp_add_1 cannot return early, but must propagate carry
all the way.

Below loop uses the MPY memory mapped multiplier, which produces the 32-bit product of two 16-bit inputs.

define(<RP>, <r15>) define(<AP>, <r14>) define(<N>, <r13>) define(<B>, <r12>) define(<HI>, <r8>) define(<S>, <r12>) C Overlaps B C mp_addmul_1 (rp, ap, n, b) mp_addmul_1: push r8 mov B, &__MPY mov #0, HI jmp .Lnext .Loop: mov @AP+, &__OP2 mov @RP+, S add HI, S mov #0, HI addc #0, HI add &__RESLO, S addc &__RESHI, HI mov S, -2(RP) .Lnext: sub #1, N jc .Loop C !borrow == carry mov r8, r15 pop r8 ret

For devices that also have MPY32 hardware, which supports
multiplication of sizes up to 32×32 → 64, it's beneficial to also
implement `mp_addmul_2(uint16_t *rp, const uint16_t *ap, size_t
n, const uint16_t *bpp)`, computing R ← R + A b where b is a
32-bit number (two 16-bit words).

define(<RP>, <r15>) define(<AP>, <r14>) define(<N>, <r13>) define(<BP>, <r12>) define(<H0>, <r8>) define(<H1>, <r9>) define(<S0>, <r12>) C Overlaps BP define(<S1>, <r10>) C mp_addmul_2 (rp, ap, n, bp) mp_addmul_2: push r8 push r9 push r10 mov @BP+, &__MPY32L mov @BP, &__MPY32H mov #0, H0 mov #0, H1 bit #1, N jz .Lnext mov @AP+, &__OP2 mov @RP+, S0 add &__RES0, S0 addc &__RES1, H0 addc &__RES2, H1 mov S0, -2(RP) sub #1, N jmp .Lnext .Loop: mov @AP+, &__OP2L mov @AP+, &__OP2H mov @RP+, S0 mov @RP+, S1 add H0, S0 mov #0, H0 addc H1, S1 mov #0, H1 addc #0, H0 add &__RES0, S0 addc &__RES1, S1 addc &__RES2, H0 addc &__RES3, H1 mov S0, -4(RP) mov S1, -2(RP) .Lnext: sub #2, N jc .Loop C !borrow == carry mov H0, 0(RP) mov H1, r15 pop r10 pop r9 pop r8 ret

On top of these assembly loops, `mp_mul` can then be done as

void mp_mul (uint16_t *rp, const uint16_t *ap, size_t an, const uint16_t *bp, size_t bn) { size_t i; assert (an >= bn); assert (bn > 0); if (bn & 1) { rp[an] = mp_mul_1 (rp, ap, an, bp[0]); i = 1; } else { rp[an+1] = mp_mul_2 (rp, ap, an, bp); i = 2; } for (; i < bn; i+= 2) rp[an + i + 1] = mp_addmul_2 (rp + i, ap, an, bp + i); }

Squaring is a little messier. If we view the array of word products for a multiply as an array, for squaring the array is symmetric. For off-diagonal elements, only sum and accumulate one of the triangles, left-shift that sum (to take the other identical triangle in account), and add to the diagonal terms. This eliminates almost half of the work compared to a mulyiply, but not quite, due to the additional book-keeping needed. Unrolled code for the smallest corner of the triangle may make sense too, to limit loop overhead.

The ed25519 curve is defined over the base field of integers mod p,
with p = 2^{255} - 19. It's important that we don't need
canonical representation of the field elements. We represent them as
unsigned 256-bit numbers, or 16 words of 16 bits each.

To fold carries, we can use that 2^{255} = 19 (mod p) and
2^{256} = 38 (mod p). It's desirable to limit number of
passes to a minimum. E.g., for addition, it's possible to directly
fold back carry by adding 38, but since that addition in turn can
produce carry, one more pass is required. For addition, I would
instead suggest the following:

void ed25519_modp_add(uint16_t *rp, const uint16_t *ap, const uint16_t *bp) { uint16_t cy = mp_add_n(rp, ap, bp, 16); hi = rp[15]; cy = (cy << 1) | (hi >> 15); hi &= 0x7fff; /* There can be no overflow. */ rp[15] = hi + mp_add_1(rp, rp, 15, cy * 19); }It's not obvious what's the corresponding trick for subtraction is, but there ought to be one. The simpler two-pass method would be

void ed25519_modp_sub(uint16_t *rp, const uint16_t *ap, const uint16_t *bp) { uint16_t cy = mp_sub_n(rp, ap, bp, 16); cy = mp_sub_1 (rp, rp, 16, cy*38); cy = mp_sub_1 (rp, rp, 16, cy*38); assert (cy == 0); }

For both multiplication and squaring, first compute the plain product,
then reduce it modulo p. Inputs are 256-bit numbers (16 words), and
the plain product is twice as large, 32 words. Since 2^{256} =
38 (mod p), `mp_addmul_1` does almost all the job of the
reduction.

/* Clobbers xp input */ void ed25519_modp(uint16_t *rp, uint16_t *xp) { cy = mp_addmul_1 (xp, xp+16, 16, 38); assert (cy <= 38); /* Small */ hi = xp[15]; cy = (cy << 1) | (hi >> 15); hi &= 0x7fff; /* There can be no overflow. */ rp[15] = hi + mp_add_1(rp, xp, 15, cy*19); } void ed25519_modp_mul(uint16_t *rp, const uint16_t *ap, const uint16_t *bp) { uint16_t pp[32]; mp_mul (pp, ap, 16, bp, 16); ed25519_modp(rp, pp); } void ed25519_modp_sqr(uint16_t *rp, const uint16_t *ap) { uint16_t pp[32]; mp_sqr (pp, ap, 16); ed25519_modp(rp, pp); }

The use of extended or homogeneous coordinates is important because it eliminates most of the divisions, but we still need inversion modulo p to get back from extended to plain cooordinates. For the "point decompression" in signature verification, we also need square root modulo p. And it turns out they can share the most computationally expensive part.

First, let e = (p-5)/8 = 2^{252}-3, and implement a helper
function a → a^{e} (mod p). This can be done with the addition chain

e = 1 + 4 (2^125+1)(1+2(2^62+1)(2^31+1)(7+8(2^14+1)(2^7+1)(1+2(2^3+1)×7)))and another helper function (a,k) → a

For inversion, use Euler's identity a^{-1} = a^{p-2}
(mod p). So we need the exponent

p - 2 = 2So we can compute the inverse a^{255}-21 = 1 + 2 (1 + 4 e)

For square root, we can use a special case of the Shanks-Tonelli algorithm, plus a special trick of Daniel J. Bernstein's.

The main computational work for Shanks-Tonelli is computing the
candidate square root x = a^{(p+3)/8}. Now, for point
decompression, the input is a fraction, a = u/v (mod p). Instead of
inverting v first, use the identity

x = (u/v)with the same exponent e as earlier (a trick suggested by Daniel J. Bernstein). If the square root exists, then v x^{(p+3)/8}= u v^{3}(u v^{7})^{e}(mod p)

We also need a few operations modulo q, q = 2^{252} +
q_{0}, with q_{0} =
27742317777372353535851937790883648493. This is the order of the
elliptic curve (sub)group we use, and q_{0} is a 125-bit
number. This isn't so performance critical, but it needs to be
side-channel silent. Sketch (untested) below:

void ed25519_modq(uint16_t *rp, size_t n) { static const uint16_t q0[16] = { 0xd3ed, 0x5cf5, 0x631a, 0x5812, 0x9cd6, 0xa2f7, 0xf9de, 0x14de, 0, 0, 0, 0, 0, 0, 0, 0, }; static const uint16_t q0_times_16[16] = { 0x3ed0, 0xcf5d, 0x31a5, 0x8126, 0xcd65, 0x2f79, 0x9dea, 0x4def, 0x0001, 0,0,0,0,0,0,0, }; static const uint16_t q[16] = { 0xd3ed, 0x5cf5, 0x631a, 0x5812, 0x9cd6, 0xa2f7, 0xf9de, 0x14de, 0, 0, 0, 0, 0, 0, 0, 0x1000, }; uint16_t hi, cy; while (n-- > 16) { cy = mp_submul_1 (rp + n - 16, q0_times_16, 16, rp[n]); assert(cy < 2); mp_cnd_add_n (cy, rp + n - 16, rp + n 16, q, 16); } uint16_t hi = rp[15]; uint16_t cy = mp_submul_1(rp, q0, 15, hi >> 12); hi &= 0xfff; rp[15] = hi - cy; mp_cnd_add_n(hi < cy, rp, rp, q, 16); }This function produces a canonical representative.

We need functions to add two points, to double a point (i.e., add it to itself), and to efficiently add a point to itself many times. The latter is usually called scalar multiplication, multiplying a point by an integer. Unlike other standard ecc curves, operations n the ed25519 curve can be done with "strongly unified" formulas, which work for all possible points without any special cases.

A point (x,y) on the curve can be represented in homogeneous (aka projective) coordinates (X, Y, Z), with x = X/Z, y = Y/Z, or extended coordinates (X, Y, Z, T), where the extra T coordinate is defined by x y = T/Z.

Let M denote the cost of one multiplication mod p, and S the cost of one squaring mod p. With homogeneous coordinates, point doubling takes 3M + 4S, while point addition takes 10M +S. Using extended coordinates, doubling takes 4M + 4S (one multiplication more), but point addition is only 9M. In both cases, one saves one M if one of the points added has Z = 1.

The right trade-off is a bit tricky. For scalar multiplication with a fix point, we can use Pippenger's algorithm and we then need more additions than doublings, so using extended coordinates should be faster. But for scalar multiplication with a general point, we would use some variant of a window-based algorithm, which reduces the number of additions but not the number of doublings, and then fast doubling is more important.

To keep things reasonably simple, I recommend starting with homogeneous coordinates (this is also what is used in GNU Nettle). Extended coordinates could be explored if one needs the highest performance for creating signatures. Formulas for homogeneous coordinates: doubling, general addition, addition with Z2 = 1

To speed up scalar multiplication, we will tabulate selected points, and index tables by certain bits in the scalar. When signing, the index bits are part of the secret key. On higher end processors, a memory access at a address which depends on secret data introduces a side-channel leak, via the processor cache. I don't know if that is an issue on the MSP430; if so, each table lookup must read the entire contents of the table and extract the desired element using ALU operations, which also makes large tables impractical.

Scalar multiplication by a fixed point, the ecc group's generator, is the main computation when creating a signature. To use Pippengers' algorithm, we precompute suitable tables. For concreteness of the description, we use fixed parameters below, which result in 16 KB of tables).

Let n denote the scalar or exponent, with bits n_{0} (least
significant) up to and n_{252} (most significant). N should
already be reduced modulo q, so it fits in 253 bits. Denote the
fixed generator G (this is a point on the curve)

Set c = 6: this is the bit size if the indices to the tables
(i.e., each table has 64 elements, but we will have four such
tables). Set k = 11: this is the stride with which we access the
bits of the scalar (Nettle currently k = 14, which is suboptimal and
should be fixed). For the first table T_{0}, consider the
6-bit index j. Set

j = jThen precompute the table value T_{0}+ 2 j_{1}+ 4 j_{2}+ 8 j_{3}+ 16 j_{4}+ 32 j_{5}.

TAlso define three more tables, adding 2_{0}[j] = (j_{0}+ 2^{11}j_{1}+ 2^{22}j_{2}+ 2^{33}j_{3}+ 2^{44}j_{4}+ 2^{55}j_{5}) G.

TTo compute the scalar product P = n G, first extract bits of n starting at position 10, and group them 6 at a time. Let indexing with [10,21,32,43,54,65] be a shorthand for [n_{1}[j] = T_{0}[j] + 2^{66}G, T_{2}[j] = T_{0}[j] + 2^{132}G, T_{3}[j] = T_{0}[j] + 2^{198}G,

P = TNext, double P and index using the next lower bits:_{0}[10,21,32,43,54,65] + T_{1}[76,87,98,109,120,131] + T_{2}[142,153,164,175,186,197] + T_{3}[208,219,230,241,252,263]

P ← 2P + TIterate 9 times more, with last update and final result after_{0}[9,20,31,42,53,64] + T_{1}[75,86,97,108,119,130] + T_{2}[141,152,163,174,185,196] + T_{3}[207,218,229,240,251,262]

P ← 2P + TNote that if n is reduced modulo q, the highest bit used in the indexing of T_{0}[0,11,22,33,44,55] + T_{1}[66,77,88,99,110,121] + T_{2}[132,143,154,163,172,183] + T_{3}[198,209,220,231,242,253]

We get a total of 10 point doublings, 43 point additions, and 44 table lookups. With formulas with homogenous coordinates (points in the tables will always have Z=1), we'll get a total of 417 multiplications and 83 squarings modulo p. With extended coordinates, this is reduced to 384 multiplies and 40 squarings. On the other hand, tables would then need to include the extended coordinate T = XY, so tables would grow from 14 KB to 21 KB.

If 14 KB of tables is too large, other reasonble choices of parameters are, e.g, c = 5, k = 13 (8 KB of tables, 63 point operations) or c = 4, k = 16 (4 KB of tables, 78 point operations).

This operation is needed to verify a signature. There are no secret input, so it doesn't need to be side-channel silent. To compute n P, one can use the standard algorithm for sliding-window based exponentiation, see Handbook of Applied Cryptography, algorithm 14.85. One then computes a small table of odd multiples of P at run time, I would suggest using a table with 16 points, P, 3P, 5P, ..., 31P (1.5 KB, note that we need three cordinates per point since it would be too expensive to require Z=1, which also means that we need to use a different variant of the addition formulas). Main part of the algorithm, after the computation of this table, needs 252 point doublings and a smaller number of additions.

The scalar multiplication in the signature verification expression could be rearranged as h A + s G, where A is the public key and G is the generator. We can compute h A and s G separately, using sliding-window for the former and Pippenger's algorithm (much faster, thanks to precomputed tables) for the latter.

One can shave off the 10 point doublings for the latter multiplication, and instead add in the terms at the right time in the computation of h A (it would be in time for the last ten doublings). But since we have a large number of doublings anyway, and s isn't secret, it might be better to use a precomputed table of odd multiples and use sliding window also for s G, possibly with a larger table size than for h A. One would then have a single series of 252 point doublings, and add in tabulated multiples of A and G at the right places, depending on the bits of respective scalars.

There are also variations (covered in the Handbook of Applied Cryptography) using a combined table indexed by a combination of bits from s and h. I doubt that is going to bring any speedup in this case.