I had a couple of 6502-based computers when I was a teenager. VIC-20, C64, C128, C128D. I did a lot of hacking on them, including in assembly language, and I considered myself pretty good at it. But I didn't have much contact with anybody else doing 6502 programming. Remember, this was several years before the age of the Internet, and I didn't even have a modem.

Recently, as part of a small consulting project, I have had reason to look back at 6502 programming, in particular to find out its limits when it comes to the bignum operations used in public key cryptography. How fast can one multiply numbers of a few hundred bits?

I've been looking into this problem, and I have also found some old tricks from 6502 old timers on the net. Those tricks are beyond what I used or could come up with back in the day, but they may well have been known to less isolated hackers, in hacker groups as well as among professionals.

Let me first describe the basics of bignum multiplication. Large
integers are represented as arrays of words, interpreted as
digits in a base *B* corresponding to the machine word size, with
little-endian ordering. E.g., on the 6502, one would
have *B = 2^{8}*. The main workhorse for basic
schoolbook multiplication is the function addmul_1, which
multiplies a bignum by a single word and adds the result,
in place, to another bignum. Schoolbook multiplication built on
top of it may look like this:

void
mul (word *rp, const word *ap, const word *bp, size_t n)
{
  size_t i;
  rp[n] = mul_1 (rp, ap, n, bp[0]);
  for (i = 1; i < n; i++)
    rp[n+i] = addmul_1 (rp + i, ap, n, bp[i]);
}
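The two helper functions used there can be sketched in portable C as well. This is my own illustrative code, assuming an 8-bit `word` and a 16-bit `dword` for double-width products; it is a model of the interface, not 6502 code:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint8_t word;
typedef uint16_t dword;  /* double-width type for products (assumption) */

/* rp[0..n-1] = ap[0..n-1] * b; returns the carry-out word. */
static word
mul_1 (word *rp, const word *ap, size_t n, word b)
{
  word carry = 0;
  for (size_t i = 0; i < n; i++)
    {
      dword t = (dword) ap[i] * b + carry;
      rp[i] = (word) t;
      carry = (word) (t >> 8);
    }
  return carry;
}

/* rp[0..n-1] += ap[0..n-1] * b; returns the carry-out word. */
static word
addmul_1 (word *rp, const word *ap, size_t n, word b)
{
  word carry = 0;
  for (size_t i = 0; i < n; i++)
    {
      /* max value: 255*255 + 255 + 255 = 65535, so no overflow */
      dword t = (dword) ap[i] * b + rp[i] + carry;
      rp[i] = (word) t;
      carry = (word) (t >> 8);
    }
  return carry;
}
```

Note that the accumulation in `addmul_1` cannot overflow the double-width type: the worst case is exactly *B^{2} - 1*.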

This quadratic algorithm should be used only up to some threshold size. For larger numbers, one should use a divide-and-conquer algorithm such as Karatsuba multiplication, falling back to the schoolbook algorithm when the numbers get small enough.
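To illustrate the idea behind Karatsuba, here is a single level of the recursion in C, computing a 32 × 32 bit product from three 16 × 16 bit products instead of four (my own sketch, not part of the original post):

```c
#include <stdint.h>

/* One Karatsuba step: split each factor into 16-bit halves, and use
   a1*b0 + a0*b1 = a1*b1 + a0*b0 - (a1-a0)*(b1-b0)
   so that only three half-size products are needed. */
static uint64_t
karatsuba32 (uint32_t a, uint32_t b)
{
  uint32_t a1 = a >> 16, a0 = a & 0xffff;
  uint32_t b1 = b >> 16, b0 = b & 0xffff;
  uint64_t hi = (uint64_t) a1 * b1;
  uint64_t lo = (uint64_t) a0 * b0;
  /* middle term; the differences may be negative, so use signed math */
  int64_t mid = (int64_t) hi + (int64_t) lo
    - (int64_t) ((int32_t) a1 - (int32_t) a0)
      * ((int32_t) b1 - (int32_t) b0);
  return (hi << 32) + ((uint64_t) mid << 16) + lo;
}
```

In a real bignum implementation the three sub-products are themselves computed recursively, falling back to schoolbook multiplication below the threshold.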

Since the 6502 lacks a multiplication instruction, the most obvious way to multiply is to use shift and add. For an 8 × 8 bit product, one needs 8 shifts and an average of 4 adds. The following clever loop is from Leif Stensson. It should take 130 cycles on average, if I count it correctly.

        ; factors in factor1 and factor2
        LDA #0
        LDX #$8
        LSR factor1
loop:   BCC no_add
        CLC
        ADC factor2
no_add: ROR
        ROR factor1
        DEX
        BNE loop
        STA factor2
        ; done, high result in factor2, low result in factor1
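To check the interplay of the rotations, here is a straight C transcription of the loop (my own model; `a` plays the role of the accumulator, and the carry flag is kept in `carry`):

```c
#include <stdint.h>

/* C model of the shift-and-add loop above: the 16-bit product
   accumulates in a:factor1 and is rotated right once per iteration. */
static uint16_t
mul8 (uint8_t factor1, uint8_t factor2)
{
  uint8_t a = 0;                 /* LDA #0 */
  unsigned carry = factor1 & 1;  /* LSR factor1 */
  factor1 >>= 1;
  for (int i = 0; i < 8; i++)    /* LDX #$8 ... DEX / BNE loop */
    {
      if (carry)                 /* BCC no_add */
        {
          unsigned sum = a + factor2;  /* CLC; ADC factor2 */
          a = sum & 0xff;
          carry = sum >> 8;
        }
      unsigned bit0 = a & 1;     /* ROR: carry -> bit 7, bit 0 -> carry */
      a = (a >> 1) | (carry << 7);
      carry = bit0;
      bit0 = factor1 & 1;        /* ROR factor1 */
      factor1 = (factor1 >> 1) | (carry << 7);
      carry = bit0;
    }
  /* STA factor2: high byte in a, low byte left in factor1 */
  return ((uint16_t) a << 8) | factor1;
}
```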

If we implement schoolbook multiplication in the standard way,
using shift-and-add for each *8 × 8* bit product, we will
do *O(n^{2})* 8-bit shifts. The number
of shifts can be reduced by using the following algorithm.

- Zero the result area.
- Examine bit 0 of each word in *A*, and for each bit set, add in *B* at the correct place in the result area.
- Shift *B* left one bit, into a scratch area *T*.
- Examine bit 1 of each word in *A*, and for each bit set, add in *T* at the correct place in the result area.
- Iterate 6 times more, shifting the scratch *T* in place, and examining the next bit of each word in *A*.
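A C model of these steps might look as follows (my own sketch; `N` is the number of 8-bit words per factor, here fixed at 4 for illustration):

```c
#include <stdint.h>
#include <string.h>

#define N 4  /* 8-bit words per factor (assumption, for illustration) */

/* r = a * b, with all numbers stored little-endian, one byte per word. */
static void
comb_mul (uint8_t r[2*N], const uint8_t a[N], const uint8_t b[N])
{
  uint8_t t[N + 1];              /* scratch area T, one byte of headroom */
  memset (r, 0, 2*N);
  memcpy (t, b, N);              /* T starts out as B */
  t[N] = 0;
  for (int bit = 0; bit < 8; bit++)
    {
      for (int i = 0; i < N; i++)
        if (a[i] & (1 << bit))
          {
            /* add T into the result area at byte offset i */
            unsigned carry = 0;
            for (int j = 0; j <= N; j++)
              {
                unsigned sum = r[i + j] + t[j] + carry;
                r[i + j] = sum & 0xff;
                carry = sum >> 8;
              }
            for (int j = i + N + 1; carry && j < 2*N; j++)
              {
                unsigned sum = r[j] + carry;
                r[j] = sum & 0xff;
                carry = sum >> 8;
              }
          }
      /* shift T left one bit, in place, for the next bit position */
      unsigned carry = 0;
      for (int j = 0; j <= N; j++)
        {
          unsigned v = ((unsigned) t[j] << 1) | carry;
          t[j] = v & 0xff;
          carry = v >> 8;
        }
    }
}
```

Only the shift of *T* touches every bit position, so the shift count is linear in the operand size, as claimed below.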

With this algorithm, a multiplication of *n × n* bits needs
roughly *n* 8-bit shift instructions
and *n^{2}/16* 8-bit adds. An addition loop takes
23 cycles per add if we use indirect addressing, for a total of
roughly *1.5 n^{2}* cycles.

Can one use lookup tables to speed up multiplication?
The *1.5 n^{2}* cycles needed for the bignum shift-and-add
algorithm above correspond to 96 cycles per 8 × 8 bit partial product.

Using a full *8 × 8* multiplication table is out of the
question, since it wouldn't fit in the 16-bit address space. When
working on the problem, I considered using a table for *4 × 8*
multiplication, which would need 6, 8 or 12 KByte,
depending on how compactly it is stored.

Then I stumbled on a posting by a C64 old timer, reminding me of
the formula *xy = (x^{2} + y^{2} - (x-y)^{2})/2*. So one can do an
*8 × 8* bit multiplication using three lookups in a table of squares,
plus a few additions and subtractions. The 16-bit squares of all
8-bit values take 512 bytes, split into low-byte and high-byte
halves. The code may look like this:

        ; In: Factors in A and X
        ; Out: High byte in A, low byte in X
multiply:
        sta $fd
        cpx $fd
        bcc sorted
        txa
        ldx $fd
sorted: sta $ff
        stx $fd
        sec
        sbc $fd
        tay
        ldx $ff
        lda sqtab_lsb,x
        sbc sqtab_lsb,y
        sta $fe
        lda sqtab_msb,x
        sbc sqtab_msb,y
        sta $ff
        clc
        ldx $fd
        lda $fe
        adc sqtab_lsb,x
        sta $fe
        lda $ff
        adc sqtab_msb,x
        ror
        ror $fe
        ldx $fe
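In C, the same computation can be modeled like this (my own sketch; on the 6502 the 16-bit squares are of course split into the two 256-byte tables sqtab_lsb and sqtab_msb):

```c
#include <stdint.h>

static uint16_t sq[256];  /* sq[x] = x*x */

static void
init_sq (void)
{
  for (int x = 0; x < 256; x++)
    sq[x] = (uint16_t) (x * x);
}

/* xy = (x^2 + y^2 - (x-y)^2) / 2, via three table lookups */
static uint16_t
sqmul (uint8_t x, uint8_t y)
{
  if (x < y)                     /* sort so that x >= y */
    {
      uint8_t t = x; x = y; y = t;
    }
  return (uint16_t) ((sq[x] + sq[y] - sq[x - y]) >> 1);
}
```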

Some other variants are possible. One could preshift the values in
the table, storing *floor(x^{2}/2)* rather
than *x^{2}*, which saves the final shift.

One important observation is that when used for mul_1 and addmul_1, one of the operands is a loop invariant, so its square needs to be looked up only once, leaving just two table lookups per product. One can also write separate code for the cases of the invariant operand being even or odd (this simplifies the logic if we use a preshifted table).
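For instance, with a preshifted table *h[x] = floor(x^{2}/2)* and an *even* invariant factor *c*, the identity needs no final shift and no rounding correction at all, since the floor errors of *h[x]* and *h[|x-c|]* cancel (the odd case needs a one-bit correction). A C sketch of this, my own and not from the original posting:

```c
#include <stdint.h>

static uint16_t h[256];  /* preshifted table: h[x] = floor(x*x/2) */

static void
init_h (void)
{
  for (int x = 0; x < 256; x++)
    h[x] = (uint16_t) ((x * x) >> 1);
}

/* x*c for an even invariant c: since c is even, |x-c| has the same
   parity as x, so the truncation errors in h[x] and h[|x-c|] cancel
   and x*c = h[x] + h[c] - h[|x-c|] holds exactly. */
static uint16_t
sqmul_even (uint8_t x, uint8_t c)
{
  uint8_t d = x >= c ? x - c : c - x;
  return (uint16_t) (h[x] + h[c] - h[d]);
}
```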