The good old 6502

I had a couple of 6502-based computers when I was a teenager. VIC-20, C64, C128, C128D. I did a lot of hacking on them, including in assembly language, and I considered myself pretty good at it. But I didn't have much contact with anybody else doing 6502 programming. Remember, this was several years before the age of the Internet, and I didn't even have a modem.

Recently, as part of a small consulting project, I have had reason to look back at 6502 programming, in particular to find out its limits when it comes to bignum operations as used in public key cryptography. How fast can one multiply numbers of a few hundred bits?

I've been looking into this problem, and I have also found some old tricks from 6502 old-timers on the net. Those tricks are beyond what I used or could come up with back in the day, but they may well have been known to less isolated hackers, in hacker groups as well as among professionals.

Bignum multiply

Let me first describe the basics of bignum multiplication. Large integers are represented as arrays of words, interpreted as digits in base B, where B corresponds to the machine word size, stored in little-endian order. E.g., on the 6502, one would have B = 2^8. The main work horse for basic schoolbook multiplication is the function addmul_1, which multiplies a bignum by a single word and adds the result, in place, to another bignum. The full multiplication is then a loop around mul_1 and addmul_1, and may look like this:


	void mul (word *rp, const word *ap, const word *bp, size_t n)
	{
	  size_t i;
	  rp[n] = mul_1 (rp, ap, n, bp[0]);
	  for (i = 1; i < n; i++)
	    rp[n+i] = addmul_1 (rp + i, ap, n, bp[i]);
	}
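
The addmul_1 workhorse itself is not shown above; a rough C sketch, assuming word is an 8-bit type such as uint8_t, so that a 16-bit temporary can hold a product plus two carries, could look like this:

	#include <stdint.h>
	#include <stddef.h>

	typedef uint8_t word;   /* assumption: 8-bit words, i.e. B = 2^8 */

	/* rp[0..n-1] += ap[0..n-1] * b; returns the carry-out word.
	   A 16-bit temporary suffices: 255*255 + 255 + 255 = 65535. */
	word addmul_1 (word *rp, const word *ap, size_t n, word b)
	{
	  word carry = 0;
	  size_t i;
	  for (i = 0; i < n; i++)
	    {
	      uint16_t t = (uint16_t) (ap[i] * b + rp[i] + carry);
	      rp[i] = (word) t;          /* low byte is the new digit */
	      carry = (word) (t >> 8);   /* high byte is the next carry */
	    }
	  return carry;
	}

mul_1 is the same, except that rp[i] is not added in; the products simply overwrite rp.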
  

This quadratic algorithm should be used only up to some threshold size. For larger numbers, one should use a divide-and-conquer algorithm such as Karatsuba multiplication, falling back to the schoolbook algorithm when the numbers get small enough.
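
To see what the trick buys, here is a minimal, self-contained illustration of the Karatsuba identity on a single 32 × 32 bit product split into 16-bit halves (the name and the fixed sizes are just for illustration; the bignum version recurses on word arrays and needs extra bookkeeping for carries and odd lengths):

	#include <stdint.h>

	/* One Karatsuba step: three half-size multiplications instead of
	   the four used by the schoolbook method. */
	uint64_t karatsuba32 (uint32_t a, uint32_t b)
	{
	  uint32_t a1 = a >> 16, a0 = a & 0xffff;
	  uint32_t b1 = b >> 16, b0 = b & 0xffff;
	  uint64_t lo = (uint64_t) a0 * b0;
	  uint64_t hi = (uint64_t) a1 * b1;
	  /* (a0 + a1) * (b0 + b1) - a0*b0 - a1*b1 = a0*b1 + a1*b0 */
	  uint64_t mid = (uint64_t) (a0 + a1) * (b0 + b1) - lo - hi;
	  return lo + (mid << 16) + (hi << 32);
	}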

Shift and add

Since the 6502 lacks any multiplication instruction, the most obvious way to multiply is to use shift and add. For an 8 × 8 bit product, one would need 8 shifts and an average of 4 adds. The following clever loop is from Leif Stensson. It should take 130 cycles on average, if I count it correctly.


	  ; factors in factor1 and factor2
	  LDA #0		; A accumulates the high byte of the product
	  LDX  #$8		; 8 bits to process
	  LSR  factor1		; low bit of factor1 into carry
	loop:
	  BCC  no_add		; bit clear: skip the add
	  CLC
	  ADC  factor2		; bit set: add factor2 into the high byte
	no_add:
	  ROR			; rotate carry into A, low bit of A out into carry
	  ROR  factor1		; ... and on into factor1, exposing the next bit in carry
	  DEX
	  BNE  loop
	  STA  factor2
	  ; done, high result in factor2, low result in factor1
  

If we implement schoolbook multiplication in the standard way, using shift-and-add for each 8 × 8 bit product, we will do O(n^2) 8-bit shifts. The number of shifts can be reduced by using the following algorithm.

  1. Zero the result area.
  2. Examine bit 0 of each word in A, and for each bit set, add in B at the correct place in the result area.
  3. Shift B left one bit, into a scratch area T.
  4. Examine bit 1 of each word in A, and for each bit set, add in T at the correct place in the result area.
  5. Iterate 6 times more, shifting the scratch T in place, and examining the next bit of each word in A.

With this algorithm, a multiplication of n × n bits needs roughly n 8-bit shift instructions and n^2/16 8-bit adds. An addition loop takes 23 cycles per add if we use indirect addressing, for a total of about 1.5 n^2 cycles.
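
As a rough C model of the steps above (not 6502 code; here n counts 8-bit words rather than bits, word is uint8_t as in the addmul_1 sketch, and SCRATCH_MAX is a hypothetical compile-time bound on the operand size):

	#define SCRATCH_MAX 64

	/* rp is the 2n-word result area; tp holds B shifted left bit by bit. */
	void mul_shift_add (word *rp, const word *ap, const word *bp, size_t n)
	{
	  word tp[SCRATCH_MAX + 1];
	  size_t i, j;
	  unsigned bit;

	  for (i = 0; i < 2*n; i++)          /* step 1: zero the result area */
	    rp[i] = 0;
	  for (i = 0; i < n; i++)            /* copy B into the scratch area */
	    tp[i] = bp[i];
	  tp[n] = 0;

	  for (bit = 0; bit < 8; bit++)
	    {
	      /* steps 2 and 4: for each word of A with this bit set, add the
	         shifted B into the result at that word's offset */
	      for (i = 0; i < n; i++)
	        if (ap[i] & (1u << bit))
	          {
	            word carry = 0;
	            for (j = 0; j <= n; j++)
	              {
	                uint16_t t = (uint16_t) (rp[i+j] + tp[j] + carry);
	                rp[i+j] = (word) t;
	                carry = (word) (t >> 8);
	              }
	            for (j = i + n + 1; carry && j < 2*n; j++)
	              {
	                uint16_t t = (uint16_t) (rp[j] + carry);
	                rp[j] = (word) t;
	                carry = (word) (t >> 8);
	              }
	          }
	      /* steps 3 and 5: shift the scratch copy left one bit, in place */
	      if (bit < 7)
	        {
	          word carry = 0;
	          for (j = 0; j <= n; j++)
	            {
	              word top = tp[j] >> 7;
	              tp[j] = (word) ((tp[j] << 1) | carry);
	              carry = top;
	            }
	        }
	    }
	}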

Using table lookups

Can one use lookup tables to speed up multiplication? The 1.5 n^2 cycles needed for the bignum shift-and-add algorithm above correspond to 96 cycles per 8 × 8 bit product. So if we can find a way to do an 8 × 8 bit product faster than that, with some margin to allow for the addmul_1 accumulation, we could beat shift-and-add.

Using a full 8 × 8 multiplication table is out of the question, since it wouldn't fit in the 16-bit address space. When working on the problem, I considered using a table for 4 × 8 multiplication, which would need 6, 8 or 12 KByte, depending on how compactly it is stored.

Then I stumbled on a posting by a C64 old-timer, reminding me of the formula xy = (x^2 + y^2 - (x-y)^2)/2. So one can do an 8 × 8 bit product with three lookups in a small squaring table of just 512 bytes, plus a few additional shifts and adds. He also showed an implementation, running in 79 or 83 cycles depending on the branch:


	; In: Factors in A and X
	; Out: High byte in A, low byte in X

      multiply:
	sta $fd
	cpx $fd			; sort, so that A holds the larger factor
	bcc sorted
	txa
	ldx $fd
      sorted:
	sta $ff			; $ff = x, the larger factor
	stx $fd			; $fd = y, the smaller factor
	sec
	sbc $fd			; A = x - y
	tay
	ldx $ff
	lda sqtab_lsb,x
	sbc sqtab_lsb,y		; low byte of x^2 - (x-y)^2
	sta $fe
	lda sqtab_msb,x
	sbc sqtab_msb,y		; high byte of x^2 - (x-y)^2
	sta $ff
	clc
	ldx $fd
	lda $fe
	adc sqtab_lsb,x		; add y^2, low byte
	sta $fe
	lda $ff
	adc sqtab_msb,x		; ... and high byte; the 17-bit sum 2xy ends in carry:A:$fe
	ror			; halve it: carry rotates into bit 7 of A
	ror $fe			; high result in A, low result in $fe
	ldx $fe
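
The two 256-byte table halves, sqtab_lsb and sqtab_msb, simply hold the low and high bytes of i^2. A C sketch of how they might be filled in (on the C64 itself one would rather run a short loop at startup, or include the data in the binary):

	#include <stdint.h>

	static uint8_t sqtab_lsb[256];
	static uint8_t sqtab_msb[256];

	static void
	init_sqtab (void)
	{
	  unsigned i;
	  for (i = 0; i < 256; i++)
	    {
	      unsigned sq = i * i;          /* i^2 fits in 16 bits */
	      sqtab_lsb[i] = (uint8_t) (sq & 0xff);
	      sqtab_msb[i] = (uint8_t) (sq >> 8);
	    }
	}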
  

Some other variants are possible. One could preshift the values in the table, storing floor(x^2/2) rather than x^2 (but then one needs some special handling to get the correct result when both inputs are odd). Or use the alternative formula xy = ((x+y)^2 - x^2 - y^2)/2.
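
As a hypothetical C model of the preshifted variant (the table q[i] = i^2/2, rounded down, and the function name are mine, not from the posting), the odd × odd special handling amounts to adding one:

	/* q[i] = i*i/2, rounded down; the correction term compensates for
	   the two halves lost to truncation when both factors are odd. */
	unsigned mul_preshifted (uint8_t x, uint8_t y, const uint16_t *q)
	{
	  uint8_t d = (x > y) ? x - y : y - x;
	  return q[x] + q[y] - q[d] + (x & y & 1);
	}

The payoff is that the final division by two, the two ROR instructions in the routine above, is no longer needed.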

One important observation is that when this is used for mul_1 and addmul_1, one of the operands is a loop invariant, so only two table lookups are needed per 8 × 8 product. One can also write separate code for the cases of the invariant operand being even and odd (this simplifies the logic if we use a preshifted table).
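
A possible C model of addmul_1 on top of the squaring table, with the square of the invariant factor v looked up once outside the loop (sq is assumed to be a 256-entry table of 16-bit squares, and word is uint8_t as before):

	word addmul_1_sq (word *rp, const word *ap, size_t n, word v,
	                  const uint16_t *sq)
	{
	  uint32_t sqv = sq[v];       /* loop invariant: v^2 */
	  word carry = 0;
	  size_t i;
	  for (i = 0; i < n; i++)
	    {
	      word a = ap[i];
	      word d = (a > v) ? a - v : v - a;
	      uint32_t p = (sqv + sq[a] - sq[d]) >> 1;   /* p = v * a */
	      uint32_t t = p + rp[i] + carry;
	      rp[i] = (word) t;
	      carry = (word) (t >> 8);
	    }
	  return carry;
	}

Only the two lookups sq[a] and sq[d] remain in the loop.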