

-------	Chunky 2 Planar Guide	V 0.1 -------


	Main source: log of #c2p tutorial by Scout.
	So credit Kalms, I just cut, pasted and edited
	this thing (mostly in realtime :) -neure.


-1-  	What c2p is 


	C2P is a sort of rotation:

	Input

	a7a6a5a4a3a2a1a0
	b7b6b5b4b3b2b1b0
	c7c6c5c4c3c2c1c0
	d7d6d5d4d3d2d1d0
	e7e6e5e4e3e2e1e0
	f7f6f5f4f3f2f1f0
	g7g6g5g4g3g2g1g0
	h7h6h5h4h3h2h1h0

	Output

	a7b7c7d7e7f7g7h7
	a6b6c6d6e6f6g6h6
	a5b5c5d5e5f5g5h5
	a4b4c4d4e4f4g4h4
	a3b3c3d3e3f3g3h3
	a2b2c2d2e2f2g2h2
	a1b1c1d1e1f1g1h1
	a0b0c0d0e0f0g0h0
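
	As a reference point, the rotation above can be
	written out the slow and obvious way in C. This is
	only a sketch to pin down which bit goes where; the
	name c2p_naive_8x8 is made up for this guide, and it
	assumes in[0] is pixel "a" and in[7] is pixel "h".

```c
#include <stdint.h>

/* Naive 8x8 bit transpose (hypothetical helper, not from the log).
   in[0] is pixel "a", in[7] is pixel "h".
   out[0] receives bit 7 of every pixel (a7b7...h7), out[7] bit 0. */
void c2p_naive_8x8(const uint8_t in[8], uint8_t out[8])
{
    for (int plane = 0; plane < 8; plane++) {
        uint8_t row = 0;
        for (int pixel = 0; pixel < 8; pixel++) {
            /* pick bit (7 - plane) of this pixel */
            uint8_t bit = (uint8_t)((in[pixel] >> (7 - plane)) & 1);
            /* pixel "a" lands in the leftmost (highest) bit */
            row |= (uint8_t)(bit << (7 - pixel));
        }
        out[plane] = row;
    }
}
```

	Everything that follows is about getting the same
	result with far fewer operations.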


-2-	Primary C2P tool: Transposition


	An example 2x2 square transposition:

	ab __\ ac
	cd   / bd


-3-	Transposing 8x8


	Here is an input:

	a7a6a5a4 | a3a2a1a0
	b7b6b5b4 | b3b2b1b0
	c7c6c5c4 | c3c2c1c0
	d7d6d5d4 | d3d2d1d0
	---------+---------
	e7e6e5e4 | e3e2e1e0
	f7f6f5f4 | f3f2f1f0
	g7g6g5g4 | g3g2g1g0
	h7h6h5h4 | h3h2h1h0

	and here's the transpose, split 
	into 4x4-size blocks:

	a7b7c7d7 | e7f7g7h7
	a6b6c6d6 | e6f6g6h6
	a5b5c5d5 | e5f5g5h5
	a4b4c4d4 | e4f4g4h4
	---------+---------
	a3b3c3d3 | e3f3g3h3
	a2b2c2d2 | e2f2g2h2
	a1b1c1d1 | e1f1g1h1
	a0b0c0d0 | e0f0g0h0

-4-	Notes on the 8x8 transposition

	Now look at the "b" square (top right) of the 
	original, and the "c" square (bottom left) of the 
	transpose. One is just the other transposed
	in itself, ie:

	a3a2a1a0
	b3b2b1b0
	c3c2c1c0
	d3d2d1d0

	and:

	a3b3c3d3
	a2b2c2d2
	a1b1c1d1
	a0b0c0d0


-5-	Transpositions that do C2P on 8x8


	Input split up into 4x4-size squares:

	a7a6a5a4 | a3a2a1a0
	b7b6b5b4 | b3b2b1b0
	c7c6c5c4 | c3c2c1c0
	d7d6d5d4 | d3d2d1d0
	---------+---------
	e7e6e5e4 | e3e2e1e0
	f7f6f5f4 | f3f2f1f0
	g7g6g5g4 | g3g2g1g0
	h7h6h5h4 | h3h2h1h0

	Split up into 2x2-size squares:

	a7a6 | a5a4 | a3a2 | a1a0
	b7b6 | b5b4 | b3b2 | b1b0
	-----+------+------+-----
	c7c6 | c5c4 | c3c2 | c1c0
	d7d6 | d5d4 | d3d2 | d1d0
	-----+------+------+-----
	e7e6 | e5e4 | e3e2 | e1e0
	f7f6 | f5f4 | f3f2 | f1f0
	-----+------+------+-----
	g7g6 | g5g4 | g3g2 | g1g0
	h7h6 | h5h4 | h3h2 | h1h0
	
	Perform transpose on each 4x4 square, in two steps.

	Swap bits "b" and "c" inside each 2x2-size block
	(which is the same as "split into 1x1-size,
	swap blocks "b" and "c", merge back into
	2x2-size blocks"):

	a7b7 | a5b5 | a3b3 | a1b1	2x2 Merge
	a6b6 | a4b4 | a2b2 | a0b0
	-----+------+------+-----
	c7d7 | c5d5 | c3d3 | c1d1
	c6d6 | c4d4 | c2d2 | c0d0
	-----+------+------+-----
	e7f7 | e5f5 | e3f3 | e1f1
	e6f6 | e4f4 | e2f2 | e0f0
	-----+------+------+-----
	g7h7 | g5h5 | g3h3 | g1h1
	g6h6 | g4h4 | g2h2 | g0h0

	Notice here, that the "7351" bits are separated
	from the "6420" bits. (they're in different registers).
	Also, notice sequences like "a7b7", "e3f3", and other
	such things -- there are some _small_ signs of what's happening here.

	Swap block "b" and "c" inside each 4x4 group (= 2x2-size swap):

	a7b7 | c7d7 | a3b3 | c3d3	4x4 Merge
	a6b6 | c6d6 | a2b2 | c2d2
	-----+------+------+-----
	a5b5 | c5d5 | a1b1 | c1d1
	a4b4 | c4d4 | a0b0 | c0d0
	-----+------+------+-----
	e7f7 | g7h7 | e3f3 | g3h3
	e6f6 | g6h6 | e2f2 | g2h2
	-----+------+------+-----
	e5f5 | g5h5 | e1f1 | g1h1
	e4f4 | g4h4 | e0f0 | g0h0

	Remember, that the previous thing done, was to 
	"transpose each 2x2-size block internally", and this 
	was then phase #2: swapping block "b" and block "c" 
	inside the current 4x4 group.

	Merge back into 4x4-size blocks:

	a7b7c7d7 | a3b3c3d3
	a6b6c6d6 | a2b2c2d2
	a5b5c5d5 | a1b1c1d1
	a4b4c4d4 | a0b0c0d0
	---------+---------
	e7f7g7h7 | e3f3g3h3
	e6f6g6h6 | e2f2g2h2
	e5f5g5h5 | e1f1g1h1
	e4f4g4h4 | e0f0g0h0

	Swap block "b" and "c" inside the 8x8-group (ie, 4x4-size blocks):

	a7b7c7d7 | e7f7g7h7	8x8 Merge
	a6b6c6d6 | e6f6g6h6
	a5b5c5d5 | e5f5g5h5
	a4b4c4d4 | e4f4g4h4
	---------+---------
	a3b3c3d3 | e3f3g3h3
	a2b2c2d2 | e2f2g2h2
	a1b1c1d1 | e1f1g1h1
	a0b0c0d0 | e0f0g0h0

	Merge back, and you have it all done!
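
	The three steps above can be sketched in C on eight
	bytes. This is a model under assumptions, not code
	from the log: merge8 and transpose_8x8 are made-up
	names, r[0] is row "a", and the masks $aa/$cc/$f0
	mark the bits that stay put in the upper row of
	each pair.

```c
#include <stdint.h>

/* One merge between two rows (hypothetical helper).
   "keep" marks the bits that stay in the top row. The other bits of
   the top row move to the bottom row, shifted left by "shift"; the
   bottom row's bits under "keep" move up, shifted right by "shift". */
static void merge8(uint8_t *top, uint8_t *bot, int shift, uint8_t keep)
{
    uint8_t t = *top, b = *bot;
    *top = (uint8_t)((t & keep) | ((b & keep) >> shift));
    *bot = (uint8_t)((b & (uint8_t)~keep) | ((t & (uint8_t)~keep) << shift));
}

/* 8x8 bit transpose in three stages, smallest blocks first. */
void transpose_8x8(uint8_t r[8])
{
    /* swap bits "b" and "c" of every 2x2 block: rows 1 apart */
    for (int i = 0; i < 8; i += 2)
        merge8(&r[i], &r[i + 1], 1, 0xaa);
    /* swap 2x2 blocks inside every 4x4 group: rows 2 apart */
    for (int i = 0; i < 8; i++)
        if (!(i & 2))
            merge8(&r[i], &r[i + 2], 2, 0xcc);
    /* swap 4x4 blocks inside the 8x8 group: rows 4 apart */
    for (int i = 0; i < 4; i++)
        merge8(&r[i], &r[i + 4], 4, 0xf0);
}
```

	After the call, r[0] holds a7b7c7d7e7f7g7h7 and r[7]
	holds a0b0c0d0e0f0g0h0, as in the last chart.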


-6-	How do we do the merge ops?


	In this step-by-step-description, the
	actual _exchanging_ of bits/blocks happened
	first on the smallest levels, and then on
	larger and larger levels (1x1, 2x2, 4x4).
	But in fact this works just as well if it's
	done in the other order - ie, by defining
	transpose as "first swap blocks b & c, then
	transpose each block in itself".

	We're coming to the question "How do we 
	implement this?". More specifically, "How
	do we perform the bit-exchanges?"
	That's where the bit-merges come in handy.

	Don't worry about register shortage now, 
	assume that you have 1024 of them like on 
	some RISC CPUs :) - and as if you had the
	data in the lowest byte of d0-d7.

	The "bit-merge / merge-op" is just what
	I'm going to show to you now.

	... and you're going to do the 4x4-size "exchange blocks b & c"
	-- let's call that "performing a 4x4-merge"

	.. then, what should be done? I'll copy/paste a bit array again:
	[this one's just an example. no intention of you recognizing it. ;)]

	a7b7c7d7 | a3b3c3d3
	a6b6c6d6 | a2b2c2d2
	a5b5c5d5 | a1b1c1d1
	a4b4c4d4 | a0b0c0d0
	---------+---------
	e7f7g7h7 | e3f3g3h3
	e6f6g6h6 | e2f2g2h2
	e5f5g5h5 | e1f1g1h1
	e4f4g4h4 | e0f0g0h0

	.. then, you want to exchange the "a3b3c3d3" 
	part in register d0 with the "e7f7g7h7" part in register d4.

	and, that exchange could be done like this:

	move.b d0,   d8    ; make a copy of that row
	move.b d4,   d9    ; copy that row too
	and.b  #$f0, d0    ; those bits should remain in d0
	and.b  #$0f, d4    ; those bits should remain in d4
	and.b  #$0f, d8    ; those are the ones which should move
	and.b  #$f0, d9    ;  ditto
	lsl.b  #4,   d8    ; you figure this one out on your own. :)
	lsr.b  #4,   d9    ;  ditto
	or.b   d8,   d4    ; put in the bits
	or.b   d9,   d0

-7-	More Merge ops


	Ok, doing that operation 
	d0<->d4, d1<->d5, d2<->d6, d3<->d7 
	will perform a full 4x4 merge.

	now, how about the 2x2?
	it could be done on the nibble level... but!
	Remember that you want to do another
	2x2 operation between the same registers:

	a7b7 | a5b5 | a3b3 | a1b1
	a6b6 | a4b4 | a2b2 | a0b0
	-----+------+------+-----
	c7d7 | c5d5 | c3d3 | c1d1
	c6d6 | c4d4 | c2d2 | c0d0

	Look at registers d0 and d2 there,
	a7b7 | a5b5 | a3b3 | a1b1	and
	c7d7 | c5d5 | c3d3 | c1d1

	there you want to do a5b5<->c7d7, AND a1b1<->c3d3

	Also notice that they're going
	to shift equally far (ie, both
	a1b1 and a5b5 are going 2 bits left)
	so one can perform both those
	in just one merge:

	move.b d0,   d8   ; copy
	move.b d2,   d9
	and.b  #$cc, d0   ; save the static bits
	and.b  #$33, d2
	and.b  #$33, d8   ; save the dynamic
	and.b  #$cc, d9
	lsl.b  #2,   d8
	lsr.b  #2,   d9
	or.b   d8,   d2   ; add on the bits
	or.b   d9,   d0

	If this all seems too weird to you, 
	draw bitcharts in the comments.... like, 
	after "and.b #$cc,d9" --  ; d9 = c7d7....c3d3....

	Remember that you can do the 1x1, 2x2 and
	4x4 in any order you like. :)
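
	The "any order" claim is easy to test with a small
	C sketch that runs the three stages in all six orders
	and compares the results. The names merge8, stage and
	all_orders_agree are made up here, and the masks are
	the same $aa/$cc/$f0 "static bits" masks as above.

```c
#include <stdint.h>
#include <string.h>

/* Merge between two rows; "keep" marks the static bits of the top row. */
static void merge8(uint8_t *top, uint8_t *bot, int shift, uint8_t keep)
{
    uint8_t t = *top, b = *bot;
    *top = (uint8_t)((t & keep) | ((b & keep) >> shift));
    *bot = (uint8_t)((b & (uint8_t)~keep) | ((t & (uint8_t)~keep) << shift));
}

/* Apply one stage; size is 1, 2 or 4 (bit distance == row distance). */
static void stage(uint8_t r[8], int size)
{
    static const uint8_t keep[5] = {0, 0xaa, 0xcc, 0, 0xf0};
    for (int i = 0; i < 8; i++)
        if (!(i & size))
            merge8(&r[i], &r[i + size], size, keep[size]);
}

/* Returns 1 if all six stage orders give the same result for "in". */
int all_orders_agree(const uint8_t in[8])
{
    static const int order[6][3] = {
        {1, 2, 4}, {1, 4, 2}, {2, 1, 4},
        {2, 4, 1}, {4, 1, 2}, {4, 2, 1}
    };
    uint8_t ref[8], cur[8];
    for (int o = 0; o < 6; o++) {
        memcpy(cur, in, 8);
        for (int s = 0; s < 3; s++)
            stage(cur, order[o][s]);
        if (o == 0)
            memcpy(ref, cur, 8);        /* first order is the reference */
        else if (memcmp(ref, cur, 8) != 0)
            return 0;
    }
    return 1;
}
```

	It works because each stage only touches "its own"
	bit of the row and column numbers, so the stages
	don't get in each other's way.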


-8-	Preparing C2P for 32bit register usage

 
	Now you know how to make a byte-writing c2p.
	The next step is to try to make it process 
	(and write) longwords instead.
	But before doing so, we will need to
	once again look at some 8x8-bit charts
	to determine some interesting facts.

	a7a6a5a4a3a2a1a0
	b7b6b5b4b3b2b1b0
	c7c6c5c4c3c2c1c0
	d7d6d5d4d3d2d1d0
	e7e6e5e4e3e2e1e0
	f7f6f5f4f3f2f1f0
	g7g6g5g4g3g2g1g0
	h7h6h5h4h3h2h1h0

	There's an original chart again.

	(Next we will discuss why merges must be done in a certain
	 order and how the order can be determined -neure)

	When doing 1x1, 2x2, 4x4,
	one will perform operation between lines
	"a7...." and "b7....",  
	"a7...." and "c7....", 
	"a7...." and "e7...."

	What I'm trying to show here, is that 
	under _normal_ circumstances, when you're
	trying to establish what merges to perform,
	you only need to look at the bit #7 of the 
	bytes a, b, c, e, i, q
	(= byte number 0, 1, 2, 4, 8, 16).
	You're never making any merge between a and a,
	so you could skip a in the list.

	in this case we only have abce to worry about.
	and think like this:

	"What bits are in the same column as a7?"
	"Hmm, b, c and e.  therefore we can 
	 do 1xn, 2xn, or 4xn now."

	( The number of lines that it is down
	  (relative to a) is what determines the 
	  vertical step. )

	Why does this reasoning work?

	Well, the ONLY way to get in "e7" 4 bits
	after "a7" in the same register, is by 
	performing a 4xn merge at the correct moment
	and -- the correct moment is ONLY when e7 
	is in the same column as a7.

	Having this in mind, we can now proceed to
	try our hands at 32x8 bit arrays instead. 
	(= 8 longs, = it will do longword-writes
	to chipmem)

	Btw; if we don't remember this, we can more
	or less blindly try, and try to find other 
	things that coincide, but that takes much 
	more looking and "pattern-matching" than this.

	[In case you wonder "where the hell did 
	you find out about the last rule?", I 
	answer: after having worked and thought 
	about them for 2 years, I have finally 
	found that pretty simple scheme. :)]


-9-	5 pass C2P with registers


	a7a6a5a4a3a2a1a0 b7b6b5b4b3b2b1b0 c7c6c5c4c3c2c1c0 d7d6d5d4d3d2d1d0
	e7e6e5e4e3e2e1e0 f7f6f5f4f3f2f1f0 g7g6g5g4g3g2g1g0 h7h6h5h4h3h2h1h0
	i7i6i5i4i3i2i1i0 j7j6j5j4j3j2j1j0 k7k6k5k4k3k2k1k0 l7l6l5l4l3l2l1l0
	m7m6m5m4m3m2m1m0 n7n6n5n4n3n2n1n0 o7o6o5o4o3o2o1o0 p7p6p5p4p3p2p1p0
	q7q6q5q4q3q2q1q0 r7r6r5r4r3r2r1r0 s7s6s5s4s3s2s1s0 t7t6t5t4t3t2t1t0
	u7u6u5u4u3u2u1u0 v7v6v5v4v3v2v1v0 w7w6w5w4w3w2w1w0 x7x6x5x4x3x2x1x0
	y7y6y5y4y3y2y1y0 z7z6z5z4z3z2z1z0 A7A6A5A4A3A2A1A0 B7B6B5B4B3B2B1B0
	C7C6C5C4C3C2C1C0 D7D6D5D4D3D2D1D0 E7E6E5E4E3E2E1E0 F7F6F5F4F3F2F1F0

	Since each register is 32 bits,
	and the values are in "non-scrambled order",
	we will at least need to do 1/2/4/8/16bit merges
		
	If we do a bad job we will also need to do some merges twice.
	well, let's reason - look at the a7 column:

	a7
	e7
	i7
	m7
	q7
	u7
	y7
	C7

	See that e and i and q are in the same column
	therefore, 4/8/16bit are possible.

	Let's begin with the 8bit: 8x2

	a7a6a5a4a3a2a1a0 i7i6i5i4i3i2i1i0 c7c6c5c4c3c2c1c0 k7k6k5k4k3k2k1k0
	e7e6e5e4e3e2e1e0 m7m6m5m4m3m2m1m0 g7g6g5g4g3g2g1g0 o7o6o5o4o3o2o1o0
	b7b6b5b4b3b2b1b0 j7j6j5j4j3j2j1j0 d7d6d5d4d3d2d1d0 l7l6l5l4l3l2l1l0
	f7f6f5f4f3f2f1f0 n7n6n5n4n3n2n1n0 h7h6h5h4h3h2h1h0 p7p6p5p4p3p2p1p0
	q7q6q5q4q3q2q1q0 y7y6y5y4y3y2y1y0 s7s6s5s4s3s2s1s0 A7A6A5A4A3A2A1A0
	u7u6u5u4u3u2u1u0 C7C6C5C4C3C2C1C0 w7w6w5w4w3w2w1w0 E7E6E5E4E3E2E1E0
	r7r6r5r4r3r2r1r0 z7z6z5z4z3z2z1z0 t7t6t5t4t3t2t1t0 B7B6B5B4B3B2B1B0
	v7v6v5v4v3v2v1v0 D7D6D5D4D3D2D1D0 x7x6x5x4x3x2x1x0 F7F6F5F4F3F2F1F0

	Now look at the a7 column again.

	And -- e and q remained, that's no wonder -- but:
	b came in! Thus we can now do 1/4/16bit

	Let's do 4bit: 4x1:

	a7a6a5a4e7e6e5e4 i7i6i5i4m7m6m5m4 c7c6c5c4g7g6g5g4 k7k6k5k4o7o6o5o4
	a3a2a1a0e3e2e1e0 i3i2i1i0m3m2m1m0 c3c2c1c0g3g2g1g0 k3k2k1k0o3o2o1o0
	b7b6b5b4f7f6f5f4 j7j6j5j4n7n6n5n4 d7d6d5d4h7h6h5h4 l7l6l5l4p7p6p5p4
	b3b2b1b0f3f2f1f0 j3j2j1j0n3n2n1n0 d3d2d1d0h3h2h1h0 l3l2l1l0p3p2p1p0
	q7q6q5q4u7u6u5u4 y7y6y5y4C7C6C5C4 s7s6s5s4w7w6w5w4 A7A6A5A4E7E6E5E4
	q3q2q1q0u3u2u1u0 y3y2y1y0C3C2C1C0 s3s2s1s0w3w2w1w0 A3A2A1A0E3E2E1E0
	r7r6r5r4v7v6v5v4 z7z6z5z4D7D6D5D4 t7t6t5t4x7x6x5x4 B7B6B5B4F7F6F5F4
	r3r2r1r0v3v2v1v0 z3z2z1z0D3D2D1D0 t3t2t1t0x3x2x1x0 B3B2B1B0F3F2F1F0

	Scanning the edge again:
	b, q = 1bit, 16bit.

	Let's do 16x4:

	a7a6a5a4e7e6e5e4 i7i6i5i4m7m6m5m4 q7q6q5q4u7u6u5u4 y7y6y5y4C7C6C5C4
	a3a2a1a0e3e2e1e0 i3i2i1i0m3m2m1m0 q3q2q1q0u3u2u1u0 y3y2y1y0C3C2C1C0
	b7b6b5b4f7f6f5f4 j7j6j5j4n7n6n5n4 r7r6r5r4v7v6v5v4 z7z6z5z4D7D6D5D4
	b3b2b1b0f3f2f1f0 j3j2j1j0n3n2n1n0 r3r2r1r0v3v2v1v0 z3z2z1z0D3D2D1D0
	c7c6c5c4g7g6g5g4 k7k6k5k4o7o6o5o4 s7s6s5s4w7w6w5w4 A7A6A5A4E7E6E5E4
	c3c2c1c0g3g2g1g0 k3k2k1k0o3o2o1o0 s3s2s1s0w3w2w1w0 A3A2A1A0E3E2E1E0
	d7d6d5d4h7h6h5h4 l7l6l5l4p7p6p5p4 t7t6t5t4x7x6x5x4 B7B6B5B4F7F6F5F4
	d3d2d1d0h3h2h1h0 l3l2l1l0p3p2p1p0 t3t2t1t0x3x2x1x0 B3B2B1B0F3F2F1F0

	Scan edge -- oh, now c got in there
	1/2bit, let's do 2x4:

	a7a6c7c6e7e6g7g6 i7i6k7k6m7m6o7o6 q7q6s7s6u7u6w7w6 y7y6A7A6C7C6E7E6
	a3a2c3c2e3e2g3g2 i3i2k3k2m3m2o3o2 q3q2s3s2u3u2w3w2 y3y2A3A2C3C2E3E2
	b7b6d7d6f7f6h7h6 j7j6l7l6n7n6p7p6 r7r6t7t6v7v6x7x6 z7z6B7B6D7D6F7F6
	b3b2d3d2f3f2h3h2 j3j2l3l2n3n2p3p2 r3r2t3t2v3v2x3x2 z3z2B3B2D3D2F3F2
	a5a4c5c4e5e4g5g4 i5i4k5k4m5m4o5o4 q5q4s5s4u5u4w5w4 y5y4A5A4C5C4E5E4
	a1a0c1c0e1e0g1g0 i1i0k1k0m1m0o1o0 q1q0s1s0u1u0w1w0 y1y0A1A0C1C0E1E0
	b5b4d5d4f5f4h5h4 j5j4l5l4n5n4p5p4 r5r4t5t4v5v4x5x4 z5z4B5B4D5D4F5F4
	b1b0d1d0f1f0h1h0 j1j0l1l0n1n0p1p0 r1r0t1t0v1v0x1x0 z1z0B1B0D1D0F1F0

	... And finally, only a 1x2 to perform.
	Here goes...

	a7b7c7d7e7f7g7h7 i7j7k7l7m7n7o7p7 q7r7s7t7u7v7w7x7 y7z7A7B7C7D7E7F7
	a3b3c3d3e3f3g3h3 i3j3k3l3m3n3o3p3 q3r3s3t3u3v3w3x3 y3z3A3B3C3D3E3F3
	a6b6c6d6e6f6g6h6 i6j6k6l6m6n6o6p6 q6r6s6t6u6v6w6x6 y6z6A6B6C6D6E6F6
	a2b2c2d2e2f2g2h2 i2j2k2l2m2n2o2p2 q2r2s2t2u2v2w2x2 y2z2A2B2C2D2E2F2
	a5b5c5d5e5f5g5h5 i5j5k5l5m5n5o5p5 q5r5s5t5u5v5w5x5 y5z5A5B5C5D5E5F5
	a1b1c1d1e1f1g1h1 i1j1k1l1m1n1o1p1 q1r1s1t1u1v1w1x1 y1z1A1B1C1D1E1F1
	a4b4c4d4e4f4g4h4 i4j4k4l4m4n4o4p4 q4r4s4t4u4v4w4x4 y4z4A4B4C4D4E4F4
	a0b0c0d0e0f0g0h0 i0j0k0l0m0n0o0p0 q0r0s0t0u0v0w0x0 y0z0A0B0C0D0E0F0

	'lo and behold! finito!
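
	Here is the whole five-pass sequence as a C sketch,
	so you can feed it test patterns. It assumes r[0] is
	the longword with pixel "a" in its top byte (as on a
	big-endian 68k), and the names merge32, pass and
	c2p_5pass are made up for this guide. "8x2" is read
	as "8-bit merge between registers 2 apart".

```c
#include <stdint.h>

/* One merge between two longwords; "keep" marks the static bits
   of the top register (hypothetical helper). */
static void merge32(uint32_t *top, uint32_t *bot, int shift, uint32_t keep)
{
    uint32_t t = *top, b = *bot;
    *top = (t & keep) | ((b & keep) >> shift);
    *bot = (b & ~keep) | ((t & ~keep) << shift);
}

/* Merge all register pairs that are "step" registers apart. */
static void pass(uint32_t r[8], int shift, int step, uint32_t keep)
{
    for (int i = 0; i < 8; i++)
        if (!(i & step))
            merge32(&r[i], &r[i + step], shift, keep);
}

/* 32 chunky pixels in, 8 bitplane longwords out. */
void c2p_5pass(uint32_t r[8])
{
    pass(r,  8, 2, 0xff00ff00u);   /*  8 x 2 */
    pass(r,  4, 1, 0xf0f0f0f0u);   /*  4 x 1 */
    pass(r, 16, 4, 0xffff0000u);   /* 16 x 4 */
    pass(r,  2, 4, 0xccccccccu);   /*  2 x 4 */
    pass(r,  1, 2, 0xaaaaaaaau);   /*  1 x 2 */
}
```

	Note that the planes come out in the order of the
	last chart: r[0..7] = bit 7, 3, 6, 2, 5, 1, 4, 0.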


-10-	Transpositions summary


	(i)	 8 x 2
	(ii)	 4 x 1
	(iii)	16 x 4
	(iv)	 2 x 4
	(v)	 1 x 2


-11-	Idea for HAM6 C2P


	2 4bit pixels in 8byte (ham12bit truecolor)
	read 4 longs; then do 8x1 2x1 16x2 4x2 1x2


-12-	Further optimization


	We start utilising the eor (^) logical operation.
	Eor works like this:

	A ^ 0 = A
	A ^ A = 0
	A ^ B ^ C = (A ^ B) ^ C = A ^ (B ^ C)

	So one can change the order of EORs sometimes.
	For instance, if you currently do:

	eor.l d0,d1
	eor.l d2,d1

	Then you could change it into

	eor.l d0,d2
	eor.l d2,d1

	The value in d1 will be the same.
	(d2 will be trashed, though.)

	In the above example, one could say that 
	d2 contains two eor's - the "eor.l d2,d1" does 
	what the first version's "eor.l d0,d1; eor.l d2,d1" did.
	This can be very useful later on...
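
	A tiny C check of that reordering, if you don't
	trust the algebra. The two function names are made
	up; each one mirrors one of the two instruction
	pairs above.

```c
#include <stdint.h>

/* eor.l d0,d1 ; eor.l d2,d1 */
uint32_t eor_two_steps(uint32_t d0, uint32_t d1, uint32_t d2)
{
    d1 ^= d0;
    d1 ^= d2;
    return d1;
}

/* eor.l d0,d2 ; eor.l d2,d1   (d2 is trashed on the way) */
uint32_t eor_folded(uint32_t d0, uint32_t d1, uint32_t d2)
{
    d2 ^= d0;     /* d2 now carries both eors */
    d1 ^= d2;
    return d1;
}
```

	Both return the same d1 for any input, which is
	just the associativity rule from the list above.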

	Here's a normal merge once again:

	move.l  d0,d6
	move.l  d1,d7
	and.l   #$f0f0f0f0,d0
	and.l   #$f0f0f0f0,d7
	and.l   #$0f0f0f0f,d6
	and.l   #$0f0f0f0f,d1
	lsl.l   #4,d6
	lsr.l   #4,d7
	or.l    d6,d1
	or.l    d7,d0

	... just the AND.L's thrown around a little
	so that they have the $f0f0... ands first, and
	then the $0f0f's. Also notice that we're only
	using 8 dataregs now. :)

	Now, with an eor you can split a register into
	its high and low nibbles using just one AND:

	move.l 	d0,d7      	; d7 = aaaaaaaa bbbbbbbb cccccccc dddddddd
	and.l 	#$f0f0f0f0,d0	; d0 = aaaa.... bbbb.... cccc.... dddd....
	eor.l 	d0,d7 		; d7 = ....aaaa ....bbbb ....cccc ....dddd

	Now how about that? :)

	Well, so let's have a look at what we can
	make the merge-op to look like with two
	eors instead:

	move.l  d0,d6		; d6 = aaaaaaaa bbbbbbbb cccccccc dddddddd
	move.l  d1,d7		; d7 = eeeeeeee ffffffff gggggggg hhhhhhhh
	and.l   #$f0f0f0f0,d0	; d0 = aaaa.... bbbb.... cccc.... dddd....
	and.l   #$f0f0f0f0,d7	; d7 = eeee.... ffff.... gggg.... hhhh....
	eor.l   d0,d6		; d6 = ....aaaa ....bbbb ....cccc ....dddd
	eor.l   d7,d1		; d1 = ....eeee ....ffff ....gggg ....hhhh
	lsl.l   #4,d6		; d6 = aaaa.... bbbb.... cccc.... dddd.... (the low nibbles, moved up)
	lsr.l   #4,d7		; d7 = ....eeee ....ffff ....gggg ....hhhh (the high nibbles, moved down)
	eor.l   d6,d1		; was or
	eor.l   d7,d0		; was or

	With the mask kept in a register, this becomes:

move.l  #$f0f0f0f0,d4
move.l  d0,d6           ; d6 = a7a6a5a4a3a2a1a0 b7b6b5b4b3b2b1b0 c7c6c5c4c3c2c1c0 d7d6d5d4d3d2d1d0
move.l  d1,d7		; d7 = e7e6e5e4e3e2e1e0 f7f6f5f4f3f2f1f0 g7g6g5g4g3g2g1g0 h7h6h5h4h3h2h1h0
and.l   d4,d0		; d0 = a7a6a5a4........ b7b6b5b4........ c7c6c5c4........ d7d6d5d4........
and.l   d4,d7		; d7 = e7e6e5e4........ f7f6f5f4........ g7g6g5g4........ h7h6h5h4........
eor.l   d0,d6		; d6 = ........a3a2a1a0 ........b3b2b1b0 ........c3c2c1c0 ........d3d2d1d0
eor.l   d7,d1		; d1 = ........e3e2e1e0 ........f3f2f1f0 ........g3g2g1g0 ........h3h2h1h0
lsl.l   #4,d6		; d6 = a3a2a1a0........ b3b2b1b0........ c3c2c1c0........ d3d2d1d0........
lsr.l   #4,d7		; d7 = ........e7e6e5e4 ........f7f6f5f4 ........g7g6g5g4 ........h7h6h5h4
eor.l   d6,d1		; d1 = a3a2a1a0e3e2e1e0 b3b2b1b0f3f2f1f0 c3c2c1c0g3g2g1g0 d3d2d1d0h3h2h1h0
eor.l   d7,d0		; d0 = a7a6a5a4e7e6e5e4 b7b6b5b4f7f6f5f4 c7c6c5c4g7g6g5g4 d7d6d5d4h7h6h5h4

	There you are.

	let's have a look at the AND/EORing in the merge.
	what will happen to d0?

	The bits

	........a3a2a1a0 ........b3b2b1b0 ........c3c2c1c0 ........d3d2d1d0

	become removed,	and

	........e7e6e5e4 ........f7f6f5f4 ........g7g6g5g4 ........h7h6h5h4

	become added on. That is like:

move.l  #$0f0f0f0f,d4
move.l  d0,d6	; d6 = a7a6a5a4a3a2a1a0 b7b6b5b4b3b2b1b0 c7c6c5c4c3c2c1c0 d7d6d5d4d3d2d1d0
move.l  d1,d7	; d7 = e7e6e5e4e3e2e1e0 f7f6f5f4f3f2f1f0 g7g6g5g4g3g2g1g0 h7h6h5h4h3h2h1h0
lsr.l   #4,d7	; d7 = ........e7e6e5e4 e3e2e1e0f7f6f5f4 f3f2f1f0g7g6g5g4 g3g2g1g0h7h6h5h4
and.l   d4,d6	; d6 = ........a3a2a1a0 ........b3b2b1b0 ........c3c2c1c0 ........d3d2d1d0
and.l   d4,d7	; d7 = ........e7e6e5e4 ........f7f6f5f4 ........g7g6g5g4 ........h7h6h5h4
eor.l   d7,d6	; d6 = d6 ^ d7
eor.l   d6,d0	; d0 = a7a6a5a4e7e6e5e4 b7b6b5b4f7f6f5f4 c7c6c5c4g7g6g5g4 d7d6d5d4h7h6h5h4

	Either you could do

	eor.l d7,d0
	eor.l d6,d0

	... or you could do

	eor.l d7,d6
	eor.l d6,d0


-13-	Optimizing to 7 instruction merge op


	Let's make a small optimization.
	instead of ANDing against both d6 & d7,
	and then EORing d7 onto d6, let's EOR d7
	onto d6, and then AND against d6 only:

move.l  #$0f0f0f0f,d4
move.l  d0,d6	; d6 = a7a6a5a4a3a2a1a0 b7b6b5b4b3b2b1b0 c7c6c5c4c3c2c1c0 d7d6d5d4d3d2d1d0
move.l  d1,d7	; d7 = e7e6e5e4e3e2e1e0 f7f6f5f4f3f2f1f0 g7g6g5g4g3g2g1g0 h7h6h5h4h3h2h1h0
lsr.l   #4,d7	; d7 = ........e7e6e5e4 e3e2e1e0f7f6f5f4 f3f2f1f0g7g6g5g4 g3g2g1g0h7h6h5h4
eor.l   d7,d6	; d6 = d6 ^ d7
and.l   d4,d6	; (d6 ^ d7) & $0f0f0f0f
eor.l   d6,d0	; d0 = a7a6a5a4e7e6e5e4 b7b6b5b4f7f6f5f4 c7c6c5c4g7g6g5g4 d7d6d5d4h7h6h5h4

	what (bits) did we eor against d0? namely,

	........a3a2a1a0 ........b3b2b1b0 ........c3c2c1c0 ........d3d2d1d0	and
	........e7e6e5e4 ........f7f6f5f4 ........g7g6g5g4 ........h7h6h5h4

	now I look at the "old" merge, and check what should get EORed against d1

	e7e6e5e4........ f7f6f5f4........ g7g6g5g4........ h7h6h5h4........	and
	a3a2a1a0........ b3b2b1b0........ c3c2c1c0........ d3d2d1d0........

	(the 1st line should be removed, and the 2nd added to d1)

	These are the same, just shifted a little..
	"Remove the 7654 bits, insert the 3210 ones << 4"

	So let's give the last piece of code we touched a look
	In d6 the (d6 ^ d7) & $0f0f0f0f mask resides;
	All that it takes to mangle d1 is:

	lsl.l #4,d6
	eor.l d6,d1

	and then we're done:

move.l  d1,d7		; d7 = e7e6e5e4e3e2e1e0 f7f6f5f4f3f2f1f0 g7g6g5g4g3g2g1g0 h7h6h5h4h3h2h1h0
lsr.l   #4,d7		; d7 = ........e7e6e5e4 e3e2e1e0f7f6f5f4 f3f2f1f0g7g6g5g4 g3g2g1g0h7h6h5h4
eor.l   d0,d7		; d7 = mask0 ^ mask1   still dirty
and.l   #$0f0f0f0f,d7	; d7 = mask
eor.l   d7,d0		; d0 = a7a6a5a4e7e6e5e4 b7b6b5b4f7f6f5f4 c7c6c5c4g7g6g5g4 d7d6d5d4h7h6h5h4
lsl.l   #4,d7		; d7 = mask << 4
eor.l   d7,d1		; d1 = a3a2a1a0e3e2e1e0 b3b2b1b0f3f2f1f0 c3c2c1c0g3g2g1g0 d3d2d1d0h3h2h1h0

	Just the same thing as before, just a little
	different order when sweeping together the stuff.
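
	To convince yourself that the 7-instruction version
	really does the same job as the "normal" merge, here
	is a C sketch of both. The names merge_classic and
	merge_eor are made up; the second function follows
	the listing above one statement per instruction.

```c
#include <stdint.h>

/* The "normal" merge: mask, shift, or. */
void merge_classic(uint32_t *d0, uint32_t *d1)
{
    uint32_t d6 = *d0 & 0x0f0f0f0fu;   /* low nibbles of d0, will move  */
    uint32_t d7 = *d1 & 0xf0f0f0f0u;   /* high nibbles of d1, will move */
    *d0 = (*d0 & 0xf0f0f0f0u) | (d7 >> 4);
    *d1 = (*d1 & 0x0f0f0f0fu) | (d6 << 4);
}

/* The 7-instruction eor version. */
void merge_eor(uint32_t *d0, uint32_t *d1)
{
    uint32_t d7 = *d1;      /* move.l d1,d7            */
    d7 >>= 4;               /* lsr.l  #4,d7            */
    d7 ^= *d0;              /* eor.l  d0,d7            */
    d7 &= 0x0f0f0f0fu;      /* and.l  #$0f0f0f0f,d7    */
    *d0 ^= d7;              /* eor.l  d7,d0            */
    d7 <<= 4;               /* lsl.l  #4,d7            */
    *d1 ^= d7;              /* eor.l  d7,d1            */
}
```

	Feeding d0 = $12345678 and d1 = $9abcdef0 to either
	one gives d0 = $193b5d7f and d1 = $2a4c6e80.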


-14-	Optimizing towards 6 instruction merge op

	Instead of doing the bit-shuffling on
	the diagonal, one of the regs is ROR/ROLed
	and then the job is done vertically.

	ror.l #4,d1
	move.l d0,d7
	eor.l d1,d7
	and.l #$0f0f0f0f,d7	; there we have a simple "vertical" mask
	eor.l d7,d0
	eor.l d7,d1
	rol.l #4,d1	

	You can drop the last rol away. Then you need
	a new set of transpositions, and add a few
	extra rotations to the code, to unrotate
	registers back for writing to planes.

	So the code will end up
	5*4*6 + N, N = 4..7 instructions.
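
	The rotate-based merge can be modelled in C the
	same way. C has no rotate operator, so ror32 and
	rol32 below are small made-up helpers standing in
	for ror.l/rol.l; merge_rot is also a made-up name.
	This sketch keeps the final rol, so it gives the
	same result as the ordinary merge.

```c
#include <stdint.h>

/* Rotate helpers; only valid for 0 < n < 32. */
static uint32_t ror32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }
static uint32_t rol32(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }

void merge_rot(uint32_t *d0, uint32_t *d1)
{
    uint32_t d7;
    *d1 = ror32(*d1, 4);    /* ror.l  #4,d1          */
    d7 = *d0;               /* move.l d0,d7          */
    d7 ^= *d1;              /* eor.l  d1,d7          */
    d7 &= 0x0f0f0f0fu;      /* and.l  #$0f0f0f0f,d7  "vertical" mask */
    *d0 ^= d7;              /* eor.l  d7,d0          */
    *d1 ^= d7;              /* eor.l  d7,d1          */
    *d1 = rol32(*d1, 4);    /* rol.l  #4,d1          */
}
```

	Dropping the last rol32 gives the 6-instruction
	form, at the price of d1 staying rotated.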


-15-	Blitter


	Blitter with A&D channels active on a500 copied 74k/frame I think
	Blitter with ABD channels on 1200 works about 40k/frame

	First it reads one word from each of the
	channels A, B and C. You can turn off any of
	them, so that they contain fixed bitmasks instead.

	It can shift A and/or B by N pixels (specifiable)
	After that it will combine the three values
	into one using the "minterms"

	I won't talk about minterms now, but I'll tell
	you that one can program the blitter to do this

	(A & ~C) | ((B >> 4) & C)

	... and imagine that C = $0f0f

	The first part will keep the _high_ nibbles
	from source A, and the second part will insert
	the _high_ nibbles from source B into the _low_
	nibbles of the destination, which is 50% of
	the "old" merge-op.

	First do that blit, and then another blit,
	going in the opposite direction because the
	blitter can't LSL/LSR freely (it can only
	shift in one direction by "delaying"):

	((A << 4) & ~C) | (B & C)

	That'll give you the other half of the data.
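
	The two blitter formulas can be tried out on single
	words in C. This is only a model of the logic, not
	of the blitter itself: the real thing carries shifted
	bits over from the neighbouring word, but those bits
	fall under the mask and are thrown away anyway. The
	names blit_pass1, blit_pass2 and C_MASK are made up.

```c
#include <stdint.h>

/* C channel: fixed mask selecting the low nibbles. */
#define C_MASK 0x0f0fu

/* First blit: (A & ~C) | ((B >> 4) & C) */
uint16_t blit_pass1(uint16_t a, uint16_t b)
{
    return (uint16_t)((a & ~C_MASK) | ((b >> 4) & C_MASK));
}

/* Second blit: ((A << 4) & ~C) | (B & C) */
uint16_t blit_pass2(uint16_t a, uint16_t b)
{
    return (uint16_t)(((a << 4) & ~C_MASK) | (b & C_MASK));
}
```

	With A = $1234 and B = $9abc, the first blit gives
	$193b and the second gives $2a4c, which are the two
	halves of a 4-bit merge.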

	There's one more thing to see here though.
	The blitter only deals with words.
	Therefore you can jump around with the modulos,
	and by doing so, you can get the 16bit merge "for free".
	So, let's have merge order 8x2 4x1 1x2 16x4 2x4.
	After the 1x2, let the CPU write the data to chipmem,
	and then the blitter does the 16x4 while performing the 2x4.
	There's cpu3blit1.

	But remember, this is only useful on 020/030;
	on 040+ cpu5 is the only way to go for 1x1. :)

