Cryptanalysis 2 (by alaric)
I went back and reread my discussion on how many rounds you need, and found it wrong; I came up with an approximate formula for how many bits are changed in the output for a single-bit change in the input for R rounds, and of course, this is nearly all of them in one round, and all of them in two rounds. But we knew that; that's not what I wanted to work out. So for now I'm considering that the avalanche function just spreads changes around, and I'll assume that each round (with a 16 bit S-box) causes an average change of 8 bits in the next round from a single bit change on the input; this is the average number of bits changed in the output of a 16-bit S-box from a single bit input change, and it's also the number of bits of data I spread across the blocks in the avalanche stage. So the change after R rounds (including the initial row of S-boxes) is 8R+1; if we want that to be more than or equal to the message length L, then we need (R+1)log28 >= log2 L. log28 = 3, so R >= (log2 L)/3 - 1; for a 1024-bit message, that's 2 and a third rounds, so 3 rounds will do. In the case of 8-bit S-boxes with 8-bit avalanche between, I'd use the fact that an 8-bit S-box will cause on average four bits of change in the output for a single bit input change, so use R >= (log2 L)/2 - 1 instead; four rounds in that case. More rounds, but much less S-box space is needed.
I've tried attacking the 16-bit S-box with 8-bit avalanche algorithm by keeping some input bytes equal and changing others and all that, but it's even harder than the others, because of the way the avalanche function makes the input of each S-box in the next layer a function of the outputs of lots of other S-boxes. You can get somewhere in the first round by changing only two adjacent input bytes and keeping all the other input byte pairs constant, so that the identical S-box outputs after the first layer cancel each other out in the XOR for two of the outputs of the avalanche - but that then gives you insufficient control over the inputs to the next level of S-boxes to do anything useful; all of the inputs to the second row of S-boxes will be changing. You definitely need a minimum of two rounds, of course.
So unless I've missed something, we may have another winner. The 8-bit S-box algorithm takes up much less RAM, but its security comes from using unrelated S-boxes, so it doesn't scale to arbitrary block sizes easily; it'll only be practical for small bounded block sizes such as 256 bits, while the 16-bit algorithm uses a constant amount of memory for the S-boxes - but that constant amount is large, the same as 512 8-bit S-boxes, or 170 bytes using the above algorithm with two rounds; but that's a 1360 bit block and we should be using four rounds to securely encrypt it according to the above reasoning, so perhaps we should split those 512 S-boxes across 5 layers and encrypt up to 102 bytes. Unless we're wanting a fast parallel implementation in hardware, 16-bit S-boxes become more memory efficient with blocks of more than a hundred bytes.
Which is more secure? The 8-bit algorithm evades my analysis due to the unrelated S-boxes, while the 16-bit algorithm evades it due to having avalanche over a smaller block size than the S-box width, destroying the symmetries I have attacked earlier algorithms on.
So perhaps a combination of the two would be best - perhaps alternating rounds of re-use of the same 16-bit S-box then differing 8-bit S-boxes.