Cryptanalysis 2 (by alaric)
One problem with using 16-bit S-boxes is that it uses 128KB of RAM. The other problem is that, when implementing the algorithm as a circuit, it'd be nice to be able to do the two S-boxes in each layer in parallel - and ideally to pipeline it and do all the layers in parallel, too. Implementing the S-box as RAM needs 128KB, or 1Mbit of storage, which is large and expensive compared to the hardware needed for many other algorithms; and I'm not sure if a 1Mbit RAM with six read ports plus a write port for keying would be practical, so you may need multiple copies of it (three dual-ported RAMs?).
So I also wondered about using 8-bit S-boxes, which consume a mere 2Kbits. Take an input block, of any number of bytes really - for comparison with previous algorithms, I use four bytes, ABCD. Pass each byte through the S-box in turn, then run the resulting four bytes EFGH through my avalanche function to get E'F'G'H' (E'=XOR(F,G,H), F'=XOR(E,G,H), G'=XOR(E,F,H), H'=XOR(E,F,G)), then S-box each of them to get the outputs; A'=S(E'), B'=S(F'), C'=S(G'), D'=S(H').
This algorithm will obviously be weak; narrow S-boxes, only using a single round of the avalanceh function with an S-box the same size as the avalanche block so changes only propogate across three subblocks, etc.
However, try as I might, I can't find a pattern of chosen plaintexts that gets you anything useful other than S(S(N)) for all N. I've tried keeping ABC constant and varying D (A'=B'=C'=S(S(D))). I've tried varying AB while CD=AB; you get E'=XOR(S(B),S(C),S(D)) but D=B so that's just E'=S(C) and A=C so A'=S(S(A)), and so on. And I've tried keeping A=B=C and varying D, to similar effect.
So, I wonder to myself, is it possible to deduce S from S2? I'm not honestly sure. I noticed that there's a wide range of possible self-inverting S-boxes - S(X) = 256-X, S(X) = XOR(X,N), S(X)=N-X, etc all of which would result in the same S2 - the identity S-box. But scribbling around on paper has failed to reveal other groups of S-boxes which all have the same square. Nor has it managed to reveal a successful reverse-engineering of S from S2.
But the fact that all self-inverting S-boxes square to the same identity S-box tells me that only a subset of all possible S-boxes can be squares of other S-boxes, which is interesting.
So no, I'd say that this algorithm is not secure, because I have a hunch that finding S2 is just too much to give away. But it's tantilisingly tricky to extract much information from, unless I've missed something else. However, adding more rounds will not help; the attacks to extract S2 all work by keeping some input bytes equal to each other, and the same equalities appear on the output, and would continue to appear for any number of rounds. Instead of S2 we'd get S3, and my argument that finding S from S2 might not always be possible due to all self-inverting S having the same S2 does not hold for S3 - so this may be less secure.
However, since 8-bit S-boxes are small at 2KBits, we might just use eight differently keyed S-boxes to implement the algorithm, using 16KBits (and highly parallelisable in hardware), at which point every output byte is a function of three input bytes, each passed through their own S-box, the results XORed together, and the result of that then passed through an S-box corresponding to that output.
Perhaps this can be attacked by having A=B=C and trying all 256 D. That will give you all 256 possible outputs for A', B', and C', but you still don't know what the inputs to each output S-box were; they won't even be the same since each output S-box has, as input, the varying values of H (the output of D's input S-box) XORed with different combinations of E, F, and G, the values of which will depend on the different input S-boxes.
Yep, this one looks quite secure for a 32-bit block algorithm, but would take much less hardware than the previous one. However, we still have the enumerate-all-plaintexts problem, since it's using a 32 bit block. And each output byte is still a function of only three input bytes, not all four.
So why not scale it up to a 256-bit block? Well, we'd need eight times as many S-boxes; not 16Kbits any more, but 128Kbits. Still not too bad.
And we should make every output bit a function of every input bit by having an extra round of avalanche then S-box. So an extra 64Kbits of S-boxes; 192Kbits now. I'm beginning to like it.
But we may still have a weakness. With two 8-bit avalanche functions we have 16 bits of 'avalanche bandwidth', but 256-bits of data being encrypted. If we keep all but three input bytes constant and go through all 224 combinations of those three bytes, then the outputs of the second layer of S-boxes that are functions of all three changing bytes will, as a group, only go through 256 different values, since no matter how those three bytes change, the XOR sum of them can only have 256 different values. The three output bytes which are a function of only two of the changing input bytes are each functions of different choices of two out of three, meaning that those three bytes will go through all 224 different values.
However, the second avalanche function and third row of S-boxes destroys this pattern; the outputs which are a function of all three highly-changing intermediate values will only have 256 possible different changes from those three, as before, but each output byte will also be a different function of all but one of the other bytes, and thanks to the previous round, they are also changing, whereas in the first round they were constant. So there will be 224 distinct patterns of them; and I think that means there's 224 distinct combinations of any group of three or more output bytes, which doesn't help us deduce anything.