Following up the second week task for the excellent Cryptography course by Prof. Jonathan Katz at Coursera, and took the second programming assignment. This time it is about breaking the one-time pad encryption when the code was reused, and more than one ciphertext is intercepted. Again, the suggested approach required too much manual work, and I decided to extend the previous solution and see whether it would work here as well.
The analysis of previous solution applies here mostly, but with two major differences:
- Instead of applying the key to the same ciphertext at different offsets, we apply the key at the same offset but to different ciphertexts – and the key should decode the text into something meaningful;
- For the part of key recovered we would only find the length of the words recovered; finding out encoded “the ” in the text would give us only 4 bytes from the key, out of 32. Of course it is possible that “the ” appears at different offsets in different ciphertexts – as it would likely be the real world case – but as soon as you start trying, you’ll find out this is not the case, and the appearance of “the ” in the ciphertext is minimal.
However it is still possible to perform the analysis on the text, and recover the key.
First we look at the ciphertext. Looking carefully at the end of it we can see it ends up with the same hex code, 1E with one exception. Since those are supposed to be complete English sentences, and a sentence typically ends up with a dot, we can safely assume 1E encodes a dot. Then it is easy to find what value was used in a key to XOR a dot into 1E, and see what the value 0F was encoded into, which further confirms our guess.
So this was an easy guess, but it does not help us at all with the other keys. What we do then? Use the dictionary attack. The text has other words besides “the”, and each of them has a chance to be found.
To start, I focused on adverbs (then, that, than, there, and so on), longer pronouns (you, she, they) and adverbs/conjunctions (what, where, why), followed by a space, a comma or a dot. Similar to the previous project, the following algorithm was used:
- For each target word in “the that then…”
- For each symbol in “space dot comma”
- Create a full target by appending the symbol to the key
- For offset in 0 … 31
- For ciphertext string in 0 – 7
- Create a key which, when applied at the offset to the current ciphertext, would decode the data at the offset as fulltarget;
- Apply the same key to other ciphertexts and validate the decoded characters as described below;
- Print those which pass the validations, together with the used key and offset.
- For ciphertext string in 0 – 7
- For each symbol in “space dot comma”
After the script runs, I looked at the validated text. Had to tight up the validation rules more because, unlike the previous assignment, digits were allowed in this text, so the first run created a number of false positives. All of them were eliminated by:
- Disallowing combinations such as “aa8” or “8aa”, i.e. digit followed by letter or a letter followed by the digit without space, as those would be unlikely in the text;
- Disallowing most of the symbols – since the text supposed to contain English sentences, a chance to see any kind of brackets or math signs was negliginble.
- If the dot or comma was present, it should be followed by space, i.e. combinarions such as “a.b” are not allowed.
After the second run, very few texts looks meaningful, and after those were put into the key and the ciphertexts decoded, it revealed enough material to work on. The parts of the words revealed which were confirmed (meaning they decoded the meaningful text in at least 3 sentences), were not useless – the script searched them in a dictionary of English words, and attempted all the words in which this part was present, starting from the longest words as they’d give us more key material. In five minutes the key was cracked completey using the Perl script. Using C code would probably be 10x faster, but writing Perl code is faster and I wasn’t in hurry. After all, this is not red line between DC and Moscow.
Finally, what is the benefit of this versus the discussed solution with XORing ciphertexts? The obvious benefit is that this solution does not rely on spaces in text, and will work in case when the spaces in text match completely (hence decoding a space will not give you any other characters in the key), and even if the text has no spaces at all. For example, this algorithm will likely work on BASE64-encoded plaintext (which will contain no spaces).