I have signed up for the excellent Cryptography course by Prof. Jonathan Katz at Coursera, and took the first programming assignment which was about breaking the Vigenere cipher. The instructor explained one of the ways to do it, and recommended to rely on letter distribution in English. But while the suggested approach was interesting, I decided to try something different and see if it is possible to break the cipher without using the statistic distribution.
The analysis was based on the fact that this cipher cannot withstand the known plaintext attack. This means if the attacker has the original text and encoded text, recovering the key is trivial – by XORing the ciphertext against the original text you will get the key, which you can use to decrypt the rest of the text.
What it means for us? Let’s assume we know the key is 8 bytes and first word in the ciphertext was Hello. Since we know the ciphertext, recovering the first five bytes of the key would be trivial. And since the key repeats, we could just decode the first five bytes (at the offsets 0-5), then the next five bytes at the offsets 8-13 and so on. At least in one of those location we would guess three more characters based on context (such as longer words), and find out the remaining bytes.
The only problems we have here is that we do not know the key length, and we do not know what the first word is. Neither is a real problem, however.
Finding out the key length.
This will not always be the case, but finding the key length in this example is trivial due to the three things:
- Our knowledge that the encoded content is ASCII, meaning only lower 7 bits are used and the bit 8 (highest) is always unset in the plaintext;
- The key has bytes from 0-255, so some of the bytes have the bit 8 set, and some do not. This will not always be the case, but in this case it is;
- Therefore the bit 8 value in ciphertext (which is the result of XOR operation) would depend only on the bit 8 of the key, as the plaintext has this bit always unset.
What does this mean? This mean the following:
- If bit 8 in the key byte is unset, the bit 8 in the ciphertext would be unset;
- If bit 8 in the key byte is set, the bit 8 in the ciphertext would be set;
Therefore looking at the distribution of the bit 8 in the ciphertext we can see how the bit 8 was distributed in the key, and try to see if it follows any pattern.
Now let’s find out the value of the bit 8 for each byte in the ciphertext:
Position:00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Content: F9 6D E8 C2 27 A2 59 C8 7E E1 DA 2A ED 57 C9 3F E5 DA 36 ED 4E C8 7E F2 Bit 8: 1 0 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1
As you see, the distribution pattern is so specific that it takes only a glance to see how long the key is.
Finding the key
Now we know what the key length is, so if we knew the first 3-4 bytes of the context, this would make cracking the cipher trivial. Unfortunately we do not know them.
However we know the text is written in English. A “the” article is used frequently in English texts, and therefore we are guaranteed to have it in this text at least once. “The ” (which always follows by a space) gives us 4 bytes of plaintext, and we know it exists in the ciphertext. Remember the “hello” example from the above? Would it make it more difficult if you knew Hello string does not start at the offset 0? No, as long as you know the offset it wouldn’t be a problem. But what if you don’t know the offset?
If we don’t know the offset, we can guess it!
- Let’s assume the “the ” string is encoded at offset OFFSET (which starts from 0 to the end of plaintext);
- Then we generate the decryption key based on this assumption, i.e. the key which would decrypt the ciphertext at the offset OFFSET into “the “.
- If this key is the correct key, it would decrypt the ciphertext properly at the offset OFFSET (into “the “) but also at the offsets OFFSET + KEYLEN, OFFSET + KEYLEN * 2, OFFSET + KEYLEN * 3 and so on until the end of plaintext.
- If at any offset the decoding attempt results in the output which does not satisfy the plaintext conditions (i.e. the bytes which are ASCII and are not lower/uppercase English letters or symbols), we reject this key as invalid.
- If all ciphertext was decoded without errors, count the number of non-space symbols (such as dot and comma) in the deciphered text. If the number is abnormally high (my threshold was 4%), reject this key as invalid as well.
- If the key is invalid, we increase the OFFSET and start again from #2.
This algorithm has O(n2) complexity where n is the ciphertext length. So it is fairly effective in breaking the cipher.
Now, is this solution universal? Not really. The key length lookup, for example, would fail if the whole key had also the bit 8 unset. However it would be still possible to run the same search without a fixed key length. This would just take more time and attempts to try it with different key lengths.
Please note that this is a different example from finding out the AES encryption key which I have done before. That case was incorrect use of the otherwise secure cipher, while this cipher is insecure.
Is it right to infer from this that the key would include non ASCII characters too?Because if it is an ASCII char,then all 8th bits will be UNSET,then there couldnt be a single SET bit in ciphertext,which is clearly not the case.
Am I right in concluding that?This will reduce the possible values for the keys if it is so.
Yes, this is right to infer this, because if the key had only ASCII characters, there is no way the ciphertext would contain values such as F9.
Thanks for the reply.
I have understood the principle behind decrypting the text,but due to not knowing how to work with hex characters/strings in C specially loops(for,while) I am unable to form the code.Are there any resources or examples which I could refer to understand these things and develop the code?Even Examples in C# would do.
You do not need to implement it in C; I have implemented mine in Perl, for example. Regarding resources, there probably are, but I’m afraid if you struggle with loops and strings, it would be a much bigger challenge for you to finish this assignment.
I know the functioning of loops,problem is working with Hex strings.I am unable to figure out how do I take 2 characters at one time in the loop and use them as 1 hex character to perform the XOR.I guess we have to use access specifiers,but I am unfamiliar with them.
Which language are you working with? In C you can convert the two char hex string from buf into byte by using:
sscanf( buf, "%02X", &byte );
after finding the key lenght,i don’t understand well how you find the key and after finding the text …like for example on the example you gave the key is 7 if i am not mistaken…so what is the next step.thank you
It is all in the article. Sorry, I doubt I can describe it in more details than that. I suggest you review the lecture again wher Prof. Katz describes how to break the cipher, this knowledge is also necessary. Otherwise you can wait and I’ll post the scripts after the deadline.
hi!
I am trying the method in the lecture and the slides to try to find out the key length. Picking up nth characters, finding their frequencies and getting Σ q(i)^2 and finding the maximum
However I seem to be getting a peak at 2 :http://paste.debian.net/283963/
which is wrong. What part of it is flawed?
Help greatly appreciated. Thanks!
I suggest using the Coursera forums for help, as I used a different method, not related to character probabilities.