Cryptopals challenge 6: Break repeating-key XOR

This is my write up of the sixth Cryptopals challenge, using Python3 as my language of choice. The challenge:

Break repeating-key XOR

It is officially on, now.

This challenge isn’t conceptually hard, but it involves actual error-prone coding. The other challenges in this set are there to bring you up to speed. This one is there to qualify you. If you can do this one, you’re probably just fine up to Set 6.

There’s a file here. It’s been base64’d after being encrypted with repeating-key XOR.

Decrypt it.

Here’s how:

  1. Let KEYSIZE be the guessed length of the key; try values from 2 to (say) 40.
  2. Write a function to compute the edit distance/Hamming distance between two strings. The Hamming distance is just the number of differing bits. The distance between:

this is a test

and

wokka wokka!!!

is 37. Make sure your code agrees before you proceed.

  1. For each KEYSIZE, take the first KEYSIZE worth of bytes, and the second KEYSIZE worth of bytes, and find the edit distance between them. Normalize this result by dividing by KEYSIZE.
  2. The KEYSIZE with the smallest normalized edit distance is probably the key. You could proceed perhaps with the smallest 2-3 KEYSIZE values. Or take 4 KEYSIZE blocks instead of 2 and average the distances.
  3. Now that you probably know the KEYSIZE: break the ciphertext into blocks of KEYSIZE length.
  4. Now transpose the blocks: make a block that is the first byte of every block, and a block that is the second byte of every block, and so on.
  5. Solve each block as if it was single-character XOR. You already have code to do this.
  6. For each block, the single-byte XOR key that produces the best looking histogram is the repeating-key XOR key byte for that block. Put them together and you have the key.

This code is going to turn out to be surprisingly useful later on. Breaking repeating-key XOR (“Vigenere”) statistically is obviously an academic exercise, a “Crypto 101” thing. But more people “know how” to break it than can actually break it, and a similar technique breaks something much more important.

No, that’s not a mistake.

We get more tech support questions for this challenge than any of the other ones. We promise, there aren’t any blatant errors in this text. In particular: the “wokka wokka!!!” edit distance really is 37.

This challenge was tough for me. I wasn’t familiar with Hamming distance, and also got a bit confused in steps 3, 4, and 6. While researching my confusion, I found this blog post extremely helpful, and it probably saved me lots of frustration.

Solving this challenge took multiple steps: Read and decode the file contents, implement a loop to cycle through the potential keysize range, write a function to calculate Hamming distance, normalize the Hamming distance, break the ciphertext into chunks the length of the determined keysize and transpose the blocks, and then implement single-key XOR brute forcing.

Read and decode the file contents

with open('6.txt') as input_file:
    ciphertext = base64.b64decode(input_file.read())

Implement a loop to cycle through the potential keysize range
The challenge recommends guessing keysizes between 2 and 40, so I stuck with that:

# Take the keysize from suggested range
for keysize in range(2,41):

At this point I decided to tackle the Hamming distance function.

Calculate Hamming distance

As the challenge mentions, the Hamming distance is the number of differing bits. Let’s walk through how to do this manually.

First, let’s compare the binary for to equal length byte-strings, b’jake’ and ‘fire’:


>>> bin_1 = [bin(byte) for byte in b'jake']
>>> bin_2 = [bin(byte) for byte in b'fire']
>>> bin_1
['0b1101010', '0b1100001', '0b1101011', '0b1100101']
>>> bin_2
['0b1100110', '0b1101001', '0b1110010', '0b1100101']

An easy way to find the differences in the bits is to XOR these values:


>>> bytes_1 = [byte for byte in b'jake']
>>> bytes_2 = [byte for byte in b'fire']
>>> xor_bytes = [b1 ^ b2 for b1,b2 in zip(bytes_1, bytes_2)]
>>> xor_bytes
[12, 8, 25, 0]

So each of these now contain the total number of differing bits (the last number is 0, as the bytes compared were b’e’ and b’e’, which have no differences). Let’s count the ‘1’s:


>>> hamming_distance = 0
>>> for byte in xor_bytes:
...     hamming_distance += sum([1 for bit in bin(byte) if bit == '1'])
...
>>> hamming_distance
6

So, the Hamming distance between b’jake’ and b’fire’ is 6. Let’s try this again with the example values. If the logic is correct, the Hamming distance should be 37:


>>> bytes_1 = [byte for byte in b'this is a test']
>>> bytes_2 = [byte for byte in b'wokka wokka!!!']
>>> xor_bytes = [b1 ^ b2 for b1,b2 in zip(bytes_1, bytes_2)]
>>> hamming_distance = 0
>>> for byte in xor_bytes:
...     hamming_distance += sum([1 for bit in bin(byte) if bit == '1'])
...
>>> hamming_distance
37

Here’s what this looks like in a function:

Normalize the Hamming distance

The challenge states:

“For each KEYSIZE, take the first KEYSIZE worth of bytes, and the second KEYSIZE worth of bytes, and find the edit distance between them. Normalize this result by dividing by KEYSIZE”

When I did this initially I did it incorrectly. For each piece of ciphertext, I simply took a first and second keysize worth of bytes, ran it through the Hamming distance function, and then divided the results by the keysize. So for example if the cipher was asdfghjklzxcvbnm and my keysize was 2, I would take ‘as’ and ‘df’, send it to the Hamming distance function, divide the result by 2, and then move onto the next keysize and repeat. Where I went wrong was only taking the first and second worth of bytes. After some research, what I did instead was split the ciphertext up into keysize-size chunks, take the first two chunks and get the Hamming distance, and then repeat this process for all chunks. Here is what the code looks like:

Each of these distances were then further normalized by taking the average of the distances of the chunks. The results were then stored in a dictionary that was appended to a list that was initialized at the beginning of the function:


result = {
    'key': keysize,
    'avg distance': sum(distances) / len(distances)
    }
average_distances.append(result)

The list is then sorted by the keys that have the average distance, and the keylength with the lowest distance is kept:

possible_key_lengths = sorted(average_distances, key=lambda x: x['avg distance'])[0]

The whole purpose of this step is to find the correct keysize (or most likely keysize). If you are interested in another method to determine the potential keysize or how this method works to determine the keysize, take a read at the Kasiski examination (the alternate method for determining keysize) and Friedman test (what is implemented for this challenge) here.

Break the ciphertext into chunks the length of the determined keysize and transpose the blocks

It took me a while to figure out what I was supposed to do at this point. I decided to turn to a book that I previously worked through when learning Python, Hacking secret ciphers with Python, by Al Sweigart, which has a chapter dealing with hacking a Vigenere cipher. The purpose of this step (transposing the blocks), is to recover the key character by character. As an example, let’s look at encrypting a message with the same key in the previous challenge.


>>> message = b'the cat in the hat'
>>> key = b'ICE'

Since the key is only three characters, every 3rd character will be encrypted with the same key.

For example, the following highlighted characters will be encrypted with ‘I’:


>>> print("{}\n{}".format(message, key*6))
b'the cat in the hat'
b'ICEICEICEICEICEICE'

A longer message would have proved a better example, but if I were to single key XOR brute force a message made up of those highlighted characters, the key ‘I’ would be the most likely key candidate.

Here is how I implemented this:

As an example, here is how the block would be built if we transposed the message ‘the cat in the hat’ with a possible key length of three:


>>> message
b'the cat in the hat'
>>> possible_key_length = 3
>>> for i in range(possible_key_length):
...     block = b''
...     for j in range(i, len(message), possible_key_length):
...             block += bytes([message[j]])
...             print(block)
...
b't'
b't '
b't t'
b't tn'
b't tnh'
b't tnhh'
b'h'
b'hc'
b'hc '
b'hc  '
b'hc  e'
b'hc  ea'
b'e'
b'ea'
b'eai'
b'eait'
b'eait '
b'eait t'
>>>

If this isn’t clear, take a look in the aforementioned book at how a Vigenere cipher is created and cracked, and it should give you a better understanding of how and why the bytes are transposed in this way.

Implement single-key XOR brute forcing

I’ve already explained the code to do this in a previous post, so I won’t go through it again, but each block created in the previous step is ran against the single-key XOR brute forcing code. Here is the full script:


import base64


def get_english_score(input_bytes):
    """Compares each input byte to a character frequency 
    chart and returns the score of a message based on the
    relative frequency the characters occur in the English
    language.
    """

    # From https://en.wikipedia.org/wiki/Letter_frequency
    # with the exception of ' ', which I estimated.
    character_frequencies = {
        'a': .08167, 'b': .01492, 'c': .02782, 'd': .04253,
        'e': .12702, 'f': .02228, 'g': .02015, 'h': .06094,
        'i': .06094, 'j': .00153, 'k': .00772, 'l': .04025,
        'm': .02406, 'n': .06749, 'o': .07507, 'p': .01929,
        'q': .00095, 'r': .05987, 's': .06327, 't': .09056,
        'u': .02758, 'v': .00978, 'w': .02360, 'x': .00150,
        'y': .01974, 'z': .00074, ' ': .13000
    }
    return sum([character_frequencies.get(chr(byte), 0) for byte in input_bytes.lower()])


def single_char_xor(input_bytes, char_value):
    """Returns the result of each byte being XOR'd with a single value.
    """
    output_bytes = b''
    for byte in input_bytes:
        output_bytes += bytes([byte ^ char_value])
    return output_bytes


def bruteforce_single_char_xor(ciphertext):
    """Performs a singlechar xor for each possible value(0,255), and
    assigns a score based on character frequency. Returns the result
    with the highest score.
    """
    potential_messages = []
    for key_value in range(256):
        message = single_char_xor(ciphertext, key_value)
        score = get_english_score(message)
        data = {
            'message': message,
            'score': score,
            'key': key_value
            }
        potential_messages.append(data)
    return sorted(potential_messages, key=lambda x: x['score'], reverse=True)[0]


def break_repeating_key_xor(ciphertext):
    """Attempts to break repeating-key XOR encryption.
    """
    average_distances = []

    # Take the keysize from suggested range 
    for keysize in range(2,41):

        # Initialize list to store Hamming distances for this keysize 
        distances = []

        # Break the ciphertext into chunks the length of the keysize
        chunks = [ciphertext[i:i+keysize] for i in range(0, len(ciphertext), keysize)]
        
        while True:
            try:
                # Take the two chunks at the beginning of the list and 
                # get the Hamming distance 
                chunk_1 = chunks[0]
                chunk_2 = chunks[1]
                distance = calculate_hamming_distance(chunk_1, chunk_2)

                # Normalize this result by dividing by KEYSIZE
                distances.append(distance/keysize)

                # Remove these chunks so when the loop starts over, the
                # Hamming distance for the next two chunks can be calculated
                del chunks[0]
                del chunks[1]

            # When an exception occurs (indicating all chunks have 
            # been processed) break out of the loop.
            except Exception as e:
                break
        result = {
            'key': keysize,
            'avg distance': sum(distances) / len(distances)
            }
        average_distances.append(result)
    possible_key_lengths = sorted(average_distances, key=lambda x: x['avg distance'])[0]
    possible_plaintext = []

    # Will populate with a single character as each transposed 
    # block has been single-byte XOR brute forced
    key = b''
    possible_key_length = possible_key_lengths['key']
    for i in range(possible_key_length):
        
        # Creates an block made up of each nth byte, where n
        # is the keysize
        block = b''
        for j in range(i, len(ciphertext), possible_key_length):
            block += bytes([ciphertext[j]])
        key += bytes([bruteforce_single_char_xor(block)['key']]) 
    possible_plaintext.append((repeating_key_xor(ciphertext, key), key)) 
    return max(possible_plaintext, key=lambda x: get_english_score(x[0]))


def repeating_key_xor(message_bytes, key):
    """Returns message XOR'd with a key. If the message, is longer
    than the key, the key will repeat.
    """
    output_bytes = b''
    index = 0
    for byte in message_bytes:
        output_bytes += bytes([byte ^ key[index]])
        if (index + 1) == len(key):
            index = 0
        else:
            index += 1
    return output_bytes


def calculate_hamming_distance(input_bytes_1, input_bytes_2):
    """Finds and returns the Hamming distance (number of differing 
    bits) between two byte-strings
    """
    hamming_distance = 0
    for b1, b2 in zip(input_bytes_1, input_bytes_2):
        difference = b1 ^ b2

        # Count the number of differences ('1's) and add to the hamming distance
        hamming_distance += sum([1 for bit in bin(difference) if bit == '1'])
    return hamming_distance


def main():
    with open('6.txt') as input_file:
        ciphertext = base64.b64decode(input_file.read())
    result, key = break_repeating_key_xor(ciphertext)
    print("Key: {}\nMessage: {}".format(key, result))


if __name__ == '__main__':
    main()

Tool creation

I also went ahead and made this into a tool, which you can find on my GitHub. I made a few modification to the above code, mainly selecting the 5 shortest normalized edit distances and process each of them. The tool allows you to read the data from a file or as an argument to a parameter, as well as decode the input using Base64, hex, and/or URL encoding. Feedback is welcome.

2 comments

Leave a Reply

Your email address will not be published. Required fields are marked *