Cryptopals challenge 4: Detect single-character XOR encryption

This is my write up of the fourth Cryptopals challenge, using Python3 as my language of choice. The challenge:

Detect single-character XOR

One of the 60-character strings in this file has been encrypted by single-character XOR.

Find it.

(Your code from #3 should help.)

I believe there are two ways to solve this challenge: 1) Read the file line be line and detect the string that is single-byte XOR encrypted, then pass that single string to a function that does a single-byte XOR brute force; or 2) Read the file and perform a single-byte XOR brute force on each string, selecting the string with the highest English frequency score. I solved the challenge both ways:

Detect the string in the file that is single-byte XOR encrypted

I had previous written about a tool I made to help detect encryption by analyzing byte positions, so I attempted to use that for this challenge. Unfortunately, by looking at each line in the file and looking at the number of missing bytes, the overall entropy of each line looked pretty much the same:

I decided to put all of these values in a list and sort the list to check for outliers. Here is how this looks in Python:

# Create a list to store the line number and entropy results for each line
distribution_list = []

# Iterate and analyze each line in the file
for item in data:
    
    # Store the results in a dictionary
    results = {"line": line_number, "missing bytes": missing_byte_count}

    # Append the dictionary to the list created outside the for loop
    distribution_list.append(results)

    # Sort the distribution_variable by the key ‘missing bytes’ in the results dictionary,
    # only providing the top result
    most_missing = sorted(distribution_list, key=lambda x: x['missing bytes'], reverse=True)[0]
    least_missing = sorted(distribution_list, key=lambda x: x['missing bytes'])[0]

When I added this code to my script, it showed me this:

Most missing bytes:
Line: 171
Num missing bytes: 237

Least missing bytes:
Line: 41
Num missing bytes: 226

It turns out, the line in the file with the most missing bytes was the one that was single-byte XOR encrypted. I can’t promise that will work in the future, but looking for outliers is typically a good step in analyzing most things. I took line 171 of the file, performed a single-byte XOR brute force, and recovered the plaintext:

>>> ct = bytes.fromhex('7b5a4215415d544115415d5015455447414c155c46155f4058455c5b523f')
>>> def single_char_xor(input_bytes, char_value):
...     """Returns the result of each byte being XOR'd with a single value.
...     """
...     output_bytes = b''
...     for byte in input_bytes:
...         output_bytes += bytes([byte ^ char_value])
...     return output_bytes
...
>>> for i in range(0,256):
...     single_char_xor(ct, i)
...
b'{ZB\x15A]TA\x15A]P\x15ETGAL\x15\\F\x15_@XE\\[R?'
b'z[C\x14@\\U@\x14@\\Q\x14DUF@M\x14]G\x14^AYD]ZS>'
b'yX@\x17C_VC\x17C_R\x17GVECN\x17^D\x17]BZG^YP='
b'xYA\x16B^WB\x16B^S\x16FWDBO\x16_E\x16\\C[F_XQ<'
b'\x7f^F\x11EYPE\x11EYT\x11APCEH\x11XB\x11[D\\AX_V;'
b'~_G\x10DXQD\x10DXU\x10@QBDI\x10YC\x10ZE]@Y^W:'
<snip>
b"Ihp'sofs'sob'wfus~'nt'mrjwni`\r"
b'Hiq&rngr&rnc&vgtr\x7f&ou&lskvoha\x0c'
b'Onv!ui`u!uid!q`sux!hr!ktlqhof\x0b'
b'Now that the party is jumping\n'
b'Mlt#wkbw#wkf#sbqwz#jp#ivnsjmd\t'
b'Lmu"vjcv"vjg"rcpv{"kq"hworkle\x08'
b'Cbz-yely-yeh-}l\x7fyt-d~-gx`}dcj\x07'
b'Bc{,xdmx,xdi,|m~xu,e\x7f,fya|ebk\x06'
b'A`x/{gn{/{gj/\x7fn}{v/f|/ezb\x7ffah\x05'
<snip>

Perform a single-byte XOR brute force on each string

This method is probably the easier way, and is basically just an extension of the previous challenge. For details of how the functions work, see my <need link previous writeup> about brute forcing the single-character XOR and implementing an English scoring function. For this method, the plan is to read in all the strings, brute force the single-character XOR, and then pick the result with the highest score.

Here’s the code:

def get_english_score(input_bytes):
    """Compares each input byte to a character frequency 
    chart and returns the score of a message based on the
    relative frequency the characters occur in the English
    language.
    """

    # From https://en.wikipedia.org/wiki/Letter_frequency
    # with the exception of ' ', which I estimated.
    character_frequencies = {
        'a': .08167, 'b': .01492, 'c': .02782, 'd': .04253,
        'e': .12702, 'f': .02228, 'g': .02015, 'h': .06094,
        'i': .06094, 'j': .00153, 'k': .00772, 'l': .04025,
        'm': .02406, 'n': .06749, 'o': .07507, 'p': .01929,
        'q': .00095, 'r': .05987, 's': .06327, 't': .09056,
        'u': .02758, 'v': .00978, 'w': .02360, 'x': .00150,
        'y': .01974, 'z': .00074, ' ': .13000
    }
    return sum([character_frequencies.get(chr(byte), 0) for byte in input_bytes.lower()])


def single_char_xor(input_bytes, char_value):
    """Returns the result of each byte being XOR'd with a single value.
    """
    output_bytes = b''
    for byte in input_bytes:
        output_bytes += bytes([byte ^ char_value])
    return output_bytes


def bruteforce_single_char_xor(ciphertext):
    """Performs a singlechar xor for each possible value(0,255), and
    assigns a score based on character frequency. Returns the result
    with the highest score.
    """
    potential_messages = []
    for key_value in range(256):
        message = single_char_xor(ciphertext, key_value)
        score = get_english_score(message)
        data = {
            'message': message,
            'score': score,
            'key': key_value
            }
        potential_messages.append(data)
    return sorted(potential_messages, key=lambda x: x['score'], reverse=True)[0]


def main():
    ciphers = open('set_1_exer_4.txt').read().splitlines()
    potential_plaintext = []
    for hexstring in ciphers:
        ciphertext = bytes.fromhex(hexstring)
        potential_plaintext.append(bruteforce_single_char_xor(ciphertext))
    best_score = sorted(potential_plaintext, key=lambda x: x['score'], reverse=True)[0]
    for item in best_score:
        print("{}: {}".format(item.title(), best_score[item]))
        

if __name__ == '__main__':
    main()

This is very similar to the solution to the previous challenge, except that I’ve refactored code that was previously in the main() function and implemented it in the function bruteforce_single_char_xor().

Leave a Reply

Your email address will not be published. Required fields are marked *