Cryptopals challenge 1 and implementing base64 encoding in Python

In an attempt to increase my knowledge of cryptography I’m working through the challenges at https://cryptopals.com with Python (Python3) as my language of choice. The first challenge isn’t too difficult, but I had fun completing it so I figured I’d write about it. This is the first challenge:

Convert hex to base64

The string:

49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d

Should produce:

SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t

So go ahead and make that happen. You’ll need to use this code for the rest of the exercises.

Cryptopals Rule

Always operate on raw bytes, never on encoded strings. Only use hex and base64 for pretty-printing.

Here is how I initially implemented the code:

I decided to do some extra credit and manually implement the base64 encoding. For background, I’ll summarize from the Base64 Wikipedia article: The main purpose of base64 is to transform binary data into a character set that are common to most encodings, and also printable. Traditionally, the character set used is A-Z, a-z, 0-9, +, and /, but this is not mandatory. Each base64 digit represents exactly 6 bits of data. Three 8-bit bytes (i.e., a total of 24 bits) can therefore be represented by four 6-bit base64 digits.

Wikipedia uses the example of encoding ‘Man’, which encodes to ‘TWFu’. I’ll walk through it with Python:

“…the encoded value of Man is TWFu. Encoded in ASCII, the characters M, a, and n are stored as the bytes 77, 97, and 110…

>>> [byte for byte in b'Man']
[77, 97, 110]

“…which are the 8-bit binary values 01001101, 01100001, and 01101110.”

>>> [bin(byte) for byte in b'Man']
['0b1001101', '0b1100001', '0b1101110']

The leading zeros are automatically removed, but we’ll add them back, as well as remove the ‘0b’ prefix.

“These three values are joined together into a 24-bit string, producing 010011010110000101101110.”

>>> ''.join([bin(byte)[2:].zfill(8) for byte in b'Man'])
'010011010110000101101110'

Groups of 6 bits (6 bits have a maximum of 26 = 64 different binary values) are converted into individual numbers from left to right (in this case, there are four numbers in a 24-bit string), which are then converted into their corresponding Base64 character values.”

Here is the Base64 index table (from Wikipedia):

The table is implemented like this:

>>> b64_index_table = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

Now let’s take the first six bits:

>>> bits = ''.join([bin(byte)[2:].zfill(8) for byte in b'Man'])
>>> first_6 = bits[:6]
>>> int(first_6, 2)
19
>>> b64_index_table[19]
'T'

Now the next 6:

>>> next_6 = bits[6:12]
>>> int(next_6, 2)
22
>>> b64_index_table[22]
'W'

Continue the process and you end up with ‘TWFu’, as expected. ‘Man’ encodes cleanly, since its 24 bits are evenly divisible by each 6-bit base64 character. But, when you have a value that isn’t divisible by six, one or two ‘=’ characters may be added to the end of the encoded string to make the bits divisible my six:

Here is my revised solution to this initial challenge, with my implementation of the base64 encoding:


b64_index_table = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_encode(input_bytes):
    """Implements base64 encoding.
    """
    # Initialize variable that will store the base64 encoded string
    encoded_output = ''

    # Takes each input character and converts to binary, eventually creating
    # a list of ones and zeroes (1, 1, 1, 0, 0, 1, 1, 0)
    bit_list = list(''.join([bin(byte)[2:].zfill(8) for byte in input_bytes]))
    
    # Break the list smaller lists, 6 bits long. For example, [1,1,1,1,1,1,0,0]
    # becomes [[1,1,1,1,1,1] [0,0]]
    chunks = [bit_list[i:i+6] for i in range(0, len(bit_list), 6)]
    
    for chunk in chunks:
        # Joins the chunk so it can be converted to an integer
        chunk = ''.join(chunk)

        # Checks the length of the chunk, adding trailing zeroes, mapping the
        # value to the b64_index_table, and adding '=' characters as necessary.
        if len(chunk) == 2:
            if '1' in chunk:
                chunk += '0000'
                encoded_output += b64_index_table[int(chunk, 2)] + '=='
            else:
                encoded_output += '=='
        elif len(chunk) == 4: 
            if '1' in chunk:
                chunk += '00'
                encoded_output += b64_index_table[int(chunk, 2)] + '='
            else:
                encoded_output += '='
        elif len(chunk) == 6:
            encoded_output += b64_index_table[int(chunk, 2)]
    return encoded_output


def main():
    string = '49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d'
    byte_string = bytes.fromhex(string)
    print(b64_encode(byte_string))


if __name__ == '__main__':
    main()

Leave a Reply

Your email address will not be published. Required fields are marked *