This post explores a method for determining whether data is encrypted, and introduces a tool that helps with the analysis.
I was recently testing a website that used encoded parameters in the requests. For example, clicking on a link to a document would send a request like:
Form data also had an encoded parameter:
POST /documents/submit hData= CRIwqt4%2BszDbqkNY%2BI0qbNXPg1XLaCM5etQ5Bt9DRFV%2FxIN2k8Go7jtArLIyP605b071DL8C%2BFPYSHOXPkMMMFPAKm%2BNsu0nCBMQVt9mlluHbVE%2Fyl6VaBCjNuOGvHZ9WYvt51uR%2FlklZZ0ObqD5UaC1rupZwCEK4pIWf6JQ4pTyPjyiPtKXg54FNQvbVIHeotUG2kHEvHGS%2Fw2Tt4E42xEwVfi29J3yp0O%2FTcL7aoRZIcJjMV4qxY%2FuvZLGsjo1%2FIyhtQp3vY0nSzJjGgaLYXpvRn8TaAcEtH3cqZenBooxBH3MxNjD%2FTVf3NastEWGnqeGp%2B0D9bQx%2F3L0%2BxTf%2Bk2VjBDrV9HPXNELRgPN0MlNo79p2gEwWjfTbx2KbF6htgsbGgCMZ6%2FiCshy3R8%2Fabxkl8eK%2FVfCGfA6bQQkqs91bgsT0RgxXSWzjjvh4eXTSl8xYoMDCGa2opN%2Fb6Q2MdfvW7rEvp5mwJOfQFDtkv4M5cFEO3sjmU9MReRnCpvalG3ark0XC589rm%2B42jC4%2FoFWUdwvkzGkSeoabAJdEJCifhvtGosYgvQDARUoNTQAO1%2BCbnwdKnA%2FWbQ59S9MU61QKcYSuk%2BjK5nAMDot2dPmvxZIeqbB6ax1IH0cdVx7qB%2FZ2FlJ%2FU927xGmC%2FRUFwoXQDRqL05L22wEiF85HKx2XRVB0F7keglwX%2Fkl4gga5rk3YrZ7VbInPpxUzgEaE4%2BBDoEqbv%2FrYMuaeOuBIkVchmzXwlpPORwbN0%2FRUL89xwOJKCQQZM8B1YsYOqeL3HGxKfpFo7kmArXSRKRHToXuBgDq07KS%2FjxaS1a1Paz%2FtvYHjLxwY0Ot3kS%2BcnBeq%2FFGSNL%2FfFV3J2a8eVvydsKat3XZS3WKcNNjY2ZEY1rHgcGL
I noticed that the parameters appeared to be Base64 then URL encoded, but decoding these strings didn’t provide anything of use:
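The decoding step itself can be sketched in a few lines of Python. This is only an illustration: it assumes the parameter is standard Base64 wrapped in URL encoding, the `decode_parameter` helper is a name invented for this example, and the sample value is just the first 24 Base64 characters of the hData value above.

```python
import base64
from urllib.parse import unquote


def decode_parameter(value: str) -> bytes:
    """Undo URL encoding, then Base64 (hypothetical helper name)."""
    b64_text = unquote(value)  # turns %2B into '+' and %2F into '/'
    # Pad to a multiple of four characters in case padding was stripped.
    b64_text += "=" * (-len(b64_text) % 4)
    return base64.b64decode(b64_text)


# First 24 Base64 characters of the hData parameter shown above.
sample = "CRIwqt4%2BszDbqkNY%2BI0qbNXP"
raw = decode_parameter(sample)
print(len(raw), raw.hex())
```

The result is a run of binary bytes rather than readable text, which is what prompted the question of what the data actually is.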
Still, I wondered whether the data was actually encrypted, or whether it was just compressed, encoded with some other scheme, or something else entirely. Burp Sequencer can help with this kind of analysis, but I wanted to tackle the problem manually so I could learn how to do it in Python and end up with a tool that I could use away from web applications.
So how can you tell if data is encrypted? In theory, you shouldn’t be able to distinguish encrypted data from truly random data. Each byte should have an equal chance of taking any value in the full range (0-255 in decimal). So, given a large enough sample size, you should see a roughly even byte distribution across the entire range.
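As a rough sketch of that idea, the snippet below counts how often each of the 256 byte values occurs. It uses `os.urandom` as a stand-in for well-encrypted data, which is an assumption made for illustration; the function names are invented here and are not taken from the tool described later.

```python
import os
from collections import Counter


def byte_counts(data: bytes) -> list:
    """Return 256 counts, one per possible byte value."""
    seen = Counter(data)
    return [seen[value] for value in range(256)]


def unrepresented_values(data: bytes) -> int:
    """How many of the 256 byte values never occur in data."""
    return sum(1 for count in byte_counts(data) if count == 0)


random_sample = os.urandom(64 * 1024)  # stand-in for well-encrypted data
counts = byte_counts(random_sample)
print("unrepresented values:", unrepresented_values(random_sample))
print("least/most common count:", min(counts), max(counts))
```

With a sample this large, every byte value should appear, and the least and most common counts should not be far apart.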
This is easy to see in a histogram. Here is the byte distribution of a small amount of data that has been AES encrypted:
As shown, bytes across the entire range are well represented.
English plaintext data looks much different when plotted. Here is an example:
As you can see, much less of the byte range is represented, with most of the bytes falling within the lowercase letter range (decimal 97-122 is lowercase a-z).
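The same observation can be made without a plot. The sketch below measures what share of a message falls inside a given byte range; `share_in_range` is a hypothetical helper, and the message is the short sample sentence used later in this post rather than the text behind the histogram.

```python
def share_in_range(data: bytes, low: int, high: int) -> float:
    """Fraction of bytes whose value lies in [low, high]."""
    if not data:
        return 0.0
    return sum(low <= byte <= high for byte in data) / len(data)


message = b"the quick brown fox jumped over the lazy dog's back"
# Decimal 97-122 is lowercase a-z.
print(f"{share_in_range(message, 97, 122):.1%} of bytes are lowercase letters")
```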
What about compressed data? Compressing data typically involves finding ways to express redundant data more compactly. Here is the histogram of the plaintext from above after being compressed into a ZIP file:
This looks pretty similar to encryption, with the exception of the uneven distribution of the bytes in the range of 0-25, which correspond to whitespace and control characters.
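To try this yourself, the sketch below compares how many distinct byte values appear before and after compression. Note the assumptions: it uses the standard-library `zlib` module instead of building an actual ZIP file (so there are no ZIP headers in the output), and the "plaintext" is a made-up stream of random words rather than real English.

```python
import random
import zlib


def distinct_byte_values(data: bytes) -> int:
    """Number of different byte values (0-255) that occur in data."""
    return len(set(data))


# Hypothetical stand-in for English text: words drawn from a short list.
words = ["the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog", "back"]
rng = random.Random(0)  # fixed seed so runs are repeatable
plaintext = " ".join(rng.choice(words) for _ in range(20000)).encode("ascii")

compressed = zlib.compress(plaintext, 9)

print("plaintext: ", len(plaintext), "bytes,",
      distinct_byte_values(plaintext), "distinct values")
print("compressed:", len(compressed), "bytes,",
      distinct_byte_values(compressed), "distinct values")
```

The compressed output should use far more of the byte range than the input does, which is why a quick glance at a histogram can mistake compression for encryption.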
There’s also the possibility of a weak encryption/data manipulation scheme, such as an XOR. While the output will depend on the value of the key, here is an example of the same plaintext encrypted using repeating key XOR with the key ‘JAKE’:
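For reference, repeating-key XOR takes only a few lines of Python. This is a minimal sketch, not the code used to produce the plot; the function name is invented here, and the plaintext is the short sample sentence used later in this post.

```python
from itertools import cycle


def repeating_key_xor(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the key, repeating the key as needed."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


plaintext = b"the quick brown fox jumped over the lazy dog's back"
ciphertext = repeating_key_xor(plaintext, b"JAKE")

print(ciphertext.hex())
# ASCII text XORed with an ASCII key never sets the high bit,
# so no output byte can exceed 127.
print("largest byte value:", max(ciphertext))
```

Because the upper half of the byte range stays empty when both the plaintext and the key are ASCII, a histogram of this kind of output looks nothing like that of properly encrypted data.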
So, going back to the parameters on the website: after gathering a decent sample size, I created a histogram of the data, and it looked encrypted. While this was a disappointment, I was happy to have created a tool that helps me answer the question of whether data is encrypted.
I also learned that plotting a histogram is pretty easy with Python:
    >>> import matplotlib.pyplot as plt
    >>> message = b"the quick brown fox jumped over the lazy dog's back"
    >>> bytes_in_message = [byte for byte in message]
    >>> plt.hist(bytes_in_message)
    (array([10., 0., 0., 0., 0., 0., 0., 14., 14., 13.]), <snip>
    >>> plt.xticks(range(0, 256, 10))
    ([<matplotlib.axis.XTick object at 0x00000211D4A23048>, <snip>
    >>> plt.show()
Of course, there are ways to style the graph (change the color, add labels, etc.), but basic plotting can be accomplished very easily.
The tool, check_byte_distribution.py, is available on my GitHub, and can make these pretty histograms, as well as scatter plots. Here is the basic usage:
    usage: check_byte_distribution.py [-h] [-v] [-ph] [-ps] [-d DATA] [-f FILE] [-u] [-b] [-l] [-x]

      -h, --help            show this help message and exit
      -v, --verbose         Increase output verbosity
      -ph, --plot_histogram Plot as histogram
      -ps, --plot_scatter   Plot as scatter
      -d DATA, --data DATA  Specify the data as a string
      -f FILE, --file FILE  Specify a file containing the data
      -u, --url_decode      Decode the URL encoded characters
      -b, --b64_decode      Decode the b64 encoded data
      -l, --line_by_line    Checks entropy of data in a file line by line
      -x, --hex_decode      Decode the hex encoded data
The data that you are interested in measuring can be specified with either the -d (--data) or -f (--file) argument. If the data is encoded, the tool offers a few options (Base64, URL, hex) for decoding it. The tool can also read a file line by line and check the bytes of each line with the -l (--line_by_line) switch. When using this switch you may not want a plot for every line, so instead the tool prints to the terminal the number of byte values that are not represented, as a rough indicator of entropy. For example, encrypted data (that is Base64 encoded) will look like this:
    python3 check_byte_distribution.py -f encrypted_data.txt -b
    [+] 0 bytes positions are not represented in the data
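The per-line check described above could be approximated as follows. This is a sketch of the idea only, not the tool's actual code: the function names are hypothetical, it assumes every non-blank line is Base64, and the two input lines are made up for the example.

```python
import base64


def unrepresented_values(data: bytes) -> int:
    """How many of the 256 byte values never occur in data."""
    return 256 - len(set(data))


def check_lines(lines) -> list:
    """Base64-decode each non-blank line and count its missing byte values."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        results.append(unrepresented_values(base64.b64decode(line)))
    return results


# Two made-up lines: every byte value once, and a short ASCII word.
lines = [
    base64.b64encode(bytes(range(256))).decode("ascii"),
    base64.b64encode(b"hello").decode("ascii"),
]
for number, missing in enumerate(check_lines(lines), start=1):
    print(f"line {number}: {missing} byte values not represented")
```

A count near zero only suggests encryption when the line is long enough; a short line cannot cover all 256 values no matter how random it is.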
To plot this data as a histogram, simply add the -ph (--plot_histogram) switch:
python3 check_byte_distribution.py -f encrypted_data.txt -b -ph
Eventually I would like the tool to also check how evenly the byte values are distributed, not just which ones appear. This tool is only lightly tested, so there may be some bugs. Feedback is welcome.