Writing and fixing vulnerable XML parsing code in Python

A while back I wrote a Python tool (located here) that converts an Nmap XML output file to a CSV file (discussed here) using Python’s xml.etree.ElementTree module as the parser. If you look at the documentation for Python’s XML parsing modules, you may notice a warning that the modules are not secure against malicious input. The documentation includes a chart summarizing the issues with each of Python’s XML parsers.

You can look at the description of the issues here, but I want to take a closer look at the billion laughs and quadratic blowup attacks, because I used the ElementTree module to parse the Nmap XML file.

The goal of each of these attacks is to cause a denial of service against the system that is parsing the XML. Both attacks make use of XML entities, so before going into the details of the individual attacks, it is necessary to have a basic understanding of XML entities.

Entities

For the purpose of this discussion, you can think of an entity as a type of variable. Here’s an example XML document (entity_example.xml) using the from_name entity with a value of Jake Miller:


<?xml version="1.0"?>
<!DOCTYPE Example [
<!--This is an entity, which will expand in the From and Signature nodes-->
<!ENTITY from_name "Jake Miller">
]>
<Email>
  <To>Team</To>
  <From>&from_name;</From>
  <Subj>Reminder</Subj>
  <Body>TPS reports due.</Body>
  <Signature>Thanks, &from_name;</Signature>
</Email>

Here is a simple Python script that will print out the XML tag names and their values:


# example_parser.py
import xml.etree.ElementTree as etree

# Open the file and read into memory with etree
with open('entity_example.xml') as fh:
    tree = etree.parse(fh)

# Iterate through the XML nodes, printing the
# tag name and text values
for node in tree.iter():
    print(node.tag, node.text)

Output:


C:\ >python3 example_parser.py
Email

To Team
From Jake Miller
Subj Reminder
Body TPS reports due.
Signature Thanks, Jake Miller

As you can see, the from_name value, which was set to Jake Miller, was substituted each time the entity was referenced (&from_name;).

Billion Laughs

The billion laughs attack works by exponentially expanding entities. Here is the code example from Wikipedia:


<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ELEMENT lolz (#PCDATA)>
  <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
  <!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
  <!ENTITY lol5 "&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;">
  <!ENTITY lol6 "&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;">
  <!ENTITY lol7 "&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;">
  <!ENTITY lol8 "&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;">
  <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
  <lolz>&lol9;</lolz>

The first entity, lol, is set to “lol”. Each of the other entities is defined as 10 references to the entity before it. When lol9 is referenced in the lolz tag, the entities begin to expand. After all the entity expansions have been processed, this block of XML will actually contain 10^9 “lol”s (hence the billion “laughs”), taking up almost 3 gigabytes of memory.
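The arithmetic behind that figure can be checked without expanding anything. This is only a back-of-the-envelope sketch; the helper name expanded_length is made up for illustration, and it assumes one byte per character.

```python
def expanded_length(base_text, fan_out, levels):
    """Return the character count after fully expanding nested entities.

    base_text: value of the innermost entity ("lol")
    fan_out:   references each entity makes to the one below it (10)
    levels:    number of nested entities above the base (lol1..lol9 = 9)
    """
    length = len(base_text)
    for _ in range(levels):
        # Each level repeats the previous level fan_out times
        length *= fan_out
    return length


total_chars = expanded_length("lol", 10, 9)
copies = total_chars // len("lol")

print(copies)       # 1000000000 copies of "lol"
print(total_chars)  # 3000000000 characters, roughly 3 GB at one byte each
```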

Just for fun, I attempted to use my Nmap parser script with this code as an input. It ended up freezing up my computer and I had to reboot.

Quadratic Blowup

The quadratic blowup is just a variation of the billion laughs, designed to bypass safeguards that look for nested entities. Instead of having references to other entities, the quadratic blowup attack just repeats the call to an entity over and over. For example:


<?xml version="1.0"?>
<!DOCTYPE QuadraticBlowup [
<!ENTITY x "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...">
]>
<Bomb>
&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;
&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;
&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;&x;
</Bomb>

If you imagine that the defined entity x has even more ‘x’s, and that the entity is called even more in the Bomb tag, you can probably guess that this will suck up a lot of memory.
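To get a feel for the growth, here is a deliberately tiny version that is safe to run. It is only a sketch: build_quadratic_doc is a hypothetical helper, and the sizes (100 characters, 100 references) are kept small on purpose. The expanded text grows with entity length times reference count, while the document itself grows only with their sum.

```python
import xml.etree.ElementTree as ET


def build_quadratic_doc(entity_len, ref_count):
    """Build a small quadratic-blowup style document as a string."""
    entity_value = "x" * entity_len   # the single, long entity value
    references = "&x;" * ref_count    # the same entity referenced repeatedly
    return (
        '<?xml version="1.0"?>\n'
        '<!DOCTYPE QuadraticBlowup [\n'
        f'<!ENTITY x "{entity_value}">\n'
        ']>\n'
        f'<Bomb>{references}</Bomb>'
    )


doc = build_quadratic_doc(entity_len=100, ref_count=100)
root = ET.fromstring(doc)

print(len(doc))        # size of the XML that was sent to the parser
print(len(root.text))  # 10000 characters after expansion (100 * 100)
```

Scaling both numbers up by a factor of 1,000 would make the input about 1,000 times larger but the expanded text 1,000,000 times larger, which is where the "quadratic" comes from.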

Fixing my Code

For me, the easiest fix was simply to not allow entities in the XML. My quick and dirty fix was to read the file and look for the string ‘<!entity’ (case-insensitively) in the file contents prior to passing it to the XML parser. The fix looks like this:


def main():
    # args is the parsed command-line namespace, defined elsewhere in the script
    for filename in args.filename:
        #...
        # Check for entities
        if not args.skip_entity_check:
            # Read the file and check for entities
            with open(filename) as fh:
                contents = fh.read()
                if '<!entity' in contents.lower():
                    print("[-] Error! This program does not permit XML entities. Ignoring {}".format(filename))
                    print("[*] Use -s (--skip_entity_check) to ignore this check for XML entities.")
                    continue

There are probably more elegant ways to fix this, but this seems to work fine for now. I added the -s option to skip this check in the case of having to parse output including an entity.
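One possibly more elegant option is to let the XML parser itself report entity declarations instead of searching for a substring. The sketch below is not what my tool does; it assumes the standard library’s xml.parsers.expat module and uses its EntityDeclHandler callback, and the names EntitiesForbidden and declares_entities are made up for this example. Because the handler raises on the first declaration in the DTD, parsing stops before any expansion happens.

```python
import xml.parsers.expat


class EntitiesForbidden(ValueError):
    """Raised as soon as the document declares an entity."""


def declares_entities(xml_data):
    """Return True if xml_data (str or bytes) declares any XML entity."""

    def reject_entity(name, *_unused):
        # expat calls this for every <!ENTITY ...> declaration it parses
        raise EntitiesForbidden(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = reject_entity
    try:
        # True marks this as the final (and only) chunk of data
        parser.Parse(xml_data, True)
    except EntitiesForbidden:
        return True
    return False


safe_doc = "<Email><To>Team</To></Email>"
entity_doc = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE Example [<!ENTITY from_name "Jake Miller">]>'
    '<Email><From>&from_name;</From></Email>'
)

print(declares_entities(safe_doc))    # False
print(declares_entities(entity_doc))  # True
```

Unlike the substring search, this does not flag a file that merely mentions ‘<!ENTITY’ inside a comment, and it does not depend on how the declaration is spaced. A malformed file will still raise xml.parsers.expat.ExpatError, so the caller would need to handle that case.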
