5.1 Starting Out

Let's follow the earlier example of a comic book collection, using a simple DTD-less format. Here's a sample document for a collection consisting of a single issue:

<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>

An XML document must have a single root element; here that is the collection element. It has one child comic element for each issue; the book's title and number are given as attributes of the comic element. The comic element can in turn contain several other elements, such as writer and penciller, listing the writer and artists responsible for the issue. There may be several artists or writers for a single issue.

Let's start off with something simple: a document handler named FindIssue that reports whether a given issue is in the collection.

from xml.sax import saxutils

class FindIssue(saxutils.DefaultHandler):
    def __init__(self, title, number):
        self.search_title, self.search_number = title, number

The DefaultHandler class inherits from all four interfaces: ContentHandler, DTDHandler, EntityResolver, and ErrorHandler. Use it when you want a single class that wraps up all the logic for your parsing. You could also subclass each interface individually and implement a separate class for each purpose. Neither approach is always "better" than the other; mostly it's a matter of taste.
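
As a rough sketch of the single-class approach, the combined class can also be written by hand from the four interface classes in xml.sax.handler. This assumes Python 3 and an installation whose saxutils module does not provide DefaultHandler (the Python 3 standard library does not). The name CombinedHandler is made up for this illustration.

```python
from xml.sax.handler import (ContentHandler, DTDHandler,
                             EntityResolver, ErrorHandler)


# Hypothetical stand-in for DefaultHandler: one class that inherits
# from all four SAX handler interfaces and overrides nothing yet.
class CombinedHandler(ContentHandler, DTDHandler,
                      EntityResolver, ErrorHandler):
    pass


handler = CombinedHandler()

# The same instance can play every handler role.
print(isinstance(handler, ContentHandler))
print(isinstance(handler, ErrorHandler))
```

A subclass of such a class can then override startElement() or any other handler method, in the same way FindIssue does.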

Since this class is doing a search, an instance needs to know what it's searching for. The desired title and issue number are passed to the FindIssue constructor, and stored as part of the instance.

Now let's override some of the parsing methods. This simple search only requires looking at the attributes of a given element, so only the startElement method is relevant.

    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic': return

        # Look for the title and number attributes (see text)
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if (title == self.search_title and
                number == self.search_number):
            print title, '#' + str(number), 'found'

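The startElement() method is passed a string giving the name of the element, and an instance containing the element's attributes. Attributes are accessed using methods from the Attributes interface (the SAX 2 successor to SAX 1's AttributeList), which includes most of the semantics of Python dictionaries. Attribute values are always strings, which is why the issue number is compared as a string.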
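
The dictionary-like behaviour can be tried out without a parser. The sketch below assumes Python 3 and the standard-library xml.sax package, and builds the attribute object by hand with AttributesImpl from xml.sax.xmlreader. In real use the parser creates this object and passes it to startElement(). The 'price' attribute is a made-up name used only to show a failed lookup.

```python
from xml.sax.xmlreader import AttributesImpl

# Build the attribute object by hand; a parser would normally
# create it and pass it to startElement().
attrs = AttributesImpl({'title': 'Sandman', 'number': '62'})

title = attrs.get('title', None)      # dictionary-style lookup with a default
missing = attrs.get('price', None)    # absent attribute, so the default comes back
names = sorted(attrs.keys())          # attribute names, sorted for display
has_number = 'number' in attrs        # membership test
number = attrs['number']              # subscript access; raises KeyError if absent

print(title, number, names, has_number, missing)
```

Note that number is the string '62', not the integer 62, so a search value passed as an integer would never compare equal to it.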

To summarize, the startElement() method looks for comic elements and compares the specified title and number attributes to the search values. If they match, a message is printed out.

startElement() is called for every single element in the document. If you added print 'Starting element:', name to the top of startElement(), you would get the following output.

Starting element: collection
Starting element: comic
Starting element: writer
Starting element: penciller
Starting element: penciller
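
The trace above can be reproduced with a short script. This is a minimal sketch, assuming Python 3 and the standard-library xml.sax package, where the handler subclasses xml.sax.handler.ContentHandler. The class name ElementTracer and its names list are made up for this illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# The sample document from the start of this section.
SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


class ElementTracer(ContentHandler):
    """Hypothetical handler that reports every start tag it sees."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []   # element names, in document order

    def startElement(self, name, attrs):
        self.names.append(name)
        print('Starting element:', name)


tracer = ElementTracer()
xml.sax.parseString(SAMPLE, tracer)
```

The elements are reported in document order, so the nested writer and penciller elements appear after their parent comic element.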

To actually use the class, we need top-level code that creates instances of a parser and of FindIssue, associates the parser and the handler, and then calls a parser method to process the input.

from xml.sax import make_parser
from xml.sax.handler import feature_namespaces

if __name__ == '__main__':
    # Create a parser
    parser = make_parser()

    # Tell the parser we are not interested in XML namespaces
    parser.setFeature(feature_namespaces, 0)

    # Create the handler
    dh = FindIssue('Sandman', '62')

    # Tell the parser to use our handler
    parser.setContentHandler(dh)

    # Parse the input; 'file' stands for the name of the XML file
    # (or an open file object) holding the document
    parser.parse(file)

The make_parser function automates the job of creating parsers. There are already several XML parsers available to Python, and more might be added in the future. xmllib.py is included as part of the Python standard library, so it's always available, but it's also not particularly fast. A faster version of xmllib.py is included in xml.parsers. The xml.parsers.expat module is faster still, so it's the preferred choice when it's available. make_parser determines which parsers are available and chooses the fastest one, so you don't have to know what the different parsers are, or how they differ. (You can also pass make_parser a list of parsers to try, if you want to use a specific one.)
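
Both ways of calling make_parser can be sketched as follows. This assumes Python 3 and the standard-library xml.sax package, where the Expat driver lives in the module xml.sax.expatreader. Other installations may offer different driver modules.

```python
from xml.sax import make_parser

# Let make_parser() pick a parser on its own.
default_parser = make_parser()

# Ask for a specific driver module first. make_parser() tries the listed
# modules in order and then falls back to its built-in default list.
expat_parser = make_parser(['xml.sax.expatreader'])

print(type(default_parser).__name__)
print(type(expat_parser).__name__)
```

Whichever parser comes back, it is used the same way: register the handlers, then call parse().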

Once you've created a parser instance, calling the setContentHandler() method tells the parser what to use as the content handler. There are similar methods for setting the other handlers: setDTDHandler(), setEntityResolver(), and setErrorHandler().
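
As an illustration of one of the other setter methods, the sketch below registers a separate error handler next to the content handler and feeds the parser a deliberately malformed document. It assumes Python 3 and the standard-library xml.sax package. The class name RecordingErrorHandler and its messages list are made up for this example.

```python
import io
from xml.sax import make_parser, SAXParseException
from xml.sax.handler import ContentHandler, ErrorHandler


class RecordingErrorHandler(ErrorHandler):
    """Hypothetical error handler that notes fatal errors, then re-raises."""

    def __init__(self):
        self.messages = []

    def fatalError(self, exception):
        self.messages.append(exception.getMessage())
        raise exception


parser = make_parser()
parser.setContentHandler(ContentHandler())   # do-nothing content handler
error_handler = RecordingErrorHandler()
parser.setErrorHandler(error_handler)

# Malformed on purpose: the comic element is never closed.
broken = b"<collection><comic title='Sandman' number='62'></collection>"

try:
    parser.parse(io.BytesIO(broken))
    parsed_ok = True
except SAXParseException:
    parsed_ok = False

print(parsed_ok, error_handler.messages)
```

Keeping the error handler in its own class is the "separate classes" style mentioned earlier; a single combined handler object could be passed to both setter methods instead.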

If you run the above code with the sample XML document, it'll print Sandman #62 found.
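
For readers who want to run the whole example in one go, here is a self-contained sketch of the same search. It assumes Python 3 and the standard-library xml.sax package, so the handler subclasses xml.sax.handler.ContentHandler rather than saxutils.DefaultHandler, and the document is parsed from a byte string with xml.sax.parseString instead of from a file. The found attribute is an addition made for this illustration so the result can be checked from code.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# The sample document from the start of this section.
SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


class FindIssue(ContentHandler):
    def __init__(self, title, number):
        ContentHandler.__init__(self)
        self.search_title, self.search_number = title, number
        self.found = False   # illustrative addition, not in the original class

    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic':
            return

        # Attribute values are strings, so number is compared as a string
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if title == self.search_title and number == self.search_number:
            self.found = True
            print('%s #%s found' % (title, number))


handler = FindIssue('Sandman', '62')
xml.sax.parseString(SAMPLE, handler)

# A search for an issue that is not in the collection prints nothing.
absent = FindIssue('Sandman', '63')
xml.sax.parseString(SAMPLE, absent)
```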