Let's follow the earlier example of a comic book collection, using a simple DTD-less format. Here's a sample document for a collection consisting of a single issue:
    <collection>
      <comic title="Sandman" number='62'>
        <writer>Neil Gaiman</writer>
        <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
        <penciller pages="10-17">Charles Vess</penciller>
      </comic>
    </collection>
An XML document must have a single root element; this is the "collection" element. It has one child comic element for each issue; the book's title and number are given as attributes of the comic element. The comic element can in turn contain several other elements, such as writer and penciller, listing the writer and artists responsible for the issue. There may be several artists or writers for a single issue.
Let's start off with something simple: a document handler named FindIssue that reports whether a given issue is in the collection.
    from xml.sax import saxutils

    class FindIssue(saxutils.DefaultHandler):
        def __init__(self, title, number):
            self.search_title, self.search_number = title, number
The DefaultHandler class inherits from all four interfaces: ContentHandler, DTDHandler, EntityResolver, and ErrorHandler. This is what you should use if you just want to write a single class that wraps up all the logic for your parsing. You could also subclass each interface individually and implement separate classes for each purpose. Neither of the two approaches is always "better" than the other; mostly it's a matter of taste.
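The two approaches can be sketched side by side. This is only an illustration, written for Python 3's standard-library xml.sax, where ContentHandler and ErrorHandler live in xml.sax.handler; the class names AllInOne, NameCollector, and StrictErrors and the shortened SAMPLE document are made up for this example.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler

# Shortened stand-in for the sample collection (assumption for this sketch).
SAMPLE = b"<collection><comic title='Sandman' number='62'/></collection>"


class AllInOne(ContentHandler, ErrorHandler):
    """Approach 1: one class plays both the content and error roles."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)

    def fatalError(self, exception):
        raise exception


class NameCollector(ContentHandler):
    """Approach 2a: a class that only handles content events."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


class StrictErrors(ErrorHandler):
    """Approach 2b: a separate class that only handles errors."""

    def fatalError(self, exception):
        raise exception


# Approach 1: the same object is passed as handler and error handler.
one = AllInOne()
xml.sax.parseString(SAMPLE, one, one)

# Approach 2: two independent objects, one per purpose.
collector = NameCollector()
xml.sax.parseString(SAMPLE, collector, StrictErrors())

print(one.names, collector.names)
```

Both runs see the same events; the difference is only in how the logic is packaged.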
Since this class is doing a search, an instance needs to know what it's searching for. The desired title and issue number are passed to the FindIssue constructor, and stored as part of the instance.
Now let's override some of the parsing methods. This simple search only requires looking at the attributes of a given element, so only the startElement method is relevant.
        def startElement(self, name, attrs):
            # If it's not a comic element, ignore it
            if name != 'comic':
                return

            # Look for the title and number attributes (see text)
            title = attrs.get('title', None)
            number = attrs.get('number', None)
            if (title == self.search_title and
                number == self.search_number):
                print title, '#' + str(number), 'found'
The startElement() method is passed a string giving the name of the element, and an instance containing the element's attributes. Attributes are accessed using methods from the Attributes interface (called AttributeList in SAX 1), which includes most of the semantics of Python dictionaries.
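To make the dictionary-like behaviour concrete, here is a small sketch for Python 3's standard-library xml.sax. The ShowAttributes class and the 'price' attribute name are invented for illustration; the dictionary-style calls used (get, keys, the in operator, and indexing) are ones the attributes object supports.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ShowAttributes(ContentHandler):
    """Record what the dictionary-style calls return for a comic element."""

    def __init__(self):
        super().__init__()
        self.seen = {}

    def startElement(self, name, attrs):
        if name != 'comic':
            return
        # get() returns the default when the attribute is absent
        self.seen['title'] = attrs.get('title', None)
        self.seen['price'] = attrs.get('price', None)  # 'price' is not in the document
        # keys() lists attribute names; 'in' tests for membership
        self.seen['names'] = sorted(attrs.keys())
        self.seen['has_number'] = 'number' in attrs
        # Indexing works too; the value is a string, not an integer
        self.seen['number'] = attrs['number']


handler = ShowAttributes()
xml.sax.parseString(
    b'<collection><comic title="Sandman" number="62"/></collection>',
    handler)
print(handler.seen)
```

Attribute values always arrive as strings, which is why a search for issue 62 has to compare against the string '62' rather than the integer 62.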
To summarize, the startElement() method looks for comic elements and compares the specified title and number attributes to the search values. If they match, a message is printed out.
startElement() is called for every single element in the document. If you added print 'Starting element:', name to the top of startElement(), you would get the following output:
    Starting element: collection
    Starting element: comic
    Starting element: writer
    Starting element: penciller
    Starting element: penciller
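The trace can be reproduced with a handler that does nothing but report each element. This is a sketch for Python 3's standard-library xml.sax; TraceElements is a name chosen for this example, and the document is fed in as a byte string through xml.sax.parseString rather than read from a file.

```python
import xml.sax
from xml.sax.handler import ContentHandler

SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


class TraceElements(ContentHandler):
    """Print and remember the name of every element as it starts."""

    def __init__(self):
        super().__init__()
        self.started = []

    def startElement(self, name, attrs):
        print('Starting element:', name)
        self.started.append(name)


tracer = TraceElements()
xml.sax.parseString(SAMPLE, tracer)
```

Elements are reported in document order, and the repeated penciller element triggers the method once per occurrence.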
To actually use the class, we need top-level code that creates instances of a parser and of FindIssue, associates the parser and the handler, and then calls a parser method to process the input.
    from xml.sax import make_parser
    from xml.sax.handler import feature_namespaces

    if __name__ == '__main__':
        # Create a parser
        parser = make_parser()

        # Tell the parser we are not interested in XML namespaces
        parser.setFeature(feature_namespaces, 0)

        # Create the handler
        dh = FindIssue('Sandman', '62')

        # Tell the parser to use our handler
        parser.setContentHandler(dh)

        # Parse the input; 'file' stands for the name of the XML
        # document, or an open file object for it
        parser.parse(file)
The make_parser function can automate the job of creating parsers. There are already several XML parsers available to Python, and more might be added in future. xmllib.py is included as part of the Python standard library, so it's always available, but it's also not particularly fast. A faster version of xmllib.py is included in xml.parsers. The xml.parsers.expat module is faster still, so it's the preferred choice if it's available. make_parser determines which parsers are available and chooses the fastest one, so you don't have to know what the different parsers are, or how they differ. (You can also tell make_parser to try a list of parsers, if you want to use a specific one.)
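As a sketch of passing a list of parsers, the following assumes Python 3's standard library, where make_parser accepts a list of module names and the Expat driver is the module xml.sax.expatreader. The first entry, hypothetical_fast_parser, is a deliberately nonexistent module used only to show that unavailable entries are skipped.

```python
from xml.sax import make_parser

# Modules are tried in order. 'hypothetical_fast_parser' does not exist
# (it is a placeholder for this sketch), so make_parser moves on to the
# Expat-based driver that ships with Python 3.
preferred = ['hypothetical_fast_parser', 'xml.sax.expatreader']
parser = make_parser(preferred)

# Show which driver module was actually chosen
chosen = type(parser).__module__
print(chosen)
```

Calling make_parser() with no arguments remains the simplest choice when you don't care which parser is used.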
Once you've created a parser instance, calling the setContentHandler() method tells the parser what to use as the content handler. There are similar methods for setting the other handlers: setDTDHandler(), setEntityResolver(), and setErrorHandler().
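For instance, an error handler can be attached in the same way as the content handler. The sketch below targets Python 3's standard-library xml.sax; RecordingErrors is a name invented here, and the input is a deliberately malformed document (the comic element is never closed) supplied through io.BytesIO instead of a file on disk.

```python
import io
from xml.sax import make_parser, SAXParseException
from xml.sax.handler import ContentHandler, ErrorHandler


class RecordingErrors(ErrorHandler):
    """Remember fatal errors, then re-raise them to stop parsing."""

    def __init__(self):
        self.messages = []

    def fatalError(self, exception):
        self.messages.append(exception.getMessage())
        raise exception


parser = make_parser()
errors = RecordingErrors()
parser.setContentHandler(ContentHandler())  # base class ignores all events
parser.setErrorHandler(errors)

# The closing tag for comic is missing, so the document is not well-formed.
broken = io.BytesIO(b"<collection><comic></collection>")
try:
    parser.parse(broken)
    well_formed = True
except SAXParseException:
    well_formed = False

print(well_formed, errors.messages)
```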
If you run the above code with the sample XML document, it'll print:
    Sandman #62 found
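For readers on Python 3, where saxutils.DefaultHandler and the print statement are not available, the whole example might be adapted roughly as follows. This is a sketch under stated assumptions, not the original code: it subclasses xml.sax.handler.ContentHandler instead of DefaultHandler, reads the sample document from an in-memory byte string, and adds a found attribute and a find_issue helper that are not part of the original handler.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_namespaces

SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


class FindIssue(ContentHandler):
    def __init__(self, title, number):
        super().__init__()
        self.search_title, self.search_number = title, number
        self.found = False  # added for this sketch, so callers can check

    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic':
            return
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if title == self.search_title and number == self.search_number:
            self.found = True
            print(title, '#' + str(number), 'found')


def find_issue(title, number):
    """Hypothetical helper: parse SAMPLE and report whether the issue exists."""
    parser = make_parser()
    # Tell the parser we are not interested in XML namespaces
    parser.setFeature(feature_namespaces, 0)
    dh = FindIssue(title, number)
    parser.setContentHandler(dh)
    parser.parse(io.BytesIO(SAMPLE))
    return dh.found


present = find_issue('Sandman', '62')
absent = find_issue('Sandman', '63')
```

The first call prints the match message and returns True; the second prints nothing and returns False.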