This HOWTO describes version 2 of SAX (also referred to as SAX2). Support is still present for SAX version 1, which is now only of historical interest; SAX1 will not be documented here.
SAX is most suitable for purposes where you want to read through an
entire XML document from beginning to end, and perform some
computation such as building a data structure or summarizing the
contained information (computing an average value of a certain
element, for example). SAX is not very convenient if you want to
modify the document structure by changing how elements are nested,
though it would be straightforward to write a SAX program that simply
changed element contents or attributes. For example, you wouldn't
want to re-order chapters in a book using SAX, but you might want to
extract the contents of all name elements with the attribute lang
set to 'greek'.
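As a rough sketch of that last task, the handler below collects the text of every name element whose lang attribute is 'greek'. The class name GreekNameHandler and the sample document are invented for illustration; only the xml.sax calls come from the standard library.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class GreekNameHandler(ContentHandler):
    """Collect the text of every <name lang="greek"> element (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self.names = []       # finished results
        self._buffer = None   # text chunks while inside a matching element

    def startElement(self, name, attrs):
        if name == "name" and attrs.get("lang") == "greek":
            self._buffer = []

    def characters(self, content):
        # Text may be delivered in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name" and self._buffer is not None:
            self.names.append("".join(self._buffer))
            self._buffer = None


# Made-up sample input
SAMPLE = b"""<people>
  <name lang="greek">Sokrates</name>
  <name lang="latin">Cicero</name>
  <name lang="greek">Platon</name>
</people>"""

handler = GreekNameHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.names)
```

The handler never builds a tree; it only remembers whether it is currently inside a matching element.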
The main advantages of SAX are speed and simplicity. Let's say
you've defined a complicated DTD for listing comic books, and you wish
to scan through your collection and list everything written by Neil
Gaiman. For this specialized task, there's no need to expend effort
examining elements for artists and editors and colourists, because
they're irrelevant to the search. You can therefore write a handler
class that ignores all elements that aren't writer elements.
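A search like this might look as follows. The element and attribute names (comic, title, writer, artist) stand in for whatever the real DTD defines, and WriterSearch is a hypothetical class name; treat it as a sketch, not a finished tool.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class WriterSearch(ContentHandler):
    """Record the titles of comics whose writer matches `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []
        self._title = None
        self._in_writer = False
        self._text = []

    def startElement(self, name, attrs):
        if name == "comic":
            self._title = attrs.get("title")
        elif name == "writer":
            self._in_writer = True
            self._text = []
        # Any other element (artist, editor, ...) has no branch and is ignored.

    def characters(self, content):
        if self._in_writer:
            self._text.append(content)

    def endElement(self, name):
        if name == "writer":
            self._in_writer = False
            if "".join(self._text).strip() == self.wanted:
                self.matches.append(self._title)


# Made-up sample collection using the assumed element names
COLLECTION = b"""<collection>
  <comic title="Sandman"><writer>Neil Gaiman</writer><artist>Sam Kieth</artist></comic>
  <comic title="Watchmen"><writer>Alan Moore</writer><artist>Dave Gibbons</artist></comic>
</collection>"""

search = WriterSearch("Neil Gaiman")
xml.sax.parseString(COLLECTION, search)
print(search.matches)
```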
Another advantage of SAX is that you don't have the whole document resident in memory at any one time, which matters if you are processing really huge documents.
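To illustrate the memory point, here is a small sketch that feeds a document to the parser piece by piece, so the complete text never exists as one string. It assumes the parser returned by xml.sax.make_parser() supports the incremental feed() and close() methods, which the standard expat-based parser does. ItemCounter and chunks() are names made up for this example.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ItemCounter(ContentHandler):
    """Count <item> elements without storing any of them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1


def chunks(n_items):
    """Yield a document in small pieces, standing in for a huge file."""
    yield b"<items>"
    for _ in range(n_items):
        yield b"<item/>"
    yield b"</items>"


parser = xml.sax.make_parser()
counter = ItemCounter()
parser.setContentHandler(counter)

for chunk in chunks(10000):
    parser.feed(chunk)   # events fire as soon as enough data has arrived
parser.close()           # signal end of input

print(counter.count)
```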
SAX defines four basic interfaces. A SAX-compliant XML parser can be passed any objects that support these interfaces, and will call various methods as data is processed. Your task, therefore, is to implement those interfaces that are relevant to your application.
The SAX interfaces are:
Interface | Purpose |
---|---|
ContentHandler | Called for general document events. This interface is the heart of SAX; its methods are called for the start of the document, the start and end of elements, and for the characters of data contained inside elements. |
DTDHandler | Called to handle DTD events required for basic parsing. This means notation declarations (XML spec section 4.7) and unparsed entity declarations (XML spec section 4). |
EntityResolver | Called to resolve references to external entities. If your documents will have no external entity references, you don't need to implement this interface. |
ErrorHandler | Called for error handling. The parser will call methods from this interface to report all warnings and errors. |
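The sketch below shows how each of the four interfaces is attached to a parser, with a custom ErrorHandler that records what the parser reports. CollectingErrorHandler is a hypothetical name, and the malformed input is contrived to trigger a fatal error.

```python
import xml.sax
from xml.sax.handler import (ContentHandler, DTDHandler,
                             EntityResolver, ErrorHandler)


class CollectingErrorHandler(ErrorHandler):
    """Remember every problem the parser reports."""

    def __init__(self):
        self.problems = []

    def warning(self, exception):
        self.problems.append(("warning", str(exception)))

    def error(self, exception):
        self.problems.append(("error", str(exception)))

    def fatalError(self, exception):
        self.problems.append(("fatal", str(exception)))
        raise exception   # parsing cannot continue after a fatal error


parser = xml.sax.make_parser()
parser.setContentHandler(ContentHandler())   # base classes used as placeholders
parser.setDTDHandler(DTDHandler())
parser.setEntityResolver(EntityResolver())
errors = CollectingErrorHandler()
parser.setErrorHandler(errors)

try:
    parser.feed(b"<a><b></a>")   # mismatched tags: not well-formed
    parser.close()
except xml.sax.SAXParseException:
    pass

print(errors.problems)
```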
Python doesn't support the concept of interfaces, so the interfaces
listed above are implemented as Python classes. For the most part, the
default method implementations do nothing (the method body is just a
Python pass statement), so usually you can simply ignore methods
that aren't relevant to your application.
Pseudo-code for using SAX looks something like this:
    import sys

    # Define your specialized handler classes
    from xml.sax import ContentHandler, ...

    class docHandler(ContentHandler):
        ...

    # Create an instance of the handler classes
    dh = docHandler()

    # Create an XML parser
    parser = ...

    # Tell the parser to use your handler instance
    parser.setContentHandler(dh)

    # Parse the file; your handler's methods will get called
    parser.parse(sys.stdin)
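One way to fill in the blanks of the outline above is sketched here. It assumes xml.sax.make_parser() as the way to obtain a parser, and it reads from an in-memory io.BytesIO object instead of sys.stdin so that it runs on its own. The DocHandler class and its events list are made up for the demonstration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class DocHandler(ContentHandler):
    """Log the order in which the parser calls the handler's methods."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def startElement(self, name, attrs):
        self.events.append("start:" + name)

    def endElement(self, name):
        self.events.append("end:" + name)

    def endDocument(self):
        self.events.append("endDocument")


# Create an instance of the handler class
dh = DocHandler()

# Create an XML parser (assumption: the default parser is what you want)
parser = xml.sax.make_parser()

# Tell the parser to use the handler instance
parser.setContentHandler(dh)

# Parse a file-like object; the handler's methods get called
parser.parse(io.BytesIO(b"<doc><para>Hello</para></doc>"))

print(dh.events)
```

To process standard input as in the outline, the last call could instead be given sys.stdin.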