5 SAX: The Simple API for XML

This HOWTO describes version 2 of SAX (also referred to as SAX2). Support is still present for SAX version 1, which is now only of historical interest; SAX1 will not be documented here.

SAX is most suitable for purposes where you want to read through an entire XML document from beginning to end, and perform some computation such as building a data structure or summarizing the contained information (computing an average value of a certain element, for example). SAX is not very convenient if you want to modify the document structure by changing how elements are nested, though it would be straightforward to write a SAX program that simply changed element contents or attributes. For example, you wouldn't want to re-order chapters in a book using SAX, but you might want to extract the contents of all name elements with the attribute lang set to 'greek'.
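As a rough illustration of the last case, the sketch below collects the text of every name element whose lang attribute is 'greek'. It assumes the xml.sax package from the Python standard library; the class name GreekNameHandler and the sample document are invented for this example.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class GreekNameHandler(ContentHandler):
    """Collect the text of every <name lang="greek"> element (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self.names = []      # completed names, in document order
        self._buffer = None  # text chunks while inside a matching element

    def startElement(self, name, attrs):
        # Start buffering only for the elements we care about.
        if name == "name" and attrs.get("lang") == "greek":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name" and self._buffer is not None:
            self.names.append("".join(self._buffer))
            self._buffer = None


# Made-up sample input, used only for this sketch.
SAMPLE = b"""<people>
  <name lang="greek">Sokrates</name>
  <name lang="latin">Cicero</name>
  <name lang="greek">Platon</name>
</people>"""

handler = GreekNameHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.names)
```

Every other element and attribute in the document is simply ignored, because the handler never looks at it.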

Two advantages of SAX are speed and simplicity. Let's say you've defined a complicated DTD for listing comic books, and you wish to scan through your collection and list everything written by Neil Gaiman. For this specialized task, there's no need to expend effort examining elements for artists and editors and colourists, because they're irrelevant to the search. You can therefore write a handler class that ignores every element other than writer.

Another advantage of SAX is that you don't have the whole document resident in memory at any one time, which matters if you are processing really huge documents.
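One way to see this is to feed the parser a document piece by piece. The sketch below assumes the default parser returned by xml.sax.make_parser() in the standard library, which is expat-based in CPython and supports the incremental feed() and close() methods; the ItemCounter class, the chunks() generator and the element names are made up for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ItemCounter(ContentHandler):
    """Count <item> elements without storing any of them (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1


def chunks(n_items):
    """Yield a made-up document in small pieces, standing in for a huge file."""
    yield b"<items>"
    for i in range(n_items):
        yield ("<item id='%d'/>" % i).encode("ascii")
    yield b"</items>"


parser = xml.sax.make_parser()
counter = ItemCounter()
parser.setContentHandler(counter)

# Only the current chunk is held in memory; the handler keeps a single integer.
for chunk in chunks(10000):
    parser.feed(chunk)
parser.close()

print(counter.count)
```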

SAX defines four basic interfaces. A SAX-compliant XML parser can be passed any objects that support these interfaces, and will call their methods as the data is processed. Your task, therefore, is to implement those interfaces that are relevant to your application.

Interface	Purpose
`ContentHandler`	Called for general document events. This interface is the heart of SAX; its methods are called for the start of the document, the start and end of elements, and for the characters of data contained inside elements.
`DTDHandler`	Called to handle DTD events required for basic parsing. This means notation declarations (XML spec section 4.7) and unparsed entity declarations (XML spec section 4).
`EntityResolver`	Called to resolve references to external entities. If your documents will have no external entity references, you don't need to implement this interface.
`ErrorHandler`	Called for error handling. The parser will call methods from this interface to report all warnings and errors.
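As a small illustration of the last interface, the sketch below records whatever the parser reports while reading a deliberately malformed document. It assumes the ErrorHandler base class and the parseString() function from the standard library's xml.sax package; the CollectingErrorHandler class and the sample input are invented for this example.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    """Record every problem the parser reports (hypothetical example)."""

    def __init__(self):
        self.problems = []

    def warning(self, exception):
        self.problems.append(("warning", exception.getMessage()))

    def error(self, exception):
        self.problems.append(("error", exception.getMessage()))

    def fatalError(self, exception):
        # Parsing cannot sensibly continue after a fatal error, so re-raise.
        self.problems.append(("fatal", exception.getMessage()))
        raise exception


# Made-up input with a mismatched tag, so it is not well-formed.
BROKEN = b"<doc><a></b></doc>"

errors = CollectingErrorHandler()
failed = False
try:
    xml.sax.parseString(BROKEN, ContentHandler(), errors)
except xml.sax.SAXParseException as exc:
    failed = True
    print("parse failed at line", exc.getLineNumber())

print(errors.problems)
```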

Python doesn't support the concept of interfaces, so the interfaces listed above are implemented as Python classes. Most of the default method implementations are defined to do nothing (the method body is just a Python pass statement), so usually you can simply ignore methods that aren't relevant to your application.

# Define your specialized handler classes
import sys
from xml.sax import ContentHandler, make_parser

class docHandler(ContentHandler):
    ...  # override the ContentHandler methods your application needs

# Create an instance of the handler class
dh = docHandler()

# Create an XML parser; make_parser() returns the default one,
# but any SAX2-compliant parser would do
parser = make_parser()

# Tell the parser to use your handler instance
parser.setContentHandler(dh)

# Parse the file; your handler's methods will get called
parser.parse(sys.stdin)
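To make the skeleton concrete, here is a minimal sketch that follows the same steps to compute an average value, one of the summarizing tasks SAX suits well. The PriceAverager class, the price element and the sample catalog are assumptions made up for this example, and an in-memory io.BytesIO object stands in for sys.stdin so that the example is self-contained.

```python
import io
from xml.sax import ContentHandler, make_parser


class PriceAverager(ContentHandler):
    """Average the numeric content of all <price> elements (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._text = None  # text chunks while inside a <price> element

    def startElement(self, name, attrs):
        if name == "price":
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "price":
            self.prices.append(float("".join(self._text)))
            self._text = None

    def average(self):
        # Return None rather than dividing by zero on an empty catalog.
        return sum(self.prices) / len(self.prices) if self.prices else None


# Made-up input; any file-like object with a read() method works here.
document = io.BytesIO(
    b"<catalog>"
    b"<item><price>2.50</price></item>"
    b"<item><price>3.50</price></item>"
    b"<item><price>6.00</price></item>"
    b"</catalog>"
)

handler = PriceAverager()
parser = make_parser()
parser.setContentHandler(handler)
parser.parse(document)

print(handler.average())
```

The handler keeps only the numbers it needs, so the rest of the document never has to be stored.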