6 DOM: The Document Object Model

With SAX you write a class which then gets the entire document poured through it as a sequence of method calls. An alternative approach is that taken by the Document Object Model, or DOM, which turns an XML document into a tree that's fully resident in memory.

A top-level Document instance is the root of the tree, and has a single child which is the top-level Element instance; this Element has child nodes representing the content and any sub-elements, which may in turn have further children and so forth. There are different classes for everything that can be found in an XML document, so in addition to the Element class, there are also classes such as Text, Comment, CDATASection, EntityReference, and so on. Nodes have methods for accessing the parent and child nodes, accessing element and attribute values, inserting and deleting nodes, and converting the tree back into XML.
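As a rough illustration of that tree structure, here is a minimal sketch that parses a small document and walks the resulting nodes. It assumes the standard library's xml.dom.minidom module as the DOM implementation, which may not be the one used elsewhere in this HOWTO; the sample XML and the walk() helper are hypothetical and exist only for this example.

```python
from xml.dom import Node, minidom

# Hypothetical sample document, used only for illustration.
SAMPLE = """<collection>
  <!-- hypothetical sample data -->
  <comic title="Sandman" number="62">Neil Gaiman</comic>
</collection>"""

# Assumption: xml.dom.minidom stands in for whichever DOM implementation you use.
doc = minidom.parseString(SAMPLE)

# The Document node sits at the top; its documentElement is the root Element.
root = doc.documentElement


def walk(node, depth=0, lines=None):
    """Collect one indented line per node, showing its class and node name."""
    if lines is None:
        lines = []
    lines.append("  " * depth + type(node).__name__ + " " + node.nodeName)
    for child in node.childNodes:
        walk(child, depth + 1, lines)
    return lines


outline = walk(doc)
print("\n".join(outline))

# Reading an attribute value and text content from an Element.
comic = root.getElementsByTagName("comic")[0]
title = comic.getAttribute("title")
author = comic.firstChild.data
print(title, "by", author)
```

Note that the whitespace between elements shows up in the outline as Text nodes of its own, which is worth keeping in mind when counting or indexing children.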

The DOM is often useful for modifying XML documents, because you can create a DOM tree, modify it by adding new nodes and moving subtrees around, and then produce a new XML document as output. The drawback is memory use: while the DOM doesn't require that the entire tree be resident in memory at one time, the Python DOM implementation currently keeps the whole tree in RAM, so you may not have enough memory to process very large documents as a DOM tree. A SAX handler, by contrast, can potentially churn through amounts of data far larger than the available RAM.
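The modify-and-write-back workflow can be sketched as follows. This again assumes xml.dom.minidom from the standard library rather than any particular implementation discussed in this HOWTO, and the element names are made up for the example.

```python
from xml.dom import minidom

# Hypothetical input document, used only for illustration.
doc = minidom.parseString("<list><item>first</item><item>second</item></list>")
root = doc.documentElement
first, second = root.getElementsByTagName("item")

# Add a new node: create an Element and its Text child, then attach them.
third = doc.createElement("item")
third.appendChild(doc.createTextNode("third"))
root.appendChild(third)

# Move a subtree: inserting a node that already has a parent
# detaches it from its old position first.
root.insertBefore(second, first)

# Delete a node from the tree.
root.removeChild(first)

# Convert the modified tree back into XML text.
result = root.toxml()
print(result)
```

Because the whole tree lives in memory, each of these operations is a simple method call on a node; the same changes in a SAX handler would require tracking state across many events.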

This HOWTO can't be a complete introduction to the Document Object Model, because there are lots of interfaces and lots of methods. Luckily, the DOM Recommendation is quite readable, so I'd recommend that you read it to get a complete picture of the available interfaces. This section will only be a partial overview.

