1.3 DTDs

Well-formedness only guarantees that all tags nest properly and that every opening tag is matched by a closing tag. It says nothing about the order of elements or about which elements can be contained inside other elements.

The following XML, apparently representing a book, is well-formed but it doesn't match the structure expected for a book:

<book>
  <index>  ... </index>
  <chapter> ... </chapter>
  <chapter> ... </chapter>
  <abstract>  ... </abstract>
  <chapter> ... </chapter>
  <preface> ... </preface>
</book>

Prefaces don't come at the end of books, the index doesn't belong at the front, and the abstract doesn't belong in the middle. Well-formedness alone provides no way of enforcing such an order. You could write a Python program that takes an XML file like this and checks whether all the parts are in order, but then someone wanting to understand which documents are legal would have to read your program.
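
As a rough illustration of such a hand-written check, here is a minimal sketch using the standard library's xml.etree.ElementTree module. The EXPECTED_ORDER list and the parts_in_order() function are assumptions invented for this example; nothing in XML itself defines them.

```python
import xml.etree.ElementTree as ET

# Hypothetical ordering of the parts of a book, assumed for illustration.
EXPECTED_ORDER = ["abstract", "preface", "chapter", "index"]

def parts_in_order(xml_text):
    """Return True if the root's children never go backwards in EXPECTED_ORDER."""
    root = ET.fromstring(xml_text)
    last_rank = 0
    for child in root:
        if child.tag not in EXPECTED_ORDER:
            return False          # unknown part
        rank = EXPECTED_ORDER.index(child.tag)
        if rank < last_rank:
            return False          # this part appears too late
        last_rank = rank
    return True

# The out-of-order book from above, with empty elements for brevity.
BAD_BOOK = """<book>
  <index/>
  <chapter/>
  <chapter/>
  <abstract/>
  <chapter/>
  <preface/>
</book>"""

GOOD_BOOK = """<book>
  <abstract/>
  <preface/>
  <chapter/>
  <chapter/>
  <index/>
</book>"""
```

Here parts_in_order(BAD_BOOK) is false and parts_in_order(GOOD_BOOK) is true. The rules are buried in the code, though: a reader has to work out from the loop that parts may repeat and may be left out.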

Document Type Definitions, or DTDs for short, are a more concise way of enforcing ordering and nesting rules. A DTD declares the element names that are allowed, and how elements can be nested inside each other. To take an example from HTML, the LI element, representing an entry in a list, can only occur inside certain elements which represent lists, such as OL or UL. The DTD also specifies the attributes that can be provided for each element, the default value for each attribute, and whether the attribute can be omitted. A validating parser can take a document and a DTD, and check whether the document is legal according to the DTD's rules. (The PyXML package includes a validating parser called xmlproc.)

DTDs are an example of a schema language, a language for specifying a set of legal XML documents. Some applications want stricter control over which documents are legal, so stricter schema languages exist. XML Schema provides a type system and a number of basic types, so you can say that the value of an attribute must be a number or a date. RELAX NG is another schema language that provides more power and flexibility than XML Schema, but is simpler to read and implement.

Note that it's quite possible to get useful work done without using any schema language at all. You might decide that just writing well-formed XML and checking it with a Python program is all you need. There's no reason to drag in a schema language if it won't be useful.

Let's return to DTDs. A DTD lists the supported elements, the order in which elements must occur, and the possible attributes for each element. Here's a fragment from an imaginary DTD for writing books:

<!ELEMENT book (abstract?, preface, chapter*, appendix?)>
<!ELEMENT abstract ...>
<!ELEMENT chapter ...>
<!ATTLIST chapter id    ID    #REQUIRED 
                  title CDATA #IMPLIED>

The first line declares the book element, and specifies the elements that can occur inside it and the order in which the subelements must be provided. DTDs borrow from regular expression notation to express how elements can be repeated: "?" means an element must occur 0 or 1 times, "*" means 0 or more times, and "+" means 1 or more times. For example, the above declarations imply that the abstract and appendix elements are optional inside a book element. Exactly one preface element has to be present, and it can be followed by any number of chapter elements; having no chapters at all would be legal.
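
Because the notation is borrowed from regular expressions, the content model for book can be mimicked with Python's re module. The following is only a sketch for building intuition, not how a validating parser is implemented; BOOK_MODEL and matches_book_model() are names made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# (abstract?, preface, chapter*, appendix?) rewritten as a regular
# expression over child element names, each name followed by one space.
BOOK_MODEL = re.compile(r"(abstract )?(preface )(chapter )*(appendix )?")

def matches_book_model(xml_text):
    """Return True if the root is a book whose children fit BOOK_MODEL."""
    root = ET.fromstring(xml_text)
    if root.tag != "book":
        return False
    # Element names cannot contain spaces, so a space is a safe separator.
    names = "".join(child.tag + " " for child in root)
    return BOOK_MODEL.fullmatch(names) is not None
```

A book containing only a preface matches, as does one with an abstract, a preface, several chapters, and an appendix. A book with a chapter before the preface, with two prefaces, or with no preface at all does not.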

The ATTLIST declaration specifies attributes for the chapter element. Chapters can have two attributes, id and title. title contains character data (CDATA) and is optional (that's what "#IMPLIED" means, for obscure historical reasons). id must contain an ID value, and it's required.
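
To make these attribute rules concrete, here is a minimal sketch that checks them by hand with xml.etree.ElementTree. The function name chapter_attribute_errors() is made up for this example. It assumes ID values only need to be unique among chapters, and it skips the further XML rule that an ID value must be a legal XML name.

```python
import xml.etree.ElementTree as ET

def chapter_attribute_errors(xml_text):
    """Return messages for chapter attributes that break the ATTLIST above."""
    errors = []
    seen_ids = set()
    root = ET.fromstring(xml_text)
    for number, chapter in enumerate(root.iter("chapter"), start=1):
        # Only id and title are declared for chapter.
        for name in chapter.attrib:
            if name not in ("id", "title"):
                errors.append(f"chapter {number}: undeclared attribute {name!r}")
        chapter_id = chapter.get("id")
        if chapter_id is None:
            # id is #REQUIRED, so leaving it out is an error.
            errors.append(f"chapter {number}: missing required attribute 'id'")
        elif chapter_id in seen_ids:
            # Values of type ID must be unique within a document.
            errors.append(f"chapter {number}: duplicate id {chapter_id!r}")
        else:
            seen_ids.add(chapter_id)
        # title is #IMPLIED, so its absence is not an error.
    return errors
```

An empty list means every chapter satisfied the declaration; otherwise each message names the offending chapter by its position in the document.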

A validating parser could take this DTD and a sample document, and report whether the document is valid according to the rules of the DTD. A document is valid if all the elements occur in the right order and in the right number of repetitions, and if every element's attributes satisfy their declarations.