5.3 Searching Element Content

Let's tackle a slightly more complicated task: printing out all issues written by a certain author. This now requires looking at element content, because the writer's name is inside a writer element: <writer>Peter Milligan</writer>.

The search will be performed using the following algorithm:

  1. The startElement method will be more complicated. For comic elements, the handler has to save the title and number, in case this comic is later found to match the search criterion. For writer elements, it sets an inWriterContent flag to true and resets a writerName attribute to the empty string.

  2. Characters outside of XML tags must be processed. When inWriterContent is true, these characters must be added to the writerName string.

  3. When the writer element is finished, we've now collected all of the element's content in the writerName attribute, so we can check if the name matches the one we're searching for, and if so, print the information about this comic. We must also set inWriterContent back to false.

Here's the first part of the code; this implements step 1.

from xml.sax import ContentHandler

def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return ' '.join(text.split())

class FindWriter(ContentHandler):
    def __init__(self, search_name):
        # Save the name we're looking for
        self.search_name = normalize_whitespace(search_name)

        # Initialize the flag to false
        self.inWriterContent = False

    def startElement(self, name, attrs):
        # If it's a comic element, save the title and issue
        if name == 'comic':
            title = normalize_whitespace(attrs.get('title', ""))
            number = normalize_whitespace(attrs.get('number', ""))
            self.this_title = title
            self.this_number = number

        # If it's the start of a writer element, set flag
        elif name == 'writer':
            self.inWriterContent = True
            self.writerName = ""

The startElement() method has been discussed previously. Now we have to look at how the content of elements is processed.

The normalize_whitespace() function is important, and you'll probably use it in your own code. XML treats whitespace very flexibly; you can include extra spaces or newlines wherever you like. This means that you must normalize the whitespace before comparing attribute values or element content; otherwise the comparison might produce an incorrect result due to the content of two elements having different amounts of whitespace.
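As a small illustration of why this matters, the sketch below compares two spellings of the same name that differ only in whitespace. The variable names a and b are made up for this example; normalize_whitespace() is repeated so the snippet runs on its own.

```python
def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return ' '.join(text.split())

# Two spellings of the same name, differing only in whitespace
a = "Peter Milligan"
b = "  Peter\n      Milligan "

# A raw comparison fails; the normalized comparison succeeds
print(a == b)
print(normalize_whitespace(a) == normalize_whitespace(b))
```

The first comparison prints False and the second prints True, because split() discards leading, trailing, and repeated whitespace before the words are rejoined with single spaces.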

    def characters(self, ch):
        if self.inWriterContent:
            self.writerName = self.writerName + ch

The characters() method is called for character data that appears between tags rather than inside them. ch is a string of characters. Don't assume a particular string type: the SAX interface allows a parser to pass either a text (Unicode) string or a byte string, although the expat-based parser bundled with Python always passes text strings.

You also shouldn't assume that all the characters are passed in a single function call. In the example above, the parser might make only one call to characters() for the string "Peter Milligan", or it might call characters() once for each character. Another, more realistic example: if the content contains an entity reference, as in "Wagner &amp; Seagle", the parser might call the method three times: once for "Wagner ", once for "&", represented by the entity reference, and again for " Seagle".
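To see how a particular parser splits the text, a handler can simply record each call. The following is a minimal sketch written for Python 3; the ChunkRecorder class and its chunks attribute are hypothetical names invented for this illustration.

```python
from xml.sax import ContentHandler, parseString

class ChunkRecorder(ContentHandler):
    """Hypothetical handler that records every characters() call."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, ch):
        # Store each piece exactly as the parser delivered it
        self.chunks.append(ch)

recorder = ChunkRecorder()
parseString(b"<writer>Wagner &amp; Seagle</writer>", recorder)

# The number of pieces depends on the parser in use
print(recorder.chunks)
# Joining the pieces always gives back the full content
print(''.join(recorder.chunks))
```

However many pieces the parser delivers, joining them yields "Wagner & Seagle", which is why the handler accumulates the text rather than relying on any single call.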

For step 2 of the algorithm, characters() only has to check inWriterContent, and if it's true, add the characters to the string being built up.

Finally, when the writer element ends, the entire name has been collected, so we can compare it to the name we're searching for.

    def endElement(self, name):
        if name == 'writer':
            self.inWriterContent = False
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                print('Found:', self.this_title, self.this_number)

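To avoid being confused by differing whitespace, endElement() passes the collected name through normalize_whitespace() before comparing it; the search name was normalized the same way in __init__(). This can be done because we know that leading and trailing whitespace, and runs of extra whitespace between words, are insignificant for this application.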
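Putting the three pieces together, the handler can be tried on a small document. This is a sketch for Python 3 under a few assumptions: the sample XML, the collection root element, and the matches list are invented for this illustration, and xml.sax.parseString() is used so that no file is needed.

```python
from xml.sax import ContentHandler, parseString

def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return ' '.join(text.split())

class FindWriter(ContentHandler):
    def __init__(self, search_name):
        ContentHandler.__init__(self)
        self.search_name = normalize_whitespace(search_name)
        self.inWriterContent = False
        # Assumption for this sketch: also keep matches in a list
        self.matches = []

    def startElement(self, name, attrs):
        # Step 1: remember the current comic, or start collecting a name
        if name == 'comic':
            self.this_title = normalize_whitespace(attrs.get('title', ""))
            self.this_number = normalize_whitespace(attrs.get('number', ""))
        elif name == 'writer':
            self.inWriterContent = True
            self.writerName = ""

    def characters(self, ch):
        # Step 2: accumulate text while inside a writer element
        if self.inWriterContent:
            self.writerName = self.writerName + ch

    def endElement(self, name):
        # Step 3: compare the collected name and report a match
        if name == 'writer':
            self.inWriterContent = False
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                self.matches.append((self.this_title, self.this_number))
                print('Found:', self.this_title, self.this_number)

# Invented sample data; note the extra whitespace in the second writer
SAMPLE = b"""<collection>
  <comic title="Sandman" number="62">
    <writer>Neil Gaiman</writer>
  </comic>
  <comic title="Shade, the Changing Man" number="7">
    <writer>
      Peter   Milligan
    </writer>
  </comic>
</collection>"""

handler = FindWriter("Peter Milligan")
parseString(SAMPLE, handler)
```

With this sample, only the second comic is reported, even though its writer element contains extra spaces and newlines.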

End tags can't have attributes on them, so there's no attrs parameter to the endElement() method. Empty elements with attributes, such as <arc name="Season of Mists"/>, will result in a call to startElement(), followed immediately by a call to endElement().
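The event sequence for an empty element can be checked with a handler that logs what it receives. This is a minimal sketch for Python 3; EventLogger and its events list are hypothetical names used only for this illustration.

```python
from xml.sax import ContentHandler, parseString

class EventLogger(ContentHandler):
    """Hypothetical handler that logs start and end events."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        # Copy the attributes into a plain dict for easy inspection
        self.events.append(('start', name, dict(attrs.items())))

    def endElement(self, name):
        # No attrs parameter here: end tags carry no attributes
        self.events.append(('end', name))

logger = EventLogger()
parseString(b'<arc name="Season of Mists"/>', logger)
for event in logger.events:
    print(event)
```

The log contains a start event carrying the name attribute, followed directly by an end event for the same element.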