Let's tackle a slightly more complicated task: printing out all issues
written by a certain author. This now requires looking at element
content, because the writer's name is inside a writer element:
<writer>Peter Milligan</writer>.
The search will be performed using the following algorithm:
1. The startElement() method becomes more complicated. For comic
   elements, the handler has to save the title and number, in case
   this comic is later found to match the search criterion. For writer
   elements, it sets an inWriterContent flag to true, and sets a
   writerName attribute to the empty string.
2. Characters outside of XML tags must be processed. When
   inWriterContent is true, these characters must be added to the
   writerName string.
3. When the writer element is finished, we've collected all of the
   element's content in the writerName attribute, so we can check
   whether the name matches the one we're searching for, and if so,
   print the information about this comic. We must also set
   inWriterContent back to false.
Here's the first part of the code; this implements step 1.
    from xml.sax import ContentHandler

    def normalize_whitespace(text):
        "Remove redundant whitespace from a string"
        return ' '.join(text.split())

    class FindWriter(ContentHandler):
        def __init__(self, search_name):
            # Save the name we're looking for
            self.search_name = normalize_whitespace(search_name)

            # Initialize the flag to false
            self.inWriterContent = False

        def startElement(self, name, attrs):
            # If it's a comic element, save the title and issue
            if name == 'comic':
                title = normalize_whitespace(attrs.get('title', ""))
                number = normalize_whitespace(attrs.get('number', ""))
                self.this_title = title
                self.this_number = number

            # If it's the start of a writer element, set flag
            elif name == 'writer':
                self.inWriterContent = True
                self.writerName = ""
The startElement() method has been discussed previously. Now we have to look at how the content of elements is processed.
The normalize_whitespace() function is important, and you'll probably use it in your own code. XML treats whitespace very flexibly; you can include extra spaces or newlines wherever you like. This means that you must normalize the whitespace before comparing attribute values or element content; otherwise the comparison might produce an incorrect result due to the content of two elements having different amounts of whitespace.
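As a small illustration of why this matters, the sketch below compares two spellings of the same name that differ only in whitespace. The sample strings are made up for the example, and the code is written for Python 3.

```python
def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return ' '.join(text.split())

# Two spellings of the same name that differ only in whitespace.
# Both strings are invented for this illustration.
from_document = "  Peter\n      Milligan "
from_user = "Peter Milligan"

# A direct comparison fails because of the extra whitespace.
print(from_document == from_user)    # False

# After normalization, the two strings compare equal.
print(normalize_whitespace(from_document) == normalize_whitespace(from_user))    # True
```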
    def characters(self, ch):
        if self.inWriterContent:
            self.writerName = self.writerName + ch
The characters() method is called for characters that aren't inside XML tags. ch is a string of characters. It is not necessarily a byte string; parsers may also provide a buffer object that is a slice of the full document, or they may pass Unicode objects.
You also shouldn't assume that all the characters are passed in a single function call. In the example above, there might be only one call to characters() for the string "Peter Milligan", or the parser might call characters() once for each character. Another, more realistic example: if the content contains an entity reference, as in "Wagner &amp; Seagle", the parser might call the method three times: once for "Wagner ", once for "&", represented by the entity reference, and again for " Seagle".
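To see how a particular parser splits up character data, one option is to record every call. The following is a minimal sketch for Python 3; ChunkRecorder is a hypothetical helper class that is not part of the example program, and the number of chunks you see depends on the parser in use. Only the joined result is guaranteed.

```python
import xml.sax
from xml.sax import ContentHandler

class ChunkRecorder(ContentHandler):
    "Hypothetical helper: remember each string passed to characters()"
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, ch):
        # Each call may deliver only part of the element's content
        self.chunks.append(ch)

recorder = ChunkRecorder()
xml.sax.parseString(b"<writer>Wagner &amp; Seagle</writer>", recorder)

# The individual pieces vary by parser; the concatenation does not.
print(recorder.chunks)
print(''.join(recorder.chunks))
```

Whatever the chunking turns out to be, joining the pieces gives back the full text, which is why the handler accumulates into a string instead of using any single call's argument.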
For step 2 of the algorithm, characters() only has to check
inWriterContent, and if it's true, add the characters to the string
being built up.
Finally, when the writer element ends, the entire name has been
collected, so we can compare it to the name we're searching for.
    def endElement(self, name):
        if name == 'writer':
            # The writer element is finished, so clear the flag
            self.inWriterContent = False
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                print('Found:', self.this_title, self.this_number)
To avoid being confused by differing whitespace, the normalize_whitespace() function is called. This can be done because we know that leading and trailing whitespace are insignificant for this application.
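Putting the three pieces together, a complete run might look like the sketch below. It is written for Python 3 and makes several assumptions that are not part of the original program: the sample document (including its collection root element and the issue numbers) is invented, and the handler keeps a matches list alongside the printed output so the result can be inspected afterwards.

```python
import xml.sax
from xml.sax import ContentHandler

def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return ' '.join(text.split())

class FindWriter(ContentHandler):
    def __init__(self, search_name):
        super().__init__()
        self.search_name = normalize_whitespace(search_name)
        self.inWriterContent = False
        # Assumption for this sketch: also collect matches in a list
        self.matches = []

    def startElement(self, name, attrs):
        # Step 1: remember the comic, or start collecting a writer name
        if name == 'comic':
            self.this_title = normalize_whitespace(attrs.get('title', ""))
            self.this_number = normalize_whitespace(attrs.get('number', ""))
        elif name == 'writer':
            self.inWriterContent = True
            self.writerName = ""

    def characters(self, ch):
        # Step 2: accumulate content while inside a writer element
        if self.inWriterContent:
            self.writerName = self.writerName + ch

    def endElement(self, name):
        # Step 3: compare the collected name and report a match
        if name == 'writer':
            self.inWriterContent = False
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                self.matches.append((self.this_title, self.this_number))
                print('Found:', self.this_title, self.this_number)

# Invented sample data; note the line break inside the second writer name.
SAMPLE = b"""<collection>
  <comic title="Sandman" number="62">
    <writer>Neil Gaiman</writer>
  </comic>
  <comic title="Shade, the Changing Man" number="7">
    <writer>Peter
       Milligan</writer>
  </comic>
</collection>"""

handler = FindWriter("Peter  Milligan")
xml.sax.parseString(SAMPLE, handler)
```

Both the search name and the element content contain irregular whitespace here, so the match succeeds only because each side is normalized before the comparison.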
End tags can't have attributes on them, so there's no attrs parameter to the endElement() method. Empty elements with attributes, such as <arc name="Season of Mists"/>, will result in a call to startElement(), followed immediately by a call to endElement().
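The behaviour for empty elements can be checked with a small event logger. This is a sketch for Python 3; EventLogger is a hypothetical class introduced only for this demonstration.

```python
import xml.sax
from xml.sax import ContentHandler

class EventLogger(ContentHandler):
    "Hypothetical helper: record start and end events in order"
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Attributes are only available here, on the start event
        self.events.append(('start', name, dict(attrs.items())))

    def endElement(self, name):
        # No attrs parameter: end tags carry no attributes
        self.events.append(('end', name))

logger = EventLogger()
xml.sax.parseString(b'<arc name="Season of Mists"/>', logger)
print(logger.events)
# [('start', 'arc', {'name': 'Season of Mists'}), ('end', 'arc')]
```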