XML
XML stands for Extensible Markup Language and belongs to the same family of technologies
as XSLT, XPath, XLink, WDDX and so on.
XML uses a tag-based structure, much as HTML does.
XML lets you define your own tags.
Data in XML format usually has to be transformed into something more readable.
This is what XSL, the Extensible Stylesheet Language, is for.
Most browsers do not have built-in XML parsers or XSL processors.
The solution is to write an intermediate layer between the client and the server
that parses the XML and returns readable output.
This is where Perl comes in: it supports XML parsing
(the DOM and XML packages) and can perform XSL transformations (the Sablotron processor).
Perl offers two approaches to parsing XML.
The first of these approaches is SAX, the Simple API for XML. A SAX
parser works by traversing an XML document and calling specific
functions as it encounters different types of tags. For example, I
might call a specific function to process a starting tag, another
function to process an ending tag, and a third function to process the
data between them.
The parser's responsibility is simply to parse the document; the
functions it calls are responsible for processing the tags found. Once
the tag is processed, the parser moves on to the next element in the
document, and the process repeats itself.
Perl has a SAX-style parser based on the expat library created by
James Clark; it's implemented as a Perl package named XML::Parser, and
currently maintained by Clark Cooper. It is not part of the core Perl
distribution, so if you don't already have it, you should download and
install it before proceeding further; you can get a
copy from http://wwwx.netheaven.com/~coopercc/xmlparser/, or from CPAN (http://www.cpan.org/).
I'll begin by putting together a simple XML file:
<?xml version="1.0"?>
<library>
<book>
<title>Dreamcatcher</title>
<author>Stephen King</author>
<genre>Horror</genre>
<pages>899</pages>
<price>23.99</price>
<rating>5</rating>
</book>
<book>
<title>Mystic River</title>
<author>Dennis Lehane</author>
<genre>Thriller</genre>
<pages>390</pages>
<price>17.49</price>
<rating>4</rating>
</book>
<book>
<title>The Lord Of The Rings</title>
<author>J. R. R. Tolkien</author>
<genre>Fantasy</genre>
<pages>3489</pages>
<price>10.99</price>
<rating>5</rating>
</book>
</library>
Once my data is in XML-compliant format, I need to decide what I'd like the final output to look like.
Let's say I want a simple table containing columns for the book
title, author, price and rating. (I'm not using all the information in
the XML file.) The title of the book is printed in italics, while the
numerical rating is converted into something more readable.
Next, I'll write some Perl code to take care of this for me.
The first order of business is to initialize the XML parser, and set up the callback functions.
#!/usr/bin/perl
# include package
use XML::Parser;
# initialize parser
$xp = new XML::Parser();
# set callback functions
$xp->setHandlers(Start => \&start, End => \&end, Char => \&cdata);
# parse XML
$xp->parsefile("library.xml");
The parser is initialized in the ordinary way - by instantiating a new
object of the Parser class. This object is assigned to the variable
$xp, and is used in subsequent function calls.
# initialize parser
$xp = new XML::Parser();
The next step is to specify the functions to be executed when the
parser encounters the opening and closing tags of an element. The
setHandlers() method is used to specify these functions; it accepts a
hash of values, with keys containing the events to watch out for, and
values indicating which functions to trigger.
# set callback functions
$xp->setHandlers(Start => \&start, End => \&end, Char => \&cdata);
In this case, the user-defined functions start() and end() are called
when starting and ending element tags are encountered, while character
data triggers the cdata() function.
Obviously, these aren't the only types of events a parser can be set up
to handle - the XML::Parser package allows you to specify handlers for
a diverse array of events; I'll discuss these briefly a little later.
The next step in the script above is to open the XML file, read it and
parse it via the parsefile() method. The parsefile() method will
iterate through the XML document, calling the appropriate handling
function each time it encounters a specific data type.
# parse XML
$xp->parsefile("library.xml");
In case your XML data is not stored in a file, but in a string variable
- quite likely if, for example, you've generated it dynamically from a
database - you can replace the parsefile() method with the parse()
method, which accepts a string variable containing the XML document,
rather than a filename.
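As a minimal sketch of the string-based variant, the script below feeds parse() a short document held in a variable. The XML string and the @seen array are made up for illustration; only the Start handler is registered, so other events are simply ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# XML held in a string, e.g. built from a database query (illustrative data)
my $xml = '<?xml version="1.0"?><library><book><title>Dreamcatcher</title></book></library>';

# element names, in the order their start tags are found
my @seen;

my $xp = XML::Parser->new();
$xp->setHandlers(Start => sub { my ($parser, $name) = @_; push @seen, $name; });

# parse() takes the document itself rather than a filename
$xp->parse($xml);

print join(", ", @seen), "\n";
```

Everything else in the script stays the same; only the call that starts the parse changes.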
Once the document has been completely parsed, the script will proceed
to the next line (if there is one), or terminate gracefully. A parse
error - for example, a mismatched tag or a badly-nested element - will
cause the script to die immediately.
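If dying is not what you want, one common Perl idiom is to wrap the parse in an eval block and inspect $@ afterwards. The sketch below assumes a hypothetical helper named try_parse() and a deliberately broken document; it is one way to handle the error, not the only one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# deliberately malformed: <title> is closed by </book>
my $bad = '<library><book><title>Dreamcatcher</book></library>';

my $xp = XML::Parser->new();

# hypothetical helper: returns undef on success, or the error message on failure
sub try_parse {
    my ($parser, $xml) = @_;
    my $ok = eval { $parser->parse($xml); 1 };
    return $ok ? undef : $@;
}

my $error = try_parse($xp, $bad);
print "Parse failed: $error" if defined $error;
```

In a CGI script this makes it possible to print a friendly error page instead of a half-finished table.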
As you can see, this is fairly simple - simpler, in fact, than the
equivalent process in other languages like PHP or Java. Don't get
worried, though - this simplicity conceals a fair amount of power.
As I've just explained, the start(), end() and cdata() functions will
be called by the parser as it progresses through the document. We
haven't defined these yet - let's do that next:
# keep track of which tag is currently being processed
$currentTag = "";
# this is called when a start tag is found
sub start
{
# extract variables
my ($parser, $name, %attr) = @_;
$currentTag = lc($name);
if ($currentTag eq "book")
{
print "<tr>";
}
elsif ($currentTag eq "title")
{
print "<td>";
}
elsif ($currentTag eq "author")
{
print "<td>";
}
elsif ($currentTag eq "price")
{
print "<td>";
}
elsif ($currentTag eq "rating")
{
print "<td>";
}
}
Each time the parser encounters a starting tag, it calls start() with
the name of the tag (and attributes, if any) as arguments. The start()
function then processes the tag, printing corresponding HTML markup in
place of the XML tag.
I've used an "if" statement, keyed on the tag name, to decide how to
process each tag. For example, since I know that <book> indicates
the beginning of a new row in my desired output, I replace it with a
<tr>, while other elements like <title> and <author>
correspond to table cells, and are replaced with <td> tags.
In case you're wondering, I've used the lc() function to convert the
tag name to lowercase before performing the comparison. Strictly
speaking, XML is case-sensitive, so <Book> and <book> are different
elements; lowercasing is a convenience that lets the script work
with XML documents that use upper-case or mixed-case tags.
Finally, I've also stored the current tag name in the global variable
$currentTag - this can be used to identify which tag is being processed
at any stage, and it'll come in useful a little further down.
The end() function takes care of closing tags, and looks similar to
start() - note that I've specifically cleaned up $currentTag at the end.
# this is called when an end tag is found
sub end
{
my ($parser, $name) = @_;
$currentTag = lc($name);
if ($currentTag eq "book")
{
print "</tr>";
}
elsif ($currentTag eq "title")
{
print "</td>";
}
elsif ($currentTag eq "author")
{
print "</td>";
}
elsif ($currentTag eq "price")
{
print "</td>";
}
elsif ($currentTag eq "rating")
{
print "</td>";
}
# clear value of current tag
$currentTag = "";
}
Note that empty elements generate both start and end events.
So this takes care of replacing XML tags with corresponding HTML tags...but what about handling the data between them?
# this is called when CDATA is found
sub cdata
{
my ($parser, $data) = @_;
my @ratings = ("Words fail me!", "Terrible", "Bad", "Indifferent", "Good", "Excellent");
if ($currentTag eq "title")
{
print "<i>$data</i>";
}
elsif ($currentTag eq "author")
{
print $data;
}
elsif ($currentTag eq "price")
{
print "\$$data";
}
elsif ($currentTag eq "rating")
{
print $ratings[$data];
}
}
The cdata() function is called whenever the parser encounters data
between an XML tag pair. Note, however, that the function is passed
only the parser handle and the data as arguments; the data itself
carries no indication of which tags are around it. However, since the
parser processes XML chunk-by-chunk, we can use the $currentTag
variable to identify which tag this data belongs to.
Depending on the value of $currentTag, an "if" statement is used to
print data with appropriate formatting; this is the place where I add
italics to the title, a currency symbol to the price, and a text rating
(corresponding to a numerical index) from the @ratings array.
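An alternative to the global variable is to ask the parser handle itself which element is open. The sketch below assumes the current_element() method that XML::Parser::Expat provides on the handle passed to each callback; the %text hash and the sample document are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my $xml = '<book><title>Mystic River</title><price>17.49</price></book>';

# text collected per element name
my %text;

sub cdata {
    my ($parser, $data) = @_;
    # ask the parser which element is currently open
    my $tag = $parser->current_element();
    # append, since one run of text may arrive in several pieces
    $text{$tag} .= $data if defined $tag;
}

my $xp = XML::Parser->new();
$xp->setHandlers(Char => \&cdata);
$xp->parse($xml);

print "$_: $text{$_}\n" for sort keys %text;
```

The trade-off is that the handler no longer depends on start() and end() keeping $currentTag up to date.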
Here's what the finished script (with some additional HTML, so that you can use it via CGI) looks like:
#!/usr/bin/perl
# include package
use XML::Parser;
# initialize parser
$xp = new XML::Parser();
# set callback functions
$xp->setHandlers(Start => \&start, End => \&end, Char => \&cdata);
# keep track of which tag is currently being processed
$currentTag = "";
# send standard header to browser
print "Content-Type: text/html\n\n";
# set up HTML page
print "<html><head></head><body>";
print "<h2>The Library</h2>";
print "<table border=1 cellspacing=1 cellpadding=5>";
print "<tr><td align=center>Title</td><td align=center>Author</td><td align=center>Price</td><td align=center>User Rating</td></tr>";
# parse XML
$xp->parsefile("library.xml");
print "</table></body></html>";
# this is called when a start tag is found
sub start
{
# extract variables
my ($parser, $name, %attr) = @_;
$currentTag = lc($name);
if ($currentTag eq "book")
{
print "<tr>";
}
elsif ($currentTag eq "title")
{
print "<td>";
}
elsif ($currentTag eq "author")
{
print "<td>";
}
elsif ($currentTag eq "price")
{
print "<td>";
}
elsif ($currentTag eq "rating")
{
print "<td>";
}
}
# this is called when CDATA is found
sub cdata
{
my ($parser, $data) = @_;
my @ratings = ("Words fail me!", "Terrible", "Bad", "Indifferent", "Good", "Excellent");
if ($currentTag eq "title")
{
print "<i>$data</i>";
}
elsif ($currentTag eq "author")
{
print $data;
}
elsif ($currentTag eq "price")
{
print "\$$data";
}
elsif ($currentTag eq "rating")
{
print $ratings[$data];
}
}
# this is called when an end tag is found
sub end
{
my ($parser, $name) = @_;
$currentTag = lc($name);
if ($currentTag eq "book")
{
print "</tr>";
}
elsif ($currentTag eq "title")
{
print "</td>";
}
elsif ($currentTag eq "author")
{
print "</td>";
}
elsif ($currentTag eq "price")
{
print "</td>";
}
elsif ($currentTag eq "rating")
{
print "</td>";
}
# clear value of current tag
$currentTag = "";
}
# end
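When you run it, you'll see an HTML table with a header row followed by one row per book, showing the title in italics, the author, the price and the text rating.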
You can now add new items to your XML document, or edit existing items,
and your rendered HTML page will change accordingly. By separating the
data from the presentation, XML has imposed standards on data
collections, making it possible, for example, for users with no
technical knowledge of HTML to easily update content on a Web site, or
to present data from a single source in different ways.
In addition to elements and CDATA, XML::Parser also allows you to set up
handlers for other types of XML structures, most notably PIs, entities
and notations (if you don't know what these are, you might want to skip
this section and jump straight to the more complex example that
follows it). As demonstrated in the previous example, handlers for
these structures are set up by specifying appropriate callback
functions via a call to the setHandlers() object method.
Here's a quick list of the types of events that the parser can handle,
together with a list of their key names (as expected by the
setHandlers() method) and a list of the arguments that the
corresponding callback function will receive.
Key        Arguments to callback                      Event
---------  -----------------------------------------  --------------------------
Final      parser handle                              Document parsing completed
Start      parser handle, element name, attributes    Start tag found
End        parser handle, element name                End tag found
Char       parser handle, CDATA                       CDATA found
Proc       parser handle, PI target, PI data          PI found
Comment    parser handle, comment                     Comment found
Unparsed   parser handle, entity, base, system ID,    Unparsed entity found
           public ID, notation
Notation   parser handle, notation, base, system ID,  Notation found
           public ID
XMLDecl    parser handle, version, encoding,          XML declaration found
           standalone
ExternEnt  parser handle, base, system ID, public ID  External entity found
Default    parser handle, data                        Default handler
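To make the list above a little more concrete, here is a small sketch that registers three of these handlers at once and logs what each one receives. The sample document and the @events array are invented for illustration; the argument order follows the list above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my $xml = '<?xml version="1.0" encoding="UTF-8"?><note><!-- reviewed --><to>Joe</to></note>';

# one log line per event
my @events;

my $xp = XML::Parser->new();
$xp->setHandlers(
    XMLDecl => sub {
        my ($parser, $version, $encoding, $standalone) = @_;
        push @events, "xmldecl version=$version encoding=$encoding";
    },
    Comment => sub {
        my ($parser, $comment) = @_;
        push @events, "comment [$comment]";
    },
    Start => sub {
        my ($parser, $name, %attr) = @_;
        push @events, "start $name";
    },
);
$xp->parse($xml);

print "$_\n" for @events;
```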
Consider the following example, which uses a simple XML document,
<?xml version="1.0"?>
<random>
<?perl print rand(); ?>
</random>
in combination with this Perl script to demonstrate how to handle processing instructions (PIs):
#!/usr/bin/perl
# include package
use XML::Parser;
# initialize parser
$xp = new XML::Parser();
# set PI handler
$xp->setHandlers(Proc => \&pih);
# output some HTML
print "Content-Type: text/html\n\n";
print "<html><head></head><body>And the winning number is: ";
$xp->parsefile("pi.xml");
print "</body></html>";
# this is called whenever a PI is encountered
sub pih
{
# extract data
my ($parser, $target, $data) = @_;
# if Perl command
# (use "eq" for strings; "==" compares numerically and would match any target)
if (lc($target) eq "perl")
{
# execute it
eval($data);
}
}
# end
In this case, the setHandlers() call tells the parser to call the
subroutine pih() when it encounters a processing instruction in the XML
data; this user-defined pih() function is automatically passed the PI
target and the actual command to be executed. Assuming the command is a
Perl command - as indicated by the target name - the function passes it
on to eval() for execution. Since eval() runs whatever code the
document contains, this is only safe with XML from a source you trust.
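If the XML comes from somewhere less trustworthy, one option is to let the PI name a command and look it up in a table of allowed subroutines instead of evaluating it. This is only a sketch: the "app" target, the "random-number" command and the %commands hash are all hypothetical names chosen for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# hypothetical document: the PI names a command instead of carrying code
my $xml = '<random><?app random-number?></random>';

# whitelist of commands the script is willing to run
my %commands = (
    'random-number' => sub { return rand(); },
);

my @results;

sub pih {
    my ($parser, $target, $data) = @_;
    return unless lc($target) eq 'app';
    # trim surrounding whitespace from the PI data
    (my $command = $data) =~ s/^\s+|\s+$//g;
    my $handler = $commands{$command};
    # unknown commands are silently ignored
    push @results, $handler->() if $handler;
}

my $xp = XML::Parser->new();
$xp->setHandlers(Proc => \&pih);
$xp->parse($xml);

print "And the winning number is: $results[0]\n";
```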
Here's another, slightly more complex example using the SAX parser, and one of my favourite meals.
<?xml version="1.0"?>
<recipe>
<name>Chicken Tikka</name>
<author>Anonymous</author>
<date>1 June 1999</date>
<ingredients>
<item>
<desc>Boneless chicken breasts</desc>
<quantity>2</quantity>
</item>
<item>
<desc>Chopped onions</desc>
<quantity>2</quantity>
</item>
<item>
<desc>Ginger</desc>
<quantity>1 tsp</quantity>
</item>
<item>
<desc>Garlic</desc>
<quantity>1 tsp</quantity>
</item>
<item>
<desc>Red chili powder</desc>
<quantity>1 tsp</quantity>
</item>
<item>
<desc>Coriander seeds</desc>
<quantity>1 tsp</quantity>
</item>
<item>
<desc>Lime juice</desc>
<quantity>2 tbsp</quantity>
</item>
<item>
<desc>Butter</desc>
<quantity>1 tbsp</quantity>
</item>
</ingredients>
<servings>
3
</servings>
<process>
<step>Cut chicken into cubes, wash and apply lime juice and salt</step>
<step>Add ginger, garlic, chili, coriander and lime juice in a separate bowl</step>
<step>Mix well, and add chicken to marinate for 3-4 hours</step>
<step>Place chicken pieces on skewers and barbeque</step>
<step>Remove, apply butter, and barbeque again until meat is tender</step>
<step>Garnish with lemon and chopped onions</step>
</process>
</recipe>
This time, my Perl script won't be using an "if" statement when I parse
the file above; instead, I'm going to be keying tag names to values in
a hash. Each of the tags in the XML file above will be replaced with
appropriate HTML markup.
#!/usr/bin/perl
# hash of tag names mapped to HTML markup
# "recipe" => start a new block
# "name" => in bold
# "ingredients" => unordered list
# "desc" => list items
# "process" => ordered list
# "step" => list items
%startTags = (
"recipe" => "<hr>",
"name" => "<font size=+2>",
"date" => "<i>(",
"author" => "<b>",
"servings" => "<i>Serves ",
"ingredients" => "<h3>Ingredients:</h3><ul>",
"desc" => "<li>",
"quantity" => "(",
"process" => "<h3>Preparation:</h3><ol>",
"step" => "<li>"
);
# close tags opened above
%endTags = (
"name" => "</font><br>",
"date" => ")</i>",
"author" => "</b>",
"ingredients" => "</ul>",
"quantity" => ")",
"servings" => "</i>",
"process" => "</ol>"
);
# name of XML file
$file = "recipe.xml";
# this is called when a start tag is found
sub start
{
# extract variables
my ($parser, $name, %attr) = @_;
# lowercase element name
$name = lc($name);
# print corresponding HTML
if ($startTags{$name})
{
print $startTags{$name};
}
}
# this is called when CDATA is found
sub cdata
{
my ($parser, $data) = @_;
print $data;
}
# this is called when an end tag is found
sub end
{
my ($parser, $name) = @_;
$name = lc($name);
if ($endTags{$name})
{
print $endTags{$name};
}
}
# include package
use XML::Parser;
# initialize parser
$xp = new XML::Parser();
# set callback functions
$xp->setHandlers(Start => \&start, End => \&end, Char => \&cdata);
# send standard header to browser
print "Content-Type: text/html\n\n";
# print HTML header
print "<html><head></head><body>";
# parse XML
$xp->parsefile($file);
# print HTML footer
print "</body></html>";
# end
In this case, I've set up two hashes, one for opening tags and one for
closing tags. When the parser encounters an XML tag, it looks up the
hash to see if the tag exists as a key. If it does, the corresponding
value (HTML markup) is printed. This method does away with the slightly
cumbersome branching "if" statements of the previous example, and is
easier to read and understand.
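The same lookup idea can be packaged as a reusable function that returns the HTML as a string instead of printing it. The sketch below assumes a hypothetical helper called render() and a cut-down pair of hashes; it needs Perl 5.10 or later for the // operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# convert XML to HTML using two lookup tables (hypothetical helper name)
sub render {
    my ($xml, $startTags, $endTags) = @_;
    my $html = "";
    my $xp = XML::Parser->new();
    $xp->setHandlers(
        # tags missing from the tables contribute an empty string
        Start => sub { my ($p, $name) = @_; $html .= $startTags->{ lc($name) } // ""; },
        End   => sub { my ($p, $name) = @_; $html .= $endTags->{ lc($name) } // ""; },
        Char  => sub { my ($p, $data) = @_; $html .= $data; },
    );
    $xp->parse($xml);
    return $html;
}

my %start = ("name" => "<b>", "step" => "<li>", "process" => "<ol>");
my %end   = ("name" => "</b>", "process" => "</ol>");

my $out = render(
    '<recipe><name>Chicken Tikka</name><process><step>Mix well</step></process></recipe>',
    \%start, \%end
);
print "$out\n";
```

Because the handlers are closures over $html, no global variables are needed, and changing the output format is a matter of passing in different hashes.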
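The output is an HTML page with the recipe name, author and date, followed by a bulleted list of ingredients, the number of servings, and a numbered list of preparation steps.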
The second approach is DOM, the Document Object Model. Perl has a DOM
parser based on the expat library created by
James Clark; it's implemented as a Perl package named XML::DOM, and
currently maintained by T. J. Mather. Like XML::Parser, it is a separate
package, so if you don't already have it, you
should download and install it before proceeding further; you can get a
copy from CPAN (http://www.cpan.org/).
This DOM parser works by reading an XML document and creating objects
to represent the different parts of that document. Each of these
objects comes with specific methods and properties, which can be used
to manipulate and access information about it. Thus, the entire XML
document is represented as a "tree" of these objects, with the DOM
parser providing a simple API to move between the different branches of
the tree.
The parser itself supports all the different structures typically found
in an XML document - elements, attributes, namespaces, entities,
notations et al - but our focus here will be primarily on elements and
the data contained within them. If you're interested in the more arcane
aspects of XML - as you will have to be to do anything complicated with
the language - the XML::DOM package comes with some truly excellent
documentation, which gets installed when you install the package. Make
it your friend, and you'll find things considerably easier.
Let's start things off with a simple example:
#!/usr/bin/perl
# create an XML-compliant string
$xml = "<?xml version=\"1.0\"?><me><name>Joe Cool</name><age>24</age><sex>male</sex></me>";
# include package
use XML::DOM;
# instantiate parser
$xp = new XML::DOM::Parser();
# parse and create tree
$doc = $xp->parse($xml);
# print tree as string
print $doc->toString();
# end
In this case, a new instance of the parser is created and assigned to
the variable $xp. This object instance can now be used to parse the XML
data via its parse() function:
# instantiate parser
$xp = new XML::DOM::Parser();
# parse and create tree
$doc = $xp->parse($xml);
You'll remember the parse() function from the first part of this
article - it was used by the SAX parser to parse a string. When you
think about it, this isn't really all that remarkable - the XML::DOM
package is built on top of the XML::Parser package, and therefore
inherits many of the latter's methods.
With that in mind, it follows that the DOM parser should also be able
to read an XML file directly, simply by using the parsefile() method,
instead of the parse() method:
#!/usr/bin/perl
# XML file
$file = "me.xml";
# include package
use XML::DOM;
# instantiate parser
$xp = new XML::DOM::Parser();
# parse and create tree
$doc = $xp->parsefile($file);
# print tree as string
print $doc->toString();
# end
The result of successfully parsing an XML document - whether string or
file - is an object representation of the XML document (actually, an
instance of the Document class). In the example above, this object is
called $doc.
# instantiate parser
$xp = new XML::DOM::Parser();
# parse and create tree
$doc = $xp->parsefile($file);
This Document object comes with a bunch of interesting methods - and
one of the more useful ones is the toString() method, which returns the
current document tree as a string. In the examples above, I've used
this method to print the entire document to the console.
# print tree as string
print $doc->toString();
It should be noted that this isn't all that great an example of how to
use the toString() method. Most often, this method is used during
dynamic XML tree generation, when an XML tree is constructed in memory
from a database or elsewhere. In such situations, the toString() method
comes in handy to write the final XML tree to a file or send it to a
parser for further processing.
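Here is a rough sketch of that more typical use: a tree is grown in memory and then serialized with toString(). It assumes the standard DOM methods createElement(), createTextNode() and appendChild() as implemented by XML::DOM; the @fields data stands in for rows fetched from a database.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

# start from an empty root element and grow the tree in memory
my $xp   = XML::DOM::Parser->new();
my $doc  = $xp->parse('<me/>');
my $root = $doc->getDocumentElement();

# illustrative data, standing in for rows fetched from a database
my @fields = (["name", "Joe Cool"], ["age", "24"]);

foreach my $field (@fields) {
    my ($tag, $value) = @$field;
    my $element = $doc->createElement($tag);
    $element->appendChild($doc->createTextNode($value));
    $root->appendChild($element);
}

# serialize the finished tree
my $xml = $root->toString();
print "$xml\n";
```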
The Document object comes with another useful method, one which enables
you to gain access to information about the document's XML version and
character encoding. It's called the getXMLDecl() method, and it returns
yet another object, this one representing the standard XML declaration
that appears at the top of every XML document. Take a look:
#!/usr/bin/perl
# create an XML-compliant string
$xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><me><name>Joe Cool</name><age>24</age><sex>male</sex></me>";
# include package
use XML::DOM;
# instantiate parser
$xp = new XML::DOM::Parser();
# parse and create tree
$doc = $xp->parse($xml);
# get XML PI
$decl = $doc->getXMLDecl();
# get XML version
print $decl->getVersion();
# get encoding
print $decl->getEncoding();
# get whether standalone
print $decl->getStandalone();
# end
As you can see, the newly-created XMLDecl object comes with a bunch of
object methods of its own. These methods provide a simple way to access
the document's XML version, character encoding and standalone status.
Using the Document object, it's also possible to obtain references to
other nodes in the XML tree, and manipulate them using standard
methods. Since the entire document is represented as a tree, the first
step is always to obtain a reference to the tree root, or the outermost
document element, and use this as a stepping stone to other, deeper
branches. Consider the following example, which demonstrates how to do
this:
#!/usr/bin/perl
# create an XML-compliant string
$xml = "<?xml version=\"1.0\"?><me><name>Joe Cool</name><age>24</age><sex>male</sex></me>";
# include package
use XML::DOM;
# instantiate parser
$xp = new XML::DOM::Parser();
# parse and create tree
$doc = $xp->parse($xml);
# get root node "me"
$root = $doc->getDocumentElement();
# end
Another option here would be to use the getChildNodes() method, which is a
common method available to every single node in the document tree. For
this document, the following code snippet is equivalent to the one above:
#!/usr/bin/perl
# create an XML-compliant string
$xml = "<?xml version=\"1.0\"?><me><name>Joe Cool</name><age>24</age><sex>male</sex></me>";
# include package
use XML::DOM;
# instantiate parser
$xp = new XML::DOM::Parser();
# parse and create tree
$doc = $xp->parse($xml);
# get root node "me"
@children = $doc->getChildNodes();
$root = $children[0];
# end
Note that the getChildNodes() method returns an array of nodes under
the current node; each of these nodes is again an object instance of
the Node class, and comes with methods to access the node name, type
and content. Let's look at that next.
Once you've obtained a reference to a node, a number of other methods
become available to help you obtain the name and value of that node, as
well as references to parent and child nodes. Take a look:
#!/usr/bin/perl
# create an XML-compliant string
$xml = "<?xml version=\"1.0\"?><me><name>Joe Cool</name><age>24</age><sex>male</sex></me>";
# include package
use XML::DOM;
# instantiate parser
$xp = new XML::DOM::Parser();
# parse and create tree
$doc = $xp->parse($xml);
# get root node
$root = $doc->getDocumentElement();
# get name of root node
# returns "me"
print $root->getNodeName();
# get children as array
@children = $root->getChildNodes();
# this is the "name" element under "me"
# I could also have used $root->getFirstChild() to get here
$firstChild = $children[0];
# returns "name"
print $firstChild->getNodeName();
# returns "1"
print $firstChild->getNodeType();
# now to access the value of the text node under "name"
$text = $firstChild->getFirstChild();
# returns "Joe Cool"
print $text->getData();
# returns "#text"
print $text->getNodeName();
# returns "3"
print $text->getNodeType();
# go back up the tree
# start from the "name" element and get its parent
$parent = $firstChild->getParentNode();
# check the name - it should be "me"
# yes it is!
print $parent->getNodeName();
# end
As you can see, the getNodeName() and getNodeType() methods provide
access to basic information about the node currently under examination.
The children of this node can be obtained with the getChildNodes()
method previously discussed, and node parents can be obtained with the
getParentNode() method. It's fairly simple, and - once you play with it
a little - you'll get the hang of how it works.
A quick note on the getNodeType() method above: every node is of a
specific type, and this method returns a numeric code corresponding
to the type (1 for an element node and 3 for a text node, as in the
script above). A complete list of defined types is available in the Perl
documentation for the XML::DOM package.
Note also that the text within an element's opening and closing tags is
treated as a child node of the corresponding element node, and is
returned as an object. This object comes with a getData() method, which
returns the actual content nested within the element's opening and
closing tags. You'll see this again a little further down.
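Rather than hard-coding the numbers, a script can compare against named constants. The sketch below assumes that XML::DOM exports ELEMENT_NODE and TEXT_NODE by default, as its documentation describes; the @summary array is just for illustration. Note the newlines in the sample document, which show up as extra text nodes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

my $xp  = XML::DOM::Parser->new();
my $doc = $xp->parse("<me>\n<name>Joe Cool</name>\n<age>24</age>\n</me>");

# describe each child of the root by its node type
my @summary;
foreach my $node ($doc->getDocumentElement()->getChildNodes()) {
    if ($node->getNodeType() == ELEMENT_NODE) {
        push @summary, "element " . $node->getNodeName();
    }
    elsif ($node->getNodeType() == TEXT_NODE) {
        push @summary, "text";
    }
}

print join(", ", @summary), "\n";
```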
Just as it's possible to access elements and their content, it's also
possible to access element attributes and their values. The
getAttributes() method of the Node object provides access to a list of
all available attributes, and the getNamedItem() and getValue() methods
make it possible to access specific attributes and their values. Take a
look at a demonstration of how it all works:
#!/usr/bin/perl
# create an XML-compliant string
$xml = "<?xml version=\"1.0\"?><me species=\"human\"><name>Joe Cool</name><age>24</age><sex>male</sex></me>";
# include package
use XML::DOM;
# instantiate parser
$xp = new XML::DOM::Parser();
# parse and create tree
$doc = $xp->parse($xml);
# get root node (Node object)
$root = $doc->getDocumentElement();
# get attributes (NamedNodeMap object)
$attribs = $root->getAttributes();
# get specific attribute (Attr object)
$species = $attribs->getNamedItem("species");
# get value of attribute
# returns "human"
print $species->getValue();
# end
Getting to an attribute value is a little more complicated than getting to an element. But hey - no gain without pain, right?
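For the common case of reading a single attribute, there is a shorter route. The following sketch assumes the standard DOM getAttribute() method on element nodes, which XML::DOM implements, and compares it with the long form used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

my $xp   = XML::DOM::Parser->new();
my $doc  = $xp->parse('<me species="human"><name>Joe Cool</name></me>');
my $root = $doc->getDocumentElement();

# the long way, via the NamedNodeMap
my $long = $root->getAttributes()->getNamedItem("species")->getValue();

# the shortcut on the element itself
my $short = $root->getAttribute("species");

print "$long / $short\n";
```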
Using this information, it's pretty easy to re-create our first example using the DOM parser. Here's the XML data,
<?xml version="1.0"?>
<library>
<book>
<title>Dreamcatcher</title>
<author>Stephen King</author>
<genre>Horror</genre>
<pages>899</pages>
<price>23.99</price>
<rating>5</rating>
</book>
<book>
<title>Mystic River</title>
<author>Dennis Lehane</author>
<genre>Thriller</genre>
<pages>390</pages>
<price>17.49</price>
<rating>4</rating>
</book>
<book>
<title>The Lord Of The Rings</title>
<author>J. R. R. Tolkien</author>
<genre>Fantasy</genre>
<pages>3489</pages>
<price>10.99</price>
<rating>5</rating>
</book>
</library>
and here's the script which does all the work.
#!/usr/bin/perl
# XML file
$file = "library.xml";
# array of ratings
@ratings = ("Words fail me!", "Terrible", "Bad", "Indifferent", "Good", "Excellent");
# include package
use XML::DOM;
# instantiate parser
$xp = new XML::DOM::Parser();
# parse and create tree
$doc = $xp->parsefile($file);
# set up HTML page
print "Content-Type: text/html\n\n";
print "<html><head></head><body>";
print "<h2>The Library</h2>";
print "<table border=1 cellspacing=1 cellpadding=5> <tr> <td align=center>Title</td> <td align=center>Author</td> <td align=center>Price</td> <td align=center>User Rating</td> </tr>";
# get root node
$root = $doc->getDocumentElement();
# get children
@books = $root->getChildNodes();
# iterate through book list
foreach $node (@books)
{
# if element node
# (this skips the whitespace text nodes between books,
# which would otherwise produce empty table rows)
if ($node->getNodeType() == 1)
{
print "<tr>";
# get children
# this is the "title", "author"... level
@children = $node->getChildNodes();
# iterate through child nodes
foreach $item (@children)
{
# check element name
if (lc($item->getNodeName) eq "title")
{
# print text node contents under this element
print "<td><i>" . $item->getFirstChild()->getData . "</i></td>";
}
elsif (lc($item->getNodeName) eq "author")
{
print "<td>" . $item->getFirstChild()->getData . "</td>";
}
elsif (lc($item->getNodeName) eq "price")
{
print "<td>\$" . $item->getFirstChild()->getData . "</td>";
}
elsif (lc($item->getNodeName) eq "rating")
{
$num = $item->getFirstChild()->getData;
print "<td>" . $ratings[$num] . "</td>";
}
}
print "</tr>";
}
}
print "</table></body></html>";
# end
This may appear complex, but it isn't really all that hard to
understand. I've first obtained a reference to the root of the document
tree, $root, and then to the children of that root node; these children
are returned as a regular Perl array. I've then used a "foreach" loop
to iterate through the array, navigate to the next level, and print the
content found in the nodes, with appropriate formatting. The numerous
"if" statements you see are needed to check the name of each node and
then add appropriate HTML formatting to it.
As explained earlier, the data itself is treated as a child text node
of the corresponding element node. Therefore, whenever I find an
element node, I've used the node's getFirstChild() method to access the
text node under it, and the getData() method to extract the data from
that text node.
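When only certain elements matter, walking every level by hand is not strictly necessary. As a sketch, the standard DOM method getElementsByTagName(), which XML::DOM implements, can collect matching elements from anywhere in the tree; the shortened library document and the @titles array are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

my $xml = '<library>'
        . '<book><title>Dreamcatcher</title><author>Stephen King</author></book>'
        . '<book><title>Mystic River</title><author>Dennis Lehane</author></book>'
        . '</library>';

my $xp  = XML::DOM::Parser->new();
my $doc = $xp->parse($xml);

# collect every <title> element, wherever it sits in the tree
my $nodes = $doc->getElementsByTagName("title");

my @titles;
for (my $i = 0; $i < $nodes->getLength(); $i++) {
    my $title = $nodes->item($i);
    # the text itself lives in a child text node
    push @titles, $title->getFirstChild()->getData();
}

print join("; ", @titles), "\n";
```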
The result is the same table that the SAX version produced.
I can do the same thing with the second example as well. However, since
there are quite a few levels to the document tree, I've decided to use
a recursive function to iterate through the tree, rather than a series
of "if" statements.
Here's the XML file,
<?xml version="1.0"?>
<recipe>
<name>Chicken Tikka</name>
<author>Anonymous</author>
<date>1 June 1999</date>
<ingredients>
<item>
<desc>Boneless chicken breasts</desc>
<quantity>2</quantity>
</item>
<item>
<desc>Chopped onions</desc>
<quantity>2</quantity>
</item>
<item>
<desc>Ginger</desc>
<quantity>1 tsp</quantity>
</item>
<item>
<desc>Garlic</desc>
<quantity>1 tsp</quantity>
</item>
<item>
<desc>Red chili powder</desc>
<quantity>1 tsp</quantity>
</item>
<item>
<desc>Coriander seeds</desc>
<quantity>1 tsp</quantity>
</item>
<item>
<desc>Lime juice</desc>
<quantity>2 tbsp</quantity>
</item>
<item>
<desc>Butter</desc>
<quantity>1 tbsp</quantity>
</item>
</ingredients>
<servings>
3
</servings>
<process>
<step>Cut chicken into cubes, wash and apply lime juice and salt</step>
<step>Add ginger, garlic, chili, coriander and lime juice in a separate
bowl</step>
<step>Mix well, and add chicken to marinate for 3-4 hours</step>
<step>Place chicken pieces on skewers and barbeque</step>
<step>Remove, apply butter, and barbeque again until meat is tender</step>
<step>Garnish with lemon and chopped onions</step>
</process>
</recipe>
and here's the script which parses it.
#!/usr/bin/perl
# XML file
$file = "recipe.xml";
# hash of tag names mapped to HTML markup
# "recipe" => start a new block
# "name" => in bold
# "ingredients" => unordered list
# "desc" => list items
# "process" => ordered list
# "step" => list items
%startTags = (
"name" => "<font size=+2>",
"date" => "<i>(",
"author" => "<b>",
"servings" => "<i>Serves ",
"ingredients" => "<h3>Ingredients:</h3><ul>",
"desc" => "<li>",
"quantity" => "(",
"process" => "<h3>Preparation:</h3><ol>",
"step" => "<li>"
);
# close tags opened above
%endTags = (
"name" => "</font><br>",
"date" => ")</i>",
"author" => "</b>",
"ingredients" => "</ul>",
"quantity" => ")",
"servings" => "</i>",
"process" => "</ol>"
);
# this function accepts an array of nodes as argument,
# iterates through it and prints HTML markup for each tag it finds.
# for each node in the array, it then gets an array of the node's children, and
# calls itself again with the array as argument (recursion)
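sub printData
{
my (@nodeCollection) = @_;
foreach my $node (@nodeCollection)
{
my $name = $node->getNodeName();
# opening markup for this tag, if the hash has an entry for it
print $startTags{$name} if exists $startTags{$name};
# print the element's leading text, but only if its first child
# really is a text node (node type 3)
my $first = $node->getFirstChild();
print $first->getData() if defined $first && $first->getNodeType() == 3;
# recurse into the child elements
my @children = &getChildren($node);
printData(@children);
# closing markup for this tag, if any
print $endTags{$name} if exists $endTags{$name};
}
}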
# this function accepts a node
# and returns all the element nodes under it (its children)
# as an array
sub getChildren
{
my ($node) = @_;
# get children of this node
my @temp = $node->getChildNodes();
my @collection;
# iterate through children
foreach my $item (@temp)
{
# if this is an element (node type 1)
# (need this to strip out text nodes containing whitespace)
if ($item->getNodeType() == 1)
{
# add it to the @collection array
push @collection, $item;
}
}
# return node collection
return @collection;
}
use XML::DOM;
# instantiate parser
$xp = new XML::DOM::Parser();
# parse and create tree
$doc = $xp->parsefile($file);
# send standard header to browser
print "Content-Type: text/html\n\n";
# print HTML header
print "<html><head></head><body><hr>";
# get root node
$root = $doc->getDocumentElement();
# get children
@children = &getChildren($root);
# run a recursive function starting here
&printData(@children);
print "</body></html>";
# end
In this case, I've utilized a slightly different method to mark up the
XML. I've first initialized a couple of hashes to map XML tags to
corresponding HTML markup, in much the same manner as I did last time.
Next, I've used DOM functions to obtain a reference to the first set of
child nodes in the DOM tree.
This initial array of child nodes is used to "seed" my printData()
function, a recursive function which takes an array of child nodes,
matches their tag names to values in the associative arrays, and
outputs the corresponding HTML markup to the browser. It also obtains a
reference to the next set of child nodes, via the getChildren()
function, and calls itself with the new node collection as argument.
By using this recursive function, I've managed to substantially reduce
the number of "if" conditional statements in my script; the code is now
easier to read, and also structured more logically.
Here's what it looks like:
As you can see, you can parse a document using either DOM or SAX, and
achieve the same result. The difference is that the DOM parser is a
little slower, since it has to build a complete tree of the XML data
before any processing can begin, whereas the SAX parser is faster, since
it simply calls a function each time it encounters a tag and then moves
on without storing anything. You should experiment with both
methods to see which one works better for you.
There's another important difference between the two techniques. The
SAX approach is event-centric - as the parser travels through the
document, it executes specific functions depending on what it finds.
Additionally, the SAX approach is sequential - tags are parsed one
after the other, in the sequence in which they appear. Both these
features add to the speed of the parser; however, they also limit its
flexibility, since it cannot jump directly to an arbitrary node of the
document.
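To make the event-driven, sequential behaviour concrete, here is a minimal sketch using XML::Parser. The inline sample document, the handler names and the @events array are assumptions made up for this illustration; they are not part of the recipe script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# a tiny inline document modelled on the recipe file (assumed sample data)
my $xml = <<'XML';
<ingredients>
<item><desc>Ginger</desc><quantity>1 tsp</quantity></item>
<item><desc>Butter</desc><quantity>1 tbsp</quantity></item>
</ingredients>
XML

my @events;          # events recorded in the order the parser fires them
my $current = "";    # name of the element currently being read

# called for every opening tag
sub start_handler {
    my ($expat, $name) = @_;
    $current = $name;
    push @events, "start:$name";
}

# called for every closing tag
sub end_handler {
    my ($expat, $name) = @_;
    $current = "";
    push @events, "end:$name";
}

# called for character data; only text inside <desc> is recorded
sub char_handler {
    my ($expat, $text) = @_;
    push @events, "text:$text" if $current eq "desc" && $text =~ /\S/;
}

my $parser = XML::Parser->new(Handlers => {
    Start => \&start_handler,
    End   => \&end_handler,
    Char  => \&char_handler,
});
$parser->parse($xml);

# the list is in document order; no tree was ever built
print "$_\n" foreach @events;
```

Note that the handlers have to track state themselves (the $current variable), because the parser remembers nothing about tags it has already passed.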
As opposed to this, the DOM approach builds a complete tree of the
document in memory, making it possible to easily move from one node to
another (in a non-sequential manner). Since the parser has the
additional overhead of maintaining the tree structure in memory, speed
is an issue here; however, navigation between the various "branches" of
the tree is easier. Since the approach is not dependent on events,
developers need to use the exposed methods and attributes of the
various DOM objects to process the XML data.
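By way of contrast, the following sketch shows the kind of non-sequential access the DOM tree allows, using XML::DOM. The inline sample document and the variable names are assumptions chosen for this illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

# a tiny inline document modelled on the recipe file (assumed sample data)
my $xml = <<'XML';
<ingredients>
<item><desc>Ginger</desc><quantity>1 tsp</quantity></item>
<item><desc>Butter</desc><quantity>1 tbsp</quantity></item>
</ingredients>
XML

# parse the string and build the complete tree in memory
my $parser = XML::DOM::Parser->new();
my $doc    = $parser->parse($xml);

# jump straight to the last <item>, without handling the ones before it
my $items = $doc->getElementsByTagName("item");
my $last  = $items->item($items->getLength() - 1);

# move down the tree: the <desc> element inside that item, and its text
my $descList = $last->getElementsByTagName("desc");
my $desc     = $descList->item(0);
my $descText = $desc->getFirstChild()->getData();

# move up the tree: the element that contains the item
my $parentName = $last->getParentNode()->getNodeName();

print "$descText is listed under <$parentName>\n";

# XML::DOM trees contain circular references; release them explicitly
$doc->dispose();
```

The price of this convenience is that the whole document must be parsed and held in memory before the first lookup can run.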
That just about concludes this little tour of parsing XML data with
Perl. I've tried to keep it as simple as possible, and there are
numerous aspects of XML I haven't covered here. If you're interested in
learning more about XML and XSL, you should visit the following links:
The XML specification, at http://www.w3.org/TR/2000/REC-xml-20001006
The XSLT specification, at http://www.w3.org/TR/xslt.html
The SAX project, at http://www.saxproject.org/
The W3C's DOM specification, at http://www.w3.org/DOM/
A number of developers have built and released Perl packages to handle
XML data - if you're ever on a tight deadline, using these packages
might save you some development time. Take a look at the following
links for more information:
The Perl XML module list, at http://www.perlxml.com/modules/perl-xml-modules.html
CPAN, at http://www.cpan.org/
The Perl XML FAQ, at http://www.perlxml.com/faq/perl-xml-faq.html