DaveNet : XML Parsers

XML Parsers

Thursday, June 4, 1998 by Dave Winer.

Good morning!

A brief DaveNet following a discussion on the general XML mail list.

http://listserv.hea.ie/lists/xml-l.html

It's an interesting list, and getting more interesting now that two of the biggest names in XML, Eve Maler and Tim Bray, are actively participating in the discussion.

Who am I?

I've never introduced myself to the leading developers in the XML world. I'd like to fix that!

First, I'm a web developer. I manage several home pages, and about 30,000 sub-pages, on my site. And I make tools for other web developers, who have many more home pages and sub-pages.

What XML means to me

I see XML as a way of adding structure to the web. I dig structure, it's my friend. It allows me to set things up so that with one command I can take plain content and make it look beautiful.

I like to change something in one place and have the change percolate everywhere. Just as a spreadsheet recalculates numbers, I can recalculate the presentation of stories and information. Extending that way of doing things to XML is a natural thing; we're totally ready for this.

Today we generate HTML and live with its limits. If XML happens, with all that it promises, including vector graphics, it could revolutionize our world, if the standards are strong and if developers have real choices.

We're also using XML to build higher-level networked systems that are easy for web developers to understand and extend. The XML is actually invisible, but it's discoverable. To me, that's the essence of the web. There's a View Source command, you can lift the hood and replace any software or run it on any operating system, as long as you generate and understand the same tagged text and support the basic communication protocol of the Internet, HTTP.

So to me, the value of XML is structure, simplicity, discoverability and new power thru compatibility. This is where it has the power to be as big a force as HTML, maybe even bigger, but to do so it must avoid the problems that were not avoided in the evolution of HTML.

The mess in HTML browsers

There are two big HTML parsers, the one in Netscape Navigator and the one in Microsoft Internet Explorer. You have to check to see how each of these programs displays the stuff you're producing. The parser is at the center of how well each browser works, or doesn't. You may find that one browser displays a page perfectly and the other one mangles it.

Getting a site ready to ship is a negotiation process, a least common denominator workaround of the problems with one browser in ways that don't break the other one. It's not a nice world to develop in. You waste a lot of time and always settle for non-optimal HTML.

But that's how web developers work. The only validators that count are Navigator and Internet Explorer.

XML parsers

Every program that understands XML must also have a parser. The parser takes tagged text and compiles it into a structure that software, written in C, Java or a scripting language, can walk thru and operate on in some way. It's the quality of these parsers that determines whether the XML standard will be stronger than HTML.
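To make "compiles it into a structure that software can walk thru" concrete, here is a minimal sketch in Python using its standard-library parser. The tagged text and the walk function are made up for illustration; they are not taken from any of the parsers discussed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged text, invented for this illustration.
TAGGED_TEXT = """<site name="example">
  <page title="Home"><para>Good morning!</para></page>
  <page title="About"><para>Who am I?</para></page>
</site>"""


def walk(element, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(walk(child, depth + 1))
    return lines


# The parser turns the text into a tree; the script then walks that tree.
root = ET.fromstring(TAGGED_TEXT)
outline = walk(root)
print("\n".join(outline))
```

Once the text is a tree, the software never looks at angle brackets again. It only asks for tags, attributes and children.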

Surely some parsers will be more popular than the others. Microsoft's XML parser, for example, will presumably be baked into all their products, from servers, to databases, content development tools and browsers. Their parser will be pretty close to ubiquitous. They may rightly feel that they define the real-world, least common subset of XML.

Another parser is available from James Clark, a longtime SGML developer. Like the Microsoft parser, Clark's is available to be included in other software. And it's likely to be popular since it's the one that Netscape is including in the next release of Navigator.

At UserLand we've written our own XML parser. We had to, because Microsoft's isn't available in source code and Clark's parser wasn't complete when we needed it earlier this year. We couldn't wait because we wanted to deploy a remote procedure calling protocol built on XML. It's part of the next release of Frontier, and we're already routinely building applications on top of it.

What will other developers do? Some will use one of the publicly available parsers, and some will cook their own. Even inside large companies I've seen developers use more than one XML parser strategy, and different source code, for different products.

The genie's almost out

How will we know that each of the parsers can understand what other software understands? The answer is disturbing -- unless we, as an industry, take some steps now, there will be no guarantee. That means XML will head down the same path as CORBA and HTML, with compatibility expressed in terms of products instead of standards.

Once the genie is out of the bottle it's impossible to put it back. Surely Microsoft and Netscape could invest the engineering resources to straighten out their HTML parsers now. But if they did, it would break all the pages that have been coded to work around their errors. It's the usual software thing, bugs turn into features, unless you catch them at the beginning.

XML is at its beginning and there's very little content. There's a simple solution to the problem, but the solution must be implemented with people, not software or standards.

Rating and testing the parsers

We need an independent and objective rating service that tests each of the XML parsers as they become available.

Each parser is run thru a set of tests. Does the parser catch the errors? How clear are the error messages? Another set of tests explores the fringes of XML, the parts of XML that many parsers are not handling, or are not handling correctly. Examples of these areas include attributes and namespaces.
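A rough sketch of what such a test run could look like, written in Python against its standard-library parser. The test cases are hypothetical ones I made up to cover error catching, attributes and namespaces. A real suite would be far larger and would be run against each parser under review.

```python
import xml.etree.ElementTree as ET

# Hypothetical cases: (name, tagged text, should the parser accept it?)
CASES = [
    ("well-formed", "<a><b/></a>", True),
    ("mismatched tags", "<a><b></a></b>", False),
    ("unquoted attribute", "<a x=1/>", False),
    ("two root elements", "<a/><b/>", False),
    ("declared namespace prefix", '<a xmlns:n="urn:example"><n:b/></a>', True),
    ("undeclared namespace prefix", "<a><n:b/></a>", False),
]


def run_case(text):
    """Return (accepted, error message) for one piece of tagged text."""
    try:
        ET.fromstring(text)
        return True, ""
    except ET.ParseError as err:
        return False, str(err)


results = []
for name, text, expected in CASES:
    accepted, message = run_case(text)
    results.append((name, accepted == expected, message))

for name, passed, message in results:
    status = "pass" if passed else "FAIL"
    print("%-28s %s  %s" % (name, status, message))
```

The error message column matters as much as the pass column, since "how clear are the error messages?" is one of the questions a rating service would answer.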

We should also have standardized performance testing as we do with other kinds of software. Assuming a parser is correct, how fast is it compared to other parsers? If I have a choice between Clark's parser or Microsoft's, which is faster and uses memory more efficiently? That's how every developer will make their decision on whether to use an existing parser or create their own.
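Here is a minimal sketch of a speed comparison, assuming two parsers that happen to ship with Python stand in for the ones named above. The generated document and the repeat count are arbitrary choices of mine, and memory use would need a separate measurement.

```python
import time
import xml.dom.minidom
import xml.etree.ElementTree as ET


def make_document(pages):
    """Build a synthetic document with the given number of page elements."""
    body = "".join(
        '<page id="%d"><para>text</para></page>' % i for i in range(pages)
    )
    return "<site>" + body + "</site>"


# Stand-ins for the parsers being compared.
PARSERS = {
    "ElementTree": ET.fromstring,
    "minidom": xml.dom.minidom.parseString,
}


def benchmark(text, repeats=5):
    """Return the best wall-clock time in seconds for each parser."""
    timings = {}
    for name, parse in PARSERS.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            parse(text)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


timings = benchmark(make_document(2000))
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print("%-12s %.4f s" % (name, seconds))
```

The numbers only mean something if every parser is fed the same documents on the same machine, which is the argument for a common suite.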

Another rating: what does the programming interface look like? Does it implement an understandable callback structure? Does it generate an easy-to-navigate in-memory tree? What languages can you use to walk the structure?
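The two interface styles can be sketched side by side in Python, whose standard library has both. In the callback style the parser calls your code as it reads each tag. In the tree style you get the whole structure back and walk it yourself. The sample text and the TitleCollector class are hypothetical.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical tagged text, invented for this illustration.
TEXT = '<site><page title="Home"/><page title="About"/></site>'


class TitleCollector(xml.sax.ContentHandler):
    """Callback style: the parser calls startElement for every opening tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        if name == "page":
            self.titles.append(attrs.get("title"))


collector = TitleCollector()
xml.sax.parseString(TEXT.encode("utf-8"), collector)

# Tree style: parse everything first, then navigate the in-memory tree.
tree_titles = [page.get("title") for page in ET.fromstring(TEXT).iter("page")]

print(collector.titles)
print(tree_titles)
```

Both produce the same list of titles. The callback version never holds the whole document in memory, while the tree version is easier to navigate when you need to look around.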

A publishing company could do it

One of the publishing companies (Ziff-Davis, IDG or CMP come to mind) could very inexpensively buy itself a strong position in the development of XML as a standard by establishing such a rating service, publishing the results as a website that's kept current, and also (of course) making the results available in XML so we can write agents that keep track of the competition. It would be a fun way to start using XML right now; no chicken and egg here.
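As a sketch of the agent idea, suppose the rating service published its results in a format like the one below. The format, the element and attribute names, and the parser names and scores are all invented for illustration; no such service or format exists yet.

```python
import xml.etree.ElementTree as ET

# Hypothetical published results; every name and number here is made up.
RATINGS = """<ratings>
  <parser name="parser-a" passed="48" total="50"/>
  <parser name="parser-b" passed="41" total="50"/>
</ratings>"""


def pass_rates(text):
    """Map each parser name to the fraction of tests it passed."""
    root = ET.fromstring(text)
    return {
        parser.get("name"): int(parser.get("passed")) / int(parser.get("total"))
        for parser in root.iter("parser")
    }


rates = pass_rates(RATINGS)
best = max(rates, key=rates.get)
print("Highest pass rate:", best)
```

An agent like this could fetch the results on a schedule and flag any change in the standings.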

Ziff-Davis's benchmarks site

Ziff-Davis has a site that explains how they do benchmarks for hardware and other kinds of software:

http://www.zdnet.com/zdbop/benchmk.html

It would be great to see a common suite of tests that they run against XML parsers.

Grounding

XML, as it is currently evolving, is missing real-world grounding.

If XML doesn't gain traction, none of this is a problem, but incompatibility between different tools could be enough to stop it from gaining traction.

So an independent testing and evaluation service is a central, and missing, part of the XML world.

Dave Winer