December 10, 1996

W3C publishes draft of simplified SGML

At last a sensible way to extend HTML

On the tenth anniversary of the adoption of SGML as an ISO standard, a band of SGML experts announced they have drafted a simplified subset of the language they hope will be adopted as a standard method of extending HTML to accommodate user-defined tags and attributes. The new language, Extensible Markup Language, or XML, was prepared by a World Wide Web Consortium SGML working group and announced at the GCA SGML '96 conference held last month in Boston.

The draft XML specification is the culmination of an intense 11-week collaboration by a working group of 80 SGML experts, representing vendors, users and consultants. The group was led by Jon Bosak of Sun, who is also working on an online variation of DSSSL, the style sheet language for SGML documents.

The first published draft is publicly available on the Web at www.w3.org/pub/WWW/TR/WD-xml-961114.html

A more recent version, corresponding to that distributed at the conference, and in several formats including RTF and PostScript, is available at www.textuality.com/sgml-erb/


The SGML Web page run by Robin Cover has a whole section devoted to XML: http://www.sil.org/sgml/related.html#xmlSummary

Somebody had to do it. Addressing a standing-room-only crowd at the SGML conference, Bosak admitted that he was motivated to lead the effort by fear as much as anything else: "I guarantee you that if we had not developed [a way to extend HTML], someone would have." Bosak postulated that if the SGML community had not offered a way to extend HTML, Web tool vendors would have developed another method (one not compatible with SGML) within six months.

Explaining the need for extensible markup, Bosak cited two examples. Within Sun, the documentation group was wrestling with how to create a dynamic table of contents in HTML, mimicking the behavior of some electronic books, such as those in EBT's DynaText. And a speech synthesis vendor was desperately trying to embed markup that its tool could recognize by using unique attributes in HTML tags. Both vendors had to rely heavily on comments, which the browser ignores but a special application could be written to recognize.

No doubt many readers have noticed the recent trend among Web-enabled document databases to make use of special tags that are resolved at the server, rather than the browser.

In fact, there are an unlimited number of applications for user-defined tags. One need look no further than the word processors and page layout programs we use today to realize how ludicrous it is to restrict all of our Web documents to one tagset. Imagine having to create all of your documents with one set of style tags. (The effect is much like today's e-mail programs -- an editor for a single document type. We don't use them for anything else; no wonder attachments have become so widespread.)

What is XML? XML, like SGML, is a meta-language for describing the markup of different types of documents. It is simpler than SGML, reducing a 500-page reference down to 26 pages. The W3C hopes that offering a simplified version of SGML will make implementing SGML much more palatable to vendors of Web authoring and browsing tools.

Unlike HTML, which has a fixed (albeit changing) set of tags, XML lets you define your own tags and attributes. Support for XML by the Internet community would open up vast new possibilities for Internet publishing: instead of shoehorning all documents into HTML, or having to invent a browser to handle non-HTML documents, XML would enable a wide array of documents with user-defined tagsets to be handled by generic Web application software.

Users of SGML can easily make use of XML. XML is a valid subset of SGML, so translation from SGML to XML is very straightforward.

To simplify SGML, the W3C working group dropped support for certain features that put a heavy processing burden on SGML client software. For example, a well-formed XML document is unambiguous, so that a browser or editor can read the tags and create a tree of the hierarchical structure without having to read a document type definition. XML also does not allow markup minimization, requires that empty elements be self-identifying and does not support several of the other complex optional features of the SGML standard.
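The point about reading a well-formed document without a document type definition can be illustrated with a minimal sketch. It assumes a present-day Python standard-library parser, which implements the later XML 1.0 Recommendation rather than the 1996 draft discussed here, and the tag and attribute names in the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every tag and attribute name is made up for this sketch.
# Note the self-identifying empty element <pagebreak/> and the explicit end tags.
DOC = """<manual>
  <chapter id="c1">
    <title>Getting Started</title>
    <pagebreak/>
    <section><title>Installing</title></section>
  </chapter>
</manual>"""


def outline(element, depth=0):
    """Return one indented line per element, walking the tree depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


def is_well_formed(text):
    """Report whether the parser accepts the text; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(DOC)  # the tree is built from the tags alone
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Because markup minimization is not allowed, a document that omits an end tag, such as `<p>omitted end tag`, is rejected by `is_well_formed` rather than being repaired by guesswork.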

One potential hurdle for vendors to clear is XML's use of Unicode as its character set. For Internet publishing, which is, after all, global, adopting Unicode makes a lot of sense. But there is still a scarcity of browsers and fonts that support the full Unicode character set. One short-term fix for browsers would be a shim that turns the double-byte Unicode characters into entity references. (Today's ASCII editors will be able to create valid XML documents, because plain ASCII text is already valid UTF-8: ASCII is simply the first 128 characters of the Unicode character set, and UTF-8 encodes them byte for byte the same way.)
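A shim of the kind described might look like the following sketch. It is an assumption about how such a filter could work, not a description of any vendor's tool: it rewrites every non-ASCII character as a numeric character reference, and the function name `to_ascii_with_references` is hypothetical. It also checks the claim about ASCII and UTF-8.

```python
def to_ascii_with_references(text):
    """Replace each non-ASCII character with a numeric reference such as &#233;."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "caf\u00e9 \u03a9"  # "café Ω", written with escapes to keep this file ASCII
converted = to_ascii_with_references(sample)
print(converted)

# Pure ASCII text produces identical bytes under ASCII and UTF-8,
# which is why an ASCII editor can already produce acceptable input.
plain = "<title>Plain ASCII</title>"
same_bytes = plain.encode("ascii") == plain.encode("utf-8")
print(same_bytes)
```

The output of the shim is plain ASCII, so a browser with no Unicode fonts can at least carry the text through unchanged and resolve the references later.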

For documents with Latin character sets there are hundreds of Unicode fonts available. In the long run, XML could help jump start the development of the next generation of multilingual document processing tools.

Don't we have Java? Yes, it's true that one can author rich documents in an application and then use a Java applet viewer to attach those documents to Web pages. (This is what FutureTense Texture, SoftPress UniQorn and Corel Ventura do.) As long as the browsers continue to provide only crude formatting, we may be forced to resort to such measures, just as we use page makeup programs to get better typography than can be done with word processors.

But there's no reason why the basic Web page needs to be limited to a single tag set. The appeal of the Web is its simple hypertext scheme, which provides a simple, unambiguous method of pointing to files with unique names. It's handy that HTML is also simple, but the success of word processors has shown that consumers can easily cope with multiple document types.

Indeed, if XML is successful, it seems reasonable to predict that Web authoring tools will become much more flexible in handling basic document constructs. One can imagine exporting from Word or WordPerfect into XML, using the style names as tags instead of filtering everything into 90 (or 100 or 200, etc.) predefined tags.
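The export idea above can be sketched as follows. This is only an illustration under stated assumptions: the paragraph list stands in for whatever a word processor would supply, the style names are invented, and `style_to_tag` and `export` are hypothetical helpers, not part of any product mentioned here.

```python
import re
import xml.etree.ElementTree as ET


def style_to_tag(style_name):
    """Turn a style name such as 'Heading 1' into a usable element name."""
    tag = re.sub(r"[^A-Za-z0-9_.-]+", "-", style_name.strip()).strip("-")
    # An element name may not begin with a digit, hyphen or period.
    if not tag or not (tag[0].isalpha() or tag[0] == "_"):
        tag = "_" + tag
    return tag


def export(paragraphs, root_name="document"):
    """Wrap each (style, text) pair in an element named after its style."""
    root = ET.Element(root_name)
    for style, text in paragraphs:
        ET.SubElement(root, style_to_tag(style)).text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: (style name, paragraph text) pairs from an authoring tool.
paragraphs = [
    ("Heading 1", "Overview"),
    ("Body Text", "XML & SGML"),
    ("Pull Quote", "Tags are free."),
]
xml_text = export(paragraphs)
print(xml_text)
```

Each style becomes its own tag, so nothing has to be forced into a fixed, predefined tagset; the serializer also escapes characters such as the ampersand so the result stays well-formed.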

One role of Java, then, will be to do interesting things with the content. As Tim Bray, one of the co-editors of XML, pointed out, "XML gives Java something to chew on."

SGML vendors hop on board. On the day of the announcement, many vendors at the SGML conference were quick to endorse XML. ArborText, Stilo and SoftQuad, providers of SGML editing software, promptly whomped up XML import and export filters. Berger-Levrault and EBT announced their intention to support XML in their products.

ArborText's XML extension

Photo: Trivial extension. ArborText rigged an XML extension to its SGML editing tool five days after the spec was published. For SGML authoring vendors, XML can be easily exported. Importing tags and attributes is also easy, but some tools are better than others at handling the ad-hoc document type definitions that could result from XML editors of the future.


The few critics we were able to find were SGML proponents concerned that XML would be perceived as "SGML Light," a way to get the benefits of SGML without the work. In a published editorial last month, Brian Travis, editor of the <TAG> newsletter, decried the XML effort, saying "the best thing that could happen to the working group would be for it to disband." On the day of the announcement, though, Travis back-pedaled, saying XML might be useful as a way to extend HTML. But he still expressed strong reservations. "It is not a reasonable archive format for serious publishers. I'm afraid some will get the idea that they could now move forward and implement XML, thinking that they were doing SGML."

Another skeptic is John McFadden, CEO of OmniMark Technologies, a leading provider of SGML conversion tools and services. In an interview after the show, McFadden expressed his concern that users will be misled into thinking that just because XML is easier to parse, users can expect to see great new tools. "That's like saying if only it were easier to make the chips for VCRs, then it would be easier to program a VCR. Users don't make parsers, and vendors aren't prevented from making great applications because of the difficulty of writing parsers."

McFadden also worries that XML will destabilize SGML, creating confusion that will scare off potential users.

That is a possible scenario if XML is successful, but in our view not one to be feared. A good many of the "serious" publishers for whom SGML is appropriate are already committed to full implementations of the standard. For them, XML is a big win -- it gives them a much richer target for down-translating their documents for online presentation. Those who have not yet implemented SGML now have an alternative to archiving binary source files. XML, while simpler than SGML, is still generic markup -- distinguishing form from content, and freeing content from the constraints of vendors and their applications. Like HTML, XML will bring generic markup into the consumer consciousness in a way that SGML has never done. It will encourage the use of tags to add context to full-text retrieval. It will greatly simplify the use of documents as interfaces to database applications. And ultimately it will make creating interesting Web pages even easier than it is today.

But what about Microsoft and Netscape? If Microsoft and Netscape, the two leading suppliers of Web browsing tools, ignore XML, it will have little impact. But there are very good indications that Microsoft will support it. Jean Paoli of Microsoft is part of the 11-member editorial review board that makes final decisions on XML, and so was intimately involved with the development of the specification. The former head of development at Grif, an SGML vendor, Paoli is now leading a development team working on Internet Explorer. He was reluctant on the day of the announcement to commit his company's support for XML, saying only that his group is actively pursuing client-side computing. Microsoft has made no formal announcement about plans to support XML in Internet Explorer, but Paoli said the company continues to work on adding browser functionality based on open standards and customer feedback. Paoli also said Microsoft does not have a competing plan under development.


Microsoft is inviting customer feedback on whether Internet Explorer should support XML in a future release. Send your comments on this topic to Jean Paoli, jeanpa@microsoft.com

The one vendor noticeably absent from the working group was Netscape, even though it was invited to participate. "Their answer was that they don't do SGML," Bosak told the audience. Later, Netscape officials declined to comment.

We sincerely hope Netscape will realize the benefit XML will be to the intranet market on which it is focusing its efforts. It has been obvious for quite some time that HTML needs to evolve (that's why there's a new revision every six months); XML gives everyone, both vendors and users, a standard for marking up the elements and attributes of their Web documents, in a way that is vendor neutral. Embracing XML gives Netscape a chance to show that it really does know how to make a competitive browser, and other competitive client applications.

Up next: hypertext and style sheets. Having rocked the SGML community with the most radical SGML development in a decade, the W3C working group plans to continue with two more phases of XML. According to Jon Bosak, chair of the W3C SGML Editorial Review Board, the next phase will add more complex hyperlinking, and the third phase will address style sheets, using either an improved version of CSS or an online version of DSSSL.

Here again, Netscape is the wildcard. WebWeek recently published a story that Netscape was considering limiting its support for CSS in order to focus on JavaScript. It could also go ahead and develop proprietary methods for implementing more complex hyperlinks. Both would be mistakes and only delay the inevitable acceptance of a more open, vendor-neutral approach. When Netscape owned the browser market, it could define the standards for HTML. But it no longer owns that market. It's time for it to reassure customers that it will foster free competition in software development, not document formats.

Worthy gift. It is fitting that XML was unveiled at the SGML conference that celebrated the tenth anniversary of the standard. XML is arguably the most important gift the SGML community has given to the general computing market since the standard's inception. Though HTML used the precepts of generic markup, it is a fixed tagset, not a metalanguage for describing an infinite number of document types. XML is not full SGML, but it has been carefully crafted to be as close as possible without placing an undue burden on developers and users. SGML may still be the right answer for professional publishing and niche business applications, but it has proven to be too complex and expensive for most mainstream business applications. In contrast, XML brings generic markup into the realm of everyday documents. If there ever is going to be an SGML for the mainstream, this is it.

And in the interim, at last we have a sensible way to extend HTML. As Bray nicely phrased it, "We can finally get off the HTML definition treadmill." For that, we owe this group our thanks and continued support, and should demand nothing less from our vendors.

Mark Walter

More SGML '96 show coverage will appear in an upcoming issue of The Seybold Report on Publishing Systems.


Selling SGML Using XML on the Web

March 10-12, 1997
Marriott Mission Valley
San Diego, CA

The Graphic Communications Association has planned a spring event focusing on XML. The two-day conference will include technical briefings on XML and on how to use SGML with the Web. XML product demonstrations are anticipated.

For more information, contact Marion Elledge at the GCA:
(703) 519-8193 or melledge@gca.org.


XML is also expected to be a topic at Seybold Seminars New York, coming up April 21-25, 1997. Further details are available at (415) 578-6900 or on the Web site: www.seyboldseminars.com



© Copyright 1997 Seybold Seminars; Last modified 4/10/97 at 12:34:54 PM.