
Monday, January 5, 1998 at 1:50:45 PM Pacific

Tim Bray on Lark

From Tim Bray, tbray@textuality.com:

This isn't finished yet, but I am uncomfortable that for the last couple of months there has not been a Java-language XML syntax checker that is really close to the spec. So, at

http://www.textuality.com/Lark/

I have placed the Lark 1.0 final beta, and release 0.8 of Larval, a validating XML processor based on Lark.

Status Snapshot

In my tests, Lark does all the things it used to do, and also rejects 163 of 164 of James' non-well-formed documents; the odd-doc-out is the notorious 088.xml, which I consider to be well-formed and which represents a policy issue that the WG is going to have to make a call on. The only hole I know about in Lark at the moment is that it doesn't do text declarations in external parsed entities; but I won't have time to work on it until next week, so I decided to ship anyhow.

James' test-suite represents a tremendous resource: a de-facto reproducible test of conformance that will greatly increase the interoperability of XML docs. We are all considerably in his debt, not for the first time; thank you once again, James.

Larval validates quite a few things, and boots out quite a few other things, but has not been tested to anywhere near the same level that Lark has.

These class files have been compiled with Microsoft VJ++1.1 and tested with Microsoft JView and with Sun's Java from JDK 1.1.3. At the moment, if I compile with the Sun fastjavac, then neither the Sun nor the Microsoft Java interpreter can use the resulting class files. Admittedly, Lark.java and Larval.java are a pretty severe strain on a compiler; on the other hand, I know about some pretty egregious violations of the Java language spec that will get by both of those compilers. I suspect that my current problem with fastjavac is as likely to be me breaking some rule about what can be in a static string (J++ is forgiving) as it is a compiler bug.

Source Available

There's a policy change in that the Java source code for every Lark class is now included in the distribution. If you actually look at Lark.java and Larval.java, you'll see that this is not quite as generous as it sounds.

Still Undone

Lark 1.0 has also not received a walk-through looking for dead code, software rot, and unconcealed evidence of stupidity, and has not been profiled. It is noticeably but not unbearably slower than 0.97, but it'll be faster before I'm done. I have established with previous releases that with a little work any given release of Lark can be made faster and smaller. This release has grown in size by 10K.

Lark's UTF8 processing is still pretty shaky - I think that the Java libraries are moving in the right direction fast enough to make it not cost-effective for me to wrangle with this much more at the moment. Since XmlInputStream is now available at source level, if someone were to want to plug in some robust UTF8 code that'd be lovely.
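
For readers on a present-day JDK, strict UTF-8 decoding can be delegated to the standard library. The sketch below is not Lark code and does not touch XmlInputStream; the class name StrictUtf8 is hypothetical, and java.nio postdates the JDK 1.1 discussed here. It only shows a decoder configured to report malformed input rather than silently replace it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Lark distribution.
class StrictUtf8 {
    // Decodes the bytes as UTF-8 and throws CharacterCodingException
    // on malformed sequences instead of substituting U+FFFD.
    static String decode(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }
}
```

A truncated multi-byte sequence, such as a lone 0xC3 byte, makes decode() throw rather than return text.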

Everything else is, I think, conformant without exception.

Validation

Larval is just another version of Lark; but it has some more methods, most notably

public void validate(boolean)

which as a side-effect turns on processExternalEntities; there is a new validityError() callback in the Handler. Of course there are a bunch of new classes with names like DTD and Validator and Attlist and so on.
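
To make the shape of this concrete, here is a minimal stand-alone sketch. It is not Larval's source: only validate(boolean), processExternalEntities, validityError() and Handler come from the description above, while SketchProcessor, SketchHandler, startElement and the error text are hypothetical stand-ins. Whether validate(false) turns external entity processing back off is not stated, so the sketch leaves it on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Handler with the new callback.
class SketchHandler {
    final List<String> validityErrors = new ArrayList<>();

    // Invoked for a validity error, as opposed to a well-formedness error.
    void validityError(String message) {
        validityErrors.add(message);
    }
}

// Hypothetical stand-in for the processor; not the real Larval class.
class SketchProcessor {
    private boolean validating = false;
    private boolean processExternalEntities = false;

    // Turning validation on also switches on external entity processing.
    // Assumption: turning validation off leaves that setting alone.
    public void validate(boolean on) {
        validating = on;
        if (on) {
            processExternalEntities = true;
        }
    }

    boolean processesExternalEntities() {
        return processExternalEntities;
    }

    // Toy check standing in for real validation: an element name that is
    // not in the declared list is reported through the callback.
    void startElement(String name, List<String> declared, SketchHandler handler) {
        if (validating && !declared.contains(name)) {
            handler.validityError("element '" + name + "' is not declared");
        }
    }
}
```

With validation off, startElement reports nothing; after validate(true), an undeclared element name produces exactly one validityError() call.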

Larval is done this way because if you just use Lark, you'll never have to include any validation class files. I can get away with this because even though Java doesn't have a preprocessor, Lark does. Presumably I will use the same trick to do SAX.

The validation implementation is pretty naive. Rather than compiling tables, Larval builds a data structure more or less isomorphic to the declaration in the DTD, and then laboriously pokes around in it every time it sees a start/end tag. I think it proves that (a) a naive implementation of validation can be done, and (b) this isn't the right way to do it in the long term. However, it's nowhere near as slow as I expected, and is good enough to be useful already in debugging XML documents.
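
A rough illustration of the naive approach, under stated assumptions: the Model class below is hypothetical and is not one of Larval's classes. It mirrors an element declaration as a tree of names, sequences and choices, each carrying an occurrence indicator, and matches a finished child list against that tree by brute force. As a simplification, it checks once per element rather than at each start and end tag.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical content-model node shaped like the declaration itself.
// occurs is ' ' (exactly once), '?', '*' or '+'.
class Model {
    enum Kind { NAME, SEQ, CHOICE }

    final Kind kind;
    final String name;        // used by NAME only
    final List<Model> parts;  // used by SEQ and CHOICE only
    final char occurs;

    private Model(Kind kind, String name, List<Model> parts, char occurs) {
        this.kind = kind;
        this.name = name;
        this.parts = parts;
        this.occurs = occurs;
    }

    static Model name(String n, char occurs) {
        return new Model(Kind.NAME, n, null, occurs);
    }

    static Model seq(char occurs, Model... parts) {
        return new Model(Kind.SEQ, null, Arrays.asList(parts), occurs);
    }

    static Model choice(char occurs, Model... parts) {
        return new Model(Kind.CHOICE, null, Arrays.asList(parts), occurs);
    }

    // True if the complete list of child element names matches this model.
    boolean accepts(List<String> children) {
        return ends(children, 0).contains(children.size());
    }

    // Every position where a match starting at 'from' can end,
    // taking the occurrence indicator into account.
    private Set<Integer> ends(List<String> kids, int from) {
        Set<Integer> result = new TreeSet<>();
        if (occurs == '?' || occurs == '*') {
            result.add(from); // zero occurrences are allowed
        }
        Set<Integer> frontier = once(kids, from);
        if (occurs == ' ' || occurs == '?') {
            result.addAll(frontier);
            return result;
        }
        // '*' or '+': keep repeating from each newly reached position.
        while (!frontier.isEmpty()) {
            Set<Integer> next = new TreeSet<>();
            for (int pos : frontier) {
                if (result.add(pos)) {
                    next.addAll(once(kids, pos));
                }
            }
            frontier = next;
        }
        return result;
    }

    // Every position where exactly one occurrence starting at 'from' can end.
    private Set<Integer> once(List<String> kids, int from) {
        Set<Integer> result = new TreeSet<>();
        switch (kind) {
            case NAME:
                if (from < kids.size() && kids.get(from).equals(name)) {
                    result.add(from + 1);
                }
                break;
            case CHOICE:
                for (Model m : parts) {
                    result.addAll(m.ends(kids, from));
                }
                break;
            case SEQ:
                result.add(from);
                for (Model m : parts) {
                    Set<Integer> next = new TreeSet<>();
                    for (int pos : result) {
                        next.addAll(m.ends(kids, pos));
                    }
                    result = next;
                }
                break;
        }
        return result;
    }
}
```

For a declaration like (title, (para | list)*), the tree is Model.seq(' ', Model.name("title", ' '), Model.choice('*', Model.name("para", ' '), Model.name("list", ' '))). It accepts the children title, para, list and rejects a lone para. Compiling the model into tables would avoid re-walking the tree for every element, which is the trade-off described above.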

Other Changes

The doPI method now has separate args for target and remainder.

There is a doXmlDeclaration method.

There is a new method to tell Lark what name it should use for the document entity, e.g. in error reporting.

There is an ESIS class that extends Handler; I don't claim this to be anything like a real SGML ESIS, but it's sure useful in automated testing.
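
As an illustration of the doPI change, the sketch below shows one way the text between "<?" and "?>" could be divided into a target and a remainder before being handed to a two-argument callback. PiSplitter and its methods are hypothetical; the actual doPI signature is not shown here, so no callback is modelled.

```java
// Hypothetical helper, not part of the Lark distribution.
class PiSplitter {
    // Splits the body of a processing instruction (the text between
    // "<?" and "?>") into { target, remainder }. The target runs up to
    // the first white space; the remainder starts after that white space.
    static String[] split(String body) {
        int i = 0;
        while (i < body.length() && !isXmlSpace(body.charAt(i))) {
            i++;
        }
        String target = body.substring(0, i);
        while (i < body.length() && isXmlSpace(body.charAt(i))) {
            i++;
        }
        return new String[] { target, body.substring(i) };
    }

    // The four white space characters XML 1.0 allows.
    static boolean isXmlSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }
}
```

A body with no white space, such as "page-break", yields that target and an empty remainder.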

Future Plans

Lark's version will remain 1.0 as long as XML does (a long time, I hope). Once it's no longer 'final beta' Lark.toString() will add a build date-stamp to the "1.0" version string.

Larval will progress toward 1.0 as I get around to doing some really serious testing on it.


This page was last built on 1/5/98; 2:08:38 PM by Dave Winer. dave@scripting.com