Automated Deep Linking
Wednesday, September 1, 1999 by Dave Winer.
It's been almost a week since I last wrote. I've been working on Frontier, and that's been taking almost all my working cycles, except for the time I've been enjoying in the company of our membership on discuss.userland.com.
This last week has been incredible on our servers. The goodwill and intelligence and positive spirits have been amazing. We've tackled some difficult issues, and have stayed on track, no flames, very little invasiveness, mostly just happy people being smart. In their power, not whining about how someone let them down. OK. Let me correct that. At least not *all* of the messages have been whining and powerless. (I've even posted a few of these, expressing my frustration at wanting to buy a new PC that works and being powerless to do so!)
In the background I've had a DaveNet piece in progress about syndication and aggregation. It's a very long piece, and it's been difficult for me to ship it because the issues are so complex. So instead I'm going to break it into smaller pieces and hope to have the whole area covered by the end of the month. There's something new going on, and it touches on some sensitive issues like who owns your home page -- you or the web?
These days when I expose an issue it gets better coverage than in the old days. Three recent examples:
1. The missing Microsoft Instant Messaging spec was first reported here in DaveNet. While the other news sources focused on the real-time battles between AOL and Microsoft (a legitimate story) we focused on Microsoft. Their story had a huge gaping hole: they were calling for AOL to open up their servers, but Microsoft's servers weren't open! A few days later, an announcement came from Microsoft that they would provide the specification in two weeks, and this morning it was released. Loop closed. From here it will be interesting to see if any compatible software is offered by other developers.
2. We reported on the deep linking issue in general, and InfoWorld's deep linking policy specifically, and as a result, InfoWorld reversed their policy, giving us credit for having raised the issue and bringing about the change.
3. And finally, we've been reporting on antique software, the re-releases of VisiCalc, Dennis Ritchie's C compiler, the early Borland languages and the Living Videotext outliners. This resulted in coverage from the Washington Post, the London Sunday Times, the San Jose Mercury-News, and continuing coverage on the MacInTouch website.
I mention these things because it's important to note success. It's not always like pissing in the wind. Sometimes when an issue is raised there are next steps and (in my humble opinion) good things happen.
So, in that spirit, I raise a question for all people and organizations who run news-oriented websites.
Do you mind if other websites continually read the HTML of your home page, break it up into individual headlines and links to stories, repurpose the URLs to point thru their servers, and distribute those links to people who want to read your news, along with news from many other sources?
This is not an idle question -- because there are several sites, some operating out of public view, that are running "story scraper" agents that do exactly this. The sites they visit are the most popular news sites. To my knowledge, few if any of them have asked for the permission of the sites they scrape.
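To make the mechanics concrete, here's a minimal sketch in Python of what such a scraper does. The home page HTML, the site names, and the redirect URL are invented for illustration; a real scraper would fetch the page over HTTP and be far more robust:

```python
from html.parser import HTMLParser

# A hypothetical snippet of a news site's home page.
HOME_PAGE = """
<html><body>
<a href="/stories/1999/09/01/merger.html">Big Merger Announced</a>
<a href="/stories/1999/09/01/ipo.html">Another IPO Soars</a>
</body></html>
"""

class StoryScraper(HTMLParser):
    """Collects (headline, link) pairs from the anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.stories = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href and data.strip():
            # Repurpose the URL so traffic routes thru the scraper's server
            # (scraper.example.com is a made-up host).
            redirect = "http://scraper.example.com/redirect?url=" + self._href
            self.stories.append((data.strip(), redirect))
            self._href = None

scraper = StoryScraper()
scraper.feed(HOME_PAGE)
for headline, link in scraper.stories:
    print(headline, "->", link)
```

The headlines and links come straight out of the page's HTML, with the destination URLs rewritten -- exactly the behavior described above.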
We've engaged in discussion with some of the agents. Some of the discussion has been respectful and positive, and some of it hasn't. I don't have much doubt as to whether it's fair or legal. I think it is unfair, and while I'm not a lawyer, judge or jury, I also think that if one of the target sites objected and sued, the scrapers would be forced to stop.
On the other hand, there's been very little awareness of this issue. If the scrapers sent a request for permission, would the question get any consideration? It's my hope, with this article, that that process can begin now.
In my opinion, it's clearly OK to link to a story on another site as it relates to a story I'm writing, or a subject I believe my readers are interested in. For example, it would be fair for me to point to the InfoWorld article where they reversed their position on deep linking, or to the Washington Post article about antique software.
But, again in my opinion, it would *not* be fair for me to write a script that reads the home page of the San Jose Mercury-News, pulls off links to all the articles and the headline text and emails them, mixed in with links scraped from the New York Times home page, to a list, unless the Mercury-News and the Times had given their permission.
The collection of stories on a home page, in my opinion, is a creative work. It's intellectual property, and is protected if there's a copyright notice on the home page. That's why the copyright notice is there -- to claim ownership, and to claim the right to determine how the information and ideas on the page, as a collection, are used.
There is a line here. On one side of the line, we're quoting legitimate sources; on the other side, it's theft of intellectual property. It is my belief that at least some of the current scraper engines are on the wrong side of the line.
Now the proponents of scraping point out that what they're doing is exactly what the search engines, Alta Vista, Excite, Google, et al, are doing. And they would be right, up to a point.
The search engines don't present complete works. They take the links and invisibly store them in a database, and when a search is submitted, they send the links back to interested readers.
I use search engines, and I depend on them for my work; it would be difficult for me to say they're doing anything wrong.
Enter RSS, an XML-based format for delivering to agents exactly the information that the scrapers are deriving from the HTML source of the websites. I've written about RSS before; my company has made a substantial investment in it. We co-designed the format with Netscape, and we've built software on both sides of RSS: a content management system that generates RSS, and an aggregator that collects it. I believe RSS is the right way to go for the following reasons.
First, I prefer to build on a format that's used solely for syndication, so the webmaster's intent can't possibly be misunderstood. If you have an RSS file that's registered with my syndication engine, you clearly want me to aggregate it. There's no other purpose for an RSS file.
Second, by storing the information in RSS, there's no mistaking a non-news link for a news story. Have you ever used a search engine to search for a story on one subject and had it return twenty hits because the pages merely contained links to the story? With a separate syndication file, again, there's no mistaking what a link is about. Only news stories belong in an RSS file, not links from the template for the HTML rendering.
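To illustrate how unambiguous this is, here's a small Python sketch that reads an RSS channel (the channel contents below are invented) and pulls out exactly the story titles and links, with no guessing about which links are news:

```python
import xml.etree.ElementTree as ET

# A hypothetical RSS 0.91-style channel. Every <item> is, by
# definition, a news story -- there's nothing else it could be.
RSS = """<rss version="0.91">
<channel>
<title>Example News</title>
<link>http://news.example.com/</link>
<item>
<title>First Story</title>
<link>http://news.example.com/stories/1</link>
</item>
<item>
<title>Second Story</title>
<link>http://news.example.com/stories/2</link>
</item>
</channel>
</rss>"""

root = ET.fromstring(RSS)
items = [(item.findtext("title"), item.findtext("link"))
         for item in root.iter("item")]
for title, link in items:
    print(title, "->", link)
```

Compare this with the scraper approach: no heuristics about templates, navigation links, or ads -- the aggregator reads only what the webmaster chose to syndicate.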
Third, we can extend RSS in the future to include other important information. For example, The Motley Fool, which covers publicly traded companies, wants to include a set of ticker symbols with every story. This makes perfect sense for their kind of content, but would not be needed for a site that covers a set of open source development projects. By starting fresh with a new format, just for syndication, we'll have more room to expand the format in the future to respond to opportunities in the market.
Here's an example of an RSS file, one that lists four stories that are available on The Motley Fool website:
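The example file isn't reproduced here, but an RSS 0.91-style channel along those lines would look something like this -- the headlines and URLs below are invented placeholders, not actual Fool stories:

```xml
<rss version="0.91">
  <channel>
    <title>Motley Fool</title>
    <link>http://www.fool.com/</link>
    <description>News and commentary from The Motley Fool</description>
    <item>
      <title>First story headline (hypothetical)</title>
      <link>http://www.fool.com/stories/one.htm</link>
    </item>
    <item>
      <title>Second story headline (hypothetical)</title>
      <link>http://www.fool.com/stories/two.htm</link>
    </item>
    <item>
      <title>Third story headline (hypothetical)</title>
      <link>http://www.fool.com/stories/three.htm</link>
    </item>
    <item>
      <title>Fourth story headline (hypothetical)</title>
      <link>http://www.fool.com/stories/four.htm</link>
    </item>
  </channel>
</rss>
```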
My recommendation: if you want to disseminate links to your stories thru the new channels that are developing, make a small investment by producing a parallel version of your home page in RSS format, and let us know where it is.
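As a sketch of how small that investment is, here's a Python script that renders an RSS 0.91-style file from the same story list that drives a home page. The story list and URLs are invented; a real site would pull them from its content management system (and would escape characters like & and < in titles, which is omitted here for brevity):

```python
# Hypothetical story list -- in practice this is the same data
# your content management system uses to render the home page.
stories = [
    ("Our Lead Story", "http://news.example.com/stories/lead.html"),
    ("A Second Story", "http://news.example.com/stories/second.html"),
]

def to_rss(channel_title, channel_link, stories):
    """Render an RSS 0.91-style channel from (title, link) pairs."""
    lines = ['<rss version="0.91">', "<channel>",
             "<title>%s</title>" % channel_title,
             "<link>%s</link>" % channel_link]
    for title, link in stories:
        lines += ["<item>",
                  "<title>%s</title>" % title,
                  "<link>%s</link>" % link,
                  "</item>"]
    lines += ["</channel>", "</rss>"]
    return "\n".join(lines)

print(to_rss("Example News", "http://news.example.com/", stories))
```

Run it whenever the home page is rebuilt and publish the output next to the page, and any aggregator can pick up your stories without scraping.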
Our channel list is public, so any competitive aggregation engine can read the list, and churn your links thru their engine. We may profit at some point in the future from an economic system that rewards us, and our affiliates, for bringing traffic to your site. We already have competition, from Netscape, and there's sure to be more of that, because this is a valuable service and there's lots of room for diversification.
So here's the summary. If you run a popular site, you're probably being scraped right now. This is not a someday issue, it's a right-now issue. If you want your links spread far and wide, consider supporting RSS. If you don't want that to happen, make your intent clear.