Web crawler software for Frontier.
Crawlers are also known as spiders. Same idea...
This suite crawls a website, downloading all files it can find to your local hard disk.
I wrote this suite because I needed to grab some sites for testing with Frontier 4.2.
Full source is provided; it makes good sample code for lots of other kinds of projects, including a remote site outliner, a database-oriented remote search engine, or a site loader that brings a remote site into Frontier's object database.
You are welcome to take this code apart and put it back together any way you like.
Download and install
The Crawler suite requires: Frontier 4.2, NetEvents, and the Tag Extraction Kit.
Please download the Crawler suite:
ftp://ftp.scripting.com/userland/crawler.sit.hqx or
http://ws3.scripting.com/ftpMirror/userland/crawler.sit.hqx
Double-click each of the Frontier files, clicking OK on all confirmation prompts.
Setup
Choose Crawler from Frontier's Suites menu. This adds a new menu to your menu bar called Crawl.
Choose Open Site List from the Crawl menu. An outline opens with a single entry, the URL to a website. This is the site that will be downloaded. You can add as many sites as you want to this list.
Set the folder to hold crawled sites using the Choose Folder command in the Crawl menu.
You may not want to download large files. Open the Exception list and add file suffixes that you don't want to download. Initially, we set it up so the crawler won't download files ending with .hqx or .bin.
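To make the suffix check concrete, here's a small sketch in Python (not the suite's actual UserTalk code) of how an exception list like this can be applied to a URL. The `.hqx` and `.bin` defaults come from the paragraph above; everything else is illustrative.

```python
import os.path
from urllib.parse import urlparse

# Default exceptions, as shipped: don't download Mac archive formats.
EXCEPTIONS = {".hqx", ".bin"}

def should_download(url):
    """Return False if the URL's file suffix is on the exception list."""
    path = urlparse(url).path          # ignore any query string
    suffix = os.path.splitext(path)[1].lower()
    return suffix not in EXCEPTIONS
```

A URL like `crawler.sit.hqx` is skipped because only the final suffix (`.hqx`) is compared, which matches how suffix-based filters usually behave.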
Get ready to crawl!
It's time. Let's do it.
Choose the Crawl! command from the Crawl menu. The NetEvents app launches. A log outline window opens. URLs start piling up. In the Finder, the website starts materializing. It's methodical. It gets everything and puts it in the right place. I love watching this thing!
Eventually the spider finishes, and the formerly remote site is now on my hard disk.
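The methodical behavior described above is the classic crawl loop: take a URL off the queue, fetch it, save it under a local path that mirrors its URL path, and queue any same-site links it contains. Here's a minimal sketch of that loop in Python rather than UserTalk; the `fetch` callable and the in-memory mirror are assumptions for illustration, not how the suite is actually structured.

```python
import re
from urllib.parse import urljoin, urlparse

def crawl(start_url, fetch):
    """Sketch of a crawl loop. fetch(url) -> page text.
    Returns {local_path: page_text}, mirroring the site's URL paths."""
    site = urlparse(start_url).netloc
    queue, seen, mirror = [start_url], {start_url}, {}
    while queue:
        url = queue.pop(0)
        text = fetch(url)
        path = urlparse(url).path or "/"
        if path.endswith("/"):
            path += "index.html"       # a folder URL maps to its default page
        mirror[path.lstrip("/")] = text
        # Queue every same-site link we haven't seen yet.
        for href in re.findall(r'href="([^"]+)"', text, re.I):
            link = urljoin(url, href)
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return mirror
```

Because every fetched URL goes into `seen` before it is queued, the loop terminates even when pages link back to each other, and each file lands at the local path matching its place on the site.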
A couple of crawler logs are on the Sample Crawler Log page.
Other options
Choose the Open Crawler Data command from the Crawl menu. A table, user.crawler, opens. You can use Frontier's table editor to change many of the parameters for crawling. Here are some notes.
Excellence!
The Crawler suite builds on the work of Chuck Shotton, Arnold Lesikar, Danis Georgiadis and Brent Simmons.
It's a community at work producing excellent results.
That's realllly coooool.
Thanks!
© Copyright 1996-97 UserLand Software. This page was last built on 5/7/97; 1:16:47 PM.
It was originally posted on 2/18/97; 5:38:49 PM.
Internet service provided by Conxion.