$Id: Release-Notes-1.3.txt,v 1.14 1995/09/05 21:00:00 duane Exp $

                            TABLE OF CONTENTS

	1. Gatherer
		IP-based filtering
		Username/Passwords
		Post-Summarizing
		Cache directory cleanup
		Limit on retrieval size
		Support for HTML-3.0, Netscape, and HotJava DTDs
	2. Broker
		Brokers.cf
		Glimpse 3.0
		Verity/Topic
		WAIS, Inc.
		Displaying SOIF attributes in results
		Uniqify duplicate objects
		Glimpse inline queries
	3. Cache
		Persistent disk cache
		Common logfile format
		Improved Internet protocol support
		TTL calculation by regexp
		Improved customizability
		Security
		Performance Enhancements
		Portability
		Optional Code
	4. Miscellaneous
		Admin scripts


========================================================================

GATHERER

    IP-based filtering
    ------------------

	It is now possible to use an IP network address in a Host
	filter file.  The IP address is matched using regular
	expressions.  This means that periods must be escaped.  For
	example:

		Allow	128\.196\..*
		Deny	.*

    Username/Passwords
    ------------------

	It is now possible to gather password-protected documents from
	HTTP and FTP servers.  In both cases, it is possible to specify
	a username and password as a part of the URL.  The format is

		 ftp://user:password@host:port/url-path
		http://user:password@host:port/url-path

	With this format, the "user:password" part is kept as part
	of the URL string throughout Harvest.  This may enable
	anyone who uses your Broker(s) to access password-protected
	pages.

	It is also possible to have "hidden" username and password
	information.  These are specified in the gatherer.cf file.
	For HTTP, the format is

		HTTP-Basic-Auth: realm username password

	'realm' is the same as the 'AuthName' parameter given in an
	NCSA .htaccess file.  In the CERN HTTP configuration, the
	realm value is called 'ServerId'.

	For FTP, the format in the gatherer.cf file is

		FTP-Auth: hostname[:port] username password
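
	For example, with a hypothetical realm, host, and credentials:

		HTTP-Basic-Auth: ProtectedPages webuser s3cret
		FTP-Auth: ftp.example.com:21 webuser s3cret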


    Post-Summarizing
    ----------------

	It is now possible to "fine-tune" the summary information
	generated by the Essence summarizers.  A typical application of
	this would be to change the 'Time-to-live' attribute based on
	some knowledge about the objects.  So an administrator could
	use the post-summarizing feature to give quickly-changing
	objects a lower TTL, and very stable documents a higher TTL.

	Objects are selected for post-processing if they meet a
	specified condition.  A condition consists of three parts: an
	attribute name, an operation, and some string data.  For
	example:

		city == 'New York'

	In this case we are checking whether the 'city' attribute is
	equal to the string 'New York'.  For exact string matching, the
	string data must be enclosed in single quotes.  Regular expressions
	are also supported:

		city ~ /New York/

	Negative operators are also supported:

		city != 'New York'
		city !~ /New York/

	Conditions can be joined with '&&' (logical and) or '||' (logical or)
	operators:

		city == 'New York' && state != 'NY'

	When all conditions are met for an object, some number of
	instructions are executed on it.  There are four types of
	instructions which can be specified:

        1.  Set an attribute to a specific string.
	    Example:

		time-to-live = "86400"

        2.  Filter an attribute through some program.  The attribute
	    value is given as input to the filter.  The output of the
	    filter becomes the new attribute value.
	    Example:

		keywords | tr A-Z a-z

        3.  Filter multiple attributes through some program.  In this
	    case the filter must read and write attributes in the
	    SOIF format.
	    Example:

		address,city,state,zip ! cleanup-address.pl

	4.  A special case instruction is to delete an object.  To do
	    this, simply write

		delete()

	The conditions and instructions are combined together in a
	"rules" file.  The format of this file is somewhat similar to a
	Makefile; conditions begin in the first column and instructions
	are indented by a tab-stop.  Example:

		type == 'HTML'
			partial-text | cleanup-html-text.pl

		URL ~ /users/
			time-to-live = "86400"
			partial-text ! extract-owner.sh

		type == 'SOIFStream'
			delete()

	This rules file is specified in the gatherer.cf file with the
	Post-Summarizing: tag, e.g.:

		Post-Summarizing: lib/myrules

    Cache directory cleanup
    -----------------------

	The gatherer uses a local disk cache of objects it has
	retrieved.  These objects are stored in the tmp/cache-liburl
	subdirectory.  Prior to v1.3 this cache directory was left in
	place after the gatherer completed.  This caused confusion and
	problems when users re-ran the gatherer and expected to see new
	or changed objects appear.

	Now the default behaviour is to remove the cache-liburl
	directory after the gatherer completes successfully.  Users who
	want to leave this directory in place will need to add

		Keep-Cache: yes

	to their gatherer.cf file.  

    Limit on retrieval size
    -----------------------

	The code for retrieving FTP, HTTP, and Gopher objects now stops
	transferring after 10 megabytes.  This prevents bogus URLs from
	filling up local disk space.  This limit can currently
	only be changed by modifying the source in src/common/url (look
	for "MAX_TRANSFER_SIZE").

    Support for HTML-3.0, Netscape, and HotJava DTDs
    ------------------------------------------------

        DTDs for HTML-3.0, Netscape, and HotJava have been added to
	the collection in lib/gatherer/sgmls-lib/HTML/.  To take advantage
	of these DTDs your HTML pages should begin with one of:

	<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">
	<!DOCTYPE HTML PUBLIC "-//Netscape Comm. Corp.//DTD HTML//EN">
	<!DOCTYPE HTML PUBLIC "-//Sun Micorsystems Corp.//DTD HotJava HTML//EN">

========================================================================

BROKER

    Brokers.cf
    ----------

	Prompted by security concerns, there is a change in the way
	that BrokerQuery.pl.cgi connects with a broker.  The old method
	had the broker hostname and port number passed as CGI
	arguments.  The new way passes the broker short name instead.
	This name is then looked up in the file
	$HARVEST_HOME/brokers/Brokers.cf.  The CreateBroker program
	will add the correct entry to Brokers.cf.

	The old method still works for backwards compatibility.  With
	the new method, the broker name must appear in the Brokers.cf
	file.  If it does not, the user receives an error message.
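
	Each Brokers.cf entry maps a broker's short name to its host and
	port.  The field layout, host, and port shown here are
	illustrative assumptions (CreateBroker writes the real entry for
	you); a hypothetical entry might look like:

		demo-broker	broker.example.com	8501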

	The Brokers.cf file also enables features such as
		* quickly relocating brokers to other machines
		* using dual brokers for 24hr/day availability

	If you change your broker port number (in admin/broker.conf)
	then don't forget to change it here as well.

    Glimpse 3.0
    -----------

	Harvest now uses Glimpse 3.0 which includes a number of bugfixes
	and performance improvements:

		* A new data structure considerably speeds up queries
		  on large indexes.  Typical queries now take less
		  than one second, even for very large indexes.

		* Incremental indexing is now fully supported.

		* The on-disk indexing structures have been improved in
		  several ways.  As a result, indexes from previous
		  versions are incompatible.  When upgrading to this
		  release, you should remove all .glimpse_* files in
		  the broker directory before restarting the broker,
		  as shown in the example after this list.

		* Glimpse can now handle more than 64k objects in the
		  broker.
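
	For example, for a broker installed in the (hypothetical)
	directory $HARVEST_HOME/brokers/demo:

		% cd $HARVEST_HOME/brokers/demo
		% rm -f .glimpse_*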

    Verity/Topic
    ------------

	This release includes support for using Verity Inc.'s Topic
	indexing engine with the broker.  In order to use Topic with
	Harvest, a license must be purchased from Verity (see
	http://www.verity.com/).

	At this point, Harvest does not make use of all the features in
	the Topic engine.  However, it does include a number of features
	that make it attractive:

		* Background indexing: the broker will continue to 
		  service requests as new objects are added to the
		  database.

		* Matched lines (or Highlights): lines containing query
		  terms are displayed with the result set.

		* Result set ranking

		* Flexible query operations such as proximity, stemming,
		  and thesaurus expansion.

    WAIS, Inc.
    ----------

	This release includes support for using WAIS Inc.'s commercial
	WAIS indexing engine with the broker.  To use commercial WAIS
	with Harvest, a license must be purchased from WAIS Inc. (see
	http://www.wais.com/).  The WAIS/Harvest combination offers
	the following features:

		* Structured queries (not available with Free WAIS).

		* Incremental indexing

		* Result set ranking

		* Use of native WAIS operators, e.g. ADJ to find one
		  word adjacent to another.

    Displaying SOIF attributes in results
    -------------------------------------

        In v1.2 the Broker allowed specific attributes from matched
	objects to be returned in the result set.  However, there
	was no real support for this in BrokerQuery.pl.cgi.

	Now it is possible to request SOIF attributes with the use
	of HTML FORM facilities.  A simple approach is to include
	a select list in the query form.  For example:

		<SELECT NAME="attribute" MULTIPLE>
		<OPTION VALUE="title">Title
		<OPTION VALUE="author">Author
		<OPTION VALUE="date">Date
		<OPTION VALUE="subject">Subject
		</SELECT>

	In this manner, the user may control which attributes are
	displayed.  The layout of these attributes in HTML is 
	controlled by the '<FormatAttribute>' specification in
	$HARVEST_HOME/cgi-bin/lib/BrokerQuery.cf.

    Uniqify duplicate objects
    -------------------------

	Occasionally a broker may end up with duplicate entries for
	individual URLs.  This usually happens when some aspect of the
	Gatherer changes (its description, hostname, or port number).  To remedy
	this situation, there is a "uniqify" command on the broker
	interface.  On the admin.html page it is described as "Delete
	older objects of duplicate URLs."  When two objects with the
	same URL are found, the object with the least-recent timestamp
	is removed.

    Glimpse inline queries
    ----------------------

	In v1.2 using Glimpse with the broker required the broker to
	fork a 'glimpse' process for every query.  Now the broker can
	make the query directly to the 'glimpseserver'.  If glimpseserver
	is disabled or not running for some reason, the broker will use
	the previous approach and spawn a glimpse process to handle the
	query.

========================================================================

CACHE

    Persistent disk cache
    ---------------------

	Upon startup the cache now "reloads" cached objects from a
	previous session.  While this adds some delay at startup,
	heavily used sites will benefit, especially where filling
	the cache with popular objects is expensive or time-consuming.

	To disable the persistent disk cache, add the '-z' flag to
	cached's command line.  This emulates the previous behaviour,
	which is to remove all previously cached objects at startup.
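
	For example:

		% cached -z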

    Common logfile format
    ---------------------

	The cache now supports the httpd common logfile format which is
	used by many HTTP server implementations.  This makes the
	cache's access logfile compatible with many of the freely
	available logfile analyzers.  Note that the cache does not
	(yet) log the object size for requests which result in a
	'TCP_MISS'.

	There have been many improvements to the debugging output
	as well.

    Improved Internet protocol support
    ----------------------------------

	Numerous improvements and bugfixes have been made to HTTP, 
	FTP, and Gopher protocol implementations.  Additionally,
	a user-contributed patch for proxying to WAIS servers has
	been included.

    TTL calculation by regexp
    -------------------------

	It is now possible to have the cache calculate time-to-live 
	values based on URL regular expressions.  This allows an
	administrator to set large TTLs for images and lower TTLs
	for text, for example.

	These are specified in the cached.conf file, beginning with
	the tag 'ttl_pattern'.  For example:

		ttl_pattern	^http://	1440	20%	43200

	The second field is a POSIX-style regular expression.  Invalid
	expressions are ignored.

	The third value is an absolute time-to-live, given in minutes.
	This value is ignored if negative.  A zero value indicates that
	an object matching the pattern should not be cached.  NOTE: the
	absolute TTL is used only if the percent-of-age (described
	next) is not used.

	The fourth value is a percent-of-age factor.  If the object is
	sent with valid Last-Modification timestamp information, then
	the object's TTL is calculated as

		TTL = (current-time - last-modified) * percent-of-age / 100;

	If the percent-of-age field is zero, or a last-modification
	timestamp is not present, then the algorithm looks at the
	absolute TTL value next.

	The fifth field is an upper bound on the TTL returned by the
	percent-of-age method.  It is specified in minutes,
	with the default being 30 days.  This is provided in case a
	buggy server implementation returns ridiculous last-modification
	data.
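
	To make the calculation concrete, consider the example
	'ttl_pattern' line above.  An HTTP object whose Last-Modified
	timestamp is 10 days (14400 minutes) old would receive

		TTL = 14400 * 20 / 100 = 2880 minutes (2 days)

	which is below the 43200-minute upper bound, so 2880 minutes is
	used.  An object with no Last-Modified timestamp would instead
	fall back to the absolute TTL of 1440 minutes (one day).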

    Improved customizability
    ------------------------

	More options have been added to the cache configuration file:

		* String-based stoplist to deny caching of objects
		  which contain the stoplist string (e.g.: "cgi-bin").

		* Support for "quick aborting."  When the client drops
		  a connection, the cache will abort the data transfer
		  immediately.  Useful for caches behind SLIP/PPP
		  connections.

		* The number of DNS lookup servers is now configurable.
		  The default is three.

		* The trace mail message sent to cs.colorado.edu
		  (containing only the IP address and port number of
		  your cache) can now be turned off.
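
	A cached.conf fragment exercising some of these options might
	look like the following.  The directive names shown here are
	illustrative assumptions; consult the comments in the sample
	cached.conf for the exact spellings and defaults.

		stoplist cgi-bin
		dns_children 3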

    Security
    --------

	IP-based access controls are now supported.  The administrator
	may deny access to specific IP networks/hosts, or may only
	allow access from specific networks/hosts.  Two access control
	lists are maintained:  one for clients/browsers using the cache
	(the "ascii port") and another for the remote instrumentation
	interface (cache manager).

    Performance Enhancements
    ------------------------

	Several performance enhancements have been made to the cache:

		* The LRU replacement algorithm is faster and more
		  efficient.  In conjunction with the new replacement
		  policy, the default low water mark has been lowered
		  from 80% to 60%.

		* The in-memory usage (metadata) of cached objects has
		  been reduced to 80-100 bytes per object.

		* The retrieval of various statistics from the instrumentation
		  interface is much faster.

		* User-configurable garbage collection intervals reduce the
		  number of times this relatively expensive operation is
		  performed.

		* Memory management has been cleaned up and streamlined.
		  Our checks with Purify report no memory leaks.

    Portability
    -----------

	The TCL libraries are no longer needed to compile the cache.
	User-contributed patches have been incorporated for better
	support on BSD, Linux, IRIX, and HP-UX systems.


    Optional Code
    -------------

        The following are recent additions to the code.  They can be
	optionally included by setting '-D' flags in the Makefile.
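
	For example, depending on how the Makefile is organized, a
	feature can be enabled by adding its flag to the compiler
	options:

		CFLAGS = -O -DCHECK_LOCAL_NETS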

        CHECK_LOCAL_NETS

	    Define this to optimize retrievals from servers on your
	    local network.  If your cache is configured with a parent,
	    objects from your local servers may be pulled through the
	    parent cache.  To always retrieve local objects directly
	    define CHECK_LOCAL_NETS and rebuild the source code.  Then
	    add your local IP network addresses to the cache configuration
	    file with the 'local_ip' directive.  For example:

		local_ip 128.138.0.0
		local_ip 192.54.50.0

	LOG_FQDN

	    Client IP addresses are logged in the access log file.  To
	    log the fully qualified domain name instead, define LOG_FQDN
	    and rebuild the code.  WARNING: This is not implemented
	    efficiently and may adversely affect your cache performance.
	    Before each line is written to the access log file, a call
	    to gethostbyaddr(3) is made.  This library call may block
	    an arbitrary amount of time while waiting for a reply from
	    a DNS server.  While this function blocks, the cache will
	    not be able to process any other requests.  You have been warned.
	
	APPEND_DOMAIN

	    Define this and use the 'append_domain' configuration directive 
	    to append a domainname to hostnames without any domain
	    information.
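
	    For example, to complete unqualified hostnames with a
	    hypothetical local domain (the leading dot shown is an
	    assumption about the directive's syntax):

		append_domain .example.com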

	USE_WAIS_RELAY

	    Define this and use the `wais_relay' configuration directive
	    to allow WAIS queries to be cached and proxied.
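
	    For example (the hostname is hypothetical and the argument
	    layout is an assumption; see the sample cached.conf):

		wais_relay wais.example.com 210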

========================================================================

MISCELLANEOUS

    Admin scripts
    -------------

	A number of sample scripts are provided to aid in administering
	your Harvest installation:

	RunGatherers.sh:  This script can be run from your ``/etc/rc''
	scripts to start the Harvest gatherer daemons at boot time.
	It must be customized with the directory names of your gatherers.
	It is installed in $HARVEST_HOME/lib/gatherer.

	RunBrokers.sh:  This script can be run from your ``/etc/rc''
	scripts to start the Harvest brokers at boot time.
	It must be customized with the directory names of your brokers.
	It is installed in $HARVEST_HOME/lib/broker.

	harvest-check.pl:  This Perl script is designed to be run 
	occasionally as a cron(1) job.  It will contact your gatherers
	and brokers and report on any which seem to be unreachable.
	The list of gatherers and brokers to contact can be specified
	at the end of the script.
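
	For example, the following crontab entry (with a hypothetical
	installation path) runs the script every night at 4 AM:

		0 4 * * * /usr/local/harvest/lib/harvest-check.pl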