Collector updates the ReferList. It takes a list of server names from the specified files and tries to load the Cindex of each of those servers by invoking another program called GetIndex. It can be run at set times by using crontab(1).
GetIndex is a program that loads documents from the WWW.
Although other programs are available
that do the same thing, they handle error messages differently, so
a new one had to be written.
GetIndex supports the following input:
GetIndex <URL> <filename> <suppress error>
It accepts the URL and writes the loaded document to the specified file. Error messages can be
suppressed by specifying YES as the <suppress error> argument.
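The behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the original GetIndex: the function name get_index, the 30-second timeout, the wording of the error message, and the True/False return value are all assumptions made for the example.

```python
import sys
import urllib.error
import urllib.request


def get_index(url, filename, suppress_errors=False):
    """Load `url` and write the document to `filename`.

    Returns True on success and False on failure. The return value and
    the message format are assumptions; the text only says that error
    messages can be suppressed.
    """
    try:
        # Assumed timeout so that a dead server cannot stall the caller.
        with urllib.request.urlopen(url, timeout=30) as response:
            data = response.read()
        with open(filename, "wb") as out:
            out.write(data)
        return True
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # Corresponds to passing YES as the third argument when suppressed.
        if not suppress_errors:
            print(f"GetIndex: cannot load {url}: {exc}", file=sys.stderr)
        return False
```

A caller such as Collector could then test the return value to decide whether the server belongs in the error log.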
After obtaining the Cindex from another server, Collector either replaces that server's old entry in
ReferList with the new one or, if the server has no entry yet, adds the entry to ReferList. The ReferList
is sorted according to the number of keywords in an entry. There is currently no limit
on how many servers a ReferList can hold.
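The replace-or-add step and the sort can be sketched as follows. The entry layout (a dictionary with "server" and "keywords" fields) and the name update_refer_list are hypothetical, and the text does not say whether the sort is ascending or descending; the sketch assumes that entries with the most keywords come first.

```python
def update_refer_list(refer_list, server, keywords):
    """Return a new ReferList with the entry for `server` replaced or added.

    `refer_list` is a list of {"server": str, "keywords": list} entries
    (an assumed layout, not the real file format).
    """
    # Drop the old entry for this server, if any, so it is replaced.
    entries = [e for e in refer_list if e["server"] != server]
    entries.append({"server": server, "keywords": list(keywords)})
    # Assumed order: entries with the most keywords come first.
    entries.sort(key=lambda e: len(e["keywords"]), reverse=True)
    return entries
```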
If Collector encounters an error while trying to collect the Cindex from a server, it records this server in an error log file and counts how many times it has failed. Once this count is larger than 3, Collector stops collecting from this server for a period: it tries to get the Cindex only once in every 3 times it encounters the same server.
If the server's Cindex is loaded successfully after some unsuccessful tries, its name is erased
from the error log file.
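One possible reading of this failure handling is sketched below. The in-memory dictionary stands in for the error log file, and the names should_collect and record_result, as well as the exact point at which the "every 3 times" counter resets, are assumptions; the text leaves those details open.

```python
MAX_FAILURES = 3  # from the text: back off once the count is larger than 3
RETRY_EVERY = 3   # from the text: retry on every 3rd encounter


def should_collect(error_log, server):
    """Decide whether Collector should try `server` on this encounter."""
    entry = error_log.get(server)
    if entry is None or entry["failures"] <= MAX_FAILURES:
        return True
    # Server is backed off: count encounters and retry on every 3rd one.
    entry["encounters"] += 1
    if entry["encounters"] >= RETRY_EVERY:
        entry["encounters"] = 0
        return True
    return False


def record_result(error_log, server, success):
    """Update the error log after an attempt to load the Cindex."""
    if success:
        # A successful load erases the server from the error log.
        error_log.pop(server, None)
    else:
        entry = error_log.setdefault(server, {"failures": 0, "encounters": 0})
        entry["failures"] += 1
```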
Initially, it was planned that a Web browser would
be modified as part of the updating process for the Referring List.
When the user followed a link to
request a document from a remote server, the browser would
invoke Collector to collect the Cindex from that server.
After some consideration, it was concluded that
modifying a browser or writing a new browser might take too much time, and
that it might not be used by people because of the popularity of other browsers
such as Netscape.
Modifying a popular browser, such as Netscape, is virtually impossible because
its source code is not available. Therefore other options were considered.
Another idea was to read the history files that some Web browsers leave in users' home directories, and so
obtain a list of servers from which to load the Cindex.
That idea was abandoned because it would mean searching through
the private files of users.
The current Rumour system is supported by a proxy server.
A proxy server [1] is used mostly
to cache information
for a local network, or to
act as a firewall machine (a machine that is part of the local network but also
has access to the outside of the firewall).
At first I tried to write my own proxy server using wwwlib, but the resulting program had trouble
with POST requests, could not load all images, and could not load from NCSA servers, so that
proxy server was abandoned. Currently, Rumour uses the
log of the standard CERN proxy server to get the
server names; this method is simple and effective.
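Extracting server names from a proxy log can be sketched as below. The sketch assumes the log is in the common log format, with the full requested URL inside the quoted request field. That is an assumption about how the proxy is configured, not something stated above, and the name servers_from_log is hypothetical.

```python
import re
from urllib.parse import urlsplit

# Matches the URL inside a quoted request field such as
# "GET http://host/path HTTP/1.0" (assumed log layout).
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+)')


def servers_from_log(lines):
    """Return the distinct server names in `lines`, in order of first use."""
    servers = []
    seen = set()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        try:
            host = urlsplit(match.group(1)).hostname
        except ValueError:
            continue  # malformed URL in the log; skip the line
        # Requests without a host (e.g. local paths) are ignored.
        if host and host not in seen:
            seen.add(host)
            servers.append(host)
    return servers
```

The resulting list could be written to one of the files that Collector reads its server names from.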