Next: Implementation Description Up: Updating in Rumour Previous: Jack - Updating

Collector - Updating ReferList

Collector updates the ReferList. It takes a list of server names from the specified files and tries to load the Cindex of each of those servers by invoking another program called GetIndex. It can be run at set times by using crontab(1).

GetIndex

GetIndex is a program that loads documents from the WWW. Other programs that do the same thing are available, but they handle error messages differently, so a new one had to be written.

GetIndex supports the following input:

							GetIndex <URL> <filename> <suppress error>

It accepts a URL and writes the retrieved document to the specified file. Error messages can be suppressed by specifying YES as the third argument.
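The command line above might be approximated as follows. This is only a minimal Python sketch and not the original GetIndex program: the names get_index and main, the exit codes, and the rule that any third argument other than YES leaves error messages enabled are all assumptions made for illustration.

```python
import sys
import urllib.request


def get_index(url, filename, suppress_error=False):
    """Fetch url and write the response body to filename.

    Returns True on success and False on failure (assumed convention).
    """
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            data = response.read()
        with open(filename, "wb") as out:
            out.write(data)
        return True
    except (OSError, ValueError) as exc:
        # URLError and HTTPError are subclasses of OSError;
        # ValueError covers strings that are not valid URLs.
        if not suppress_error:
            print(f"GetIndex: cannot load {url}: {exc}", file=sys.stderr)
        return False


def main(argv):
    """Mimic: GetIndex <URL> <filename> <suppress error>."""
    if len(argv) != 4:
        print("usage: GetIndex <URL> <filename> <suppress error>",
              file=sys.stderr)
        return 2  # hypothetical exit code for a usage error
    ok = get_index(argv[1], argv[2], suppress_error=(argv[3] == "YES"))
    return 0 if ok else 1
```

To use the sketch as a command, a caller would finish the script with sys.exit(main(sys.argv)). Returning a status instead of raising lets a caller such as Collector decide what to do with a failed server.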

Ordering in ReferList

After obtaining the Cindex from another server, Collector either replaces that server's old entry in ReferList with the new one or, if the server has no entry yet, adds the new entry to ReferList. The ReferList is sorted according to the number of keywords in an entry. Currently there is no limit on how many servers a ReferList can hold.
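The replace-or-add step and the ordering can be sketched as below. The layout of a ReferList entry is not described here, so the sketch assumes an entry is a (server name, keyword list) pair; the function name update_refer_list is hypothetical, and sorting with the largest keyword count first is also an assumption, since the text does not state the direction.

```python
def update_refer_list(refer_list, server, keywords):
    """Return a new ReferList with the entry for server replaced or added.

    refer_list is a list of (server, keywords) tuples (an assumed layout).
    """
    # Drop any old entry for this server, then add the fresh one.
    entries = [(s, k) for s, k in refer_list if s != server]
    entries.append((server, list(keywords)))
    # Assumption: entries with more keywords come first.
    entries.sort(key=lambda entry: len(entry[1]), reverse=True)
    return entries
```

Because the function builds a new list, the old ReferList stays intact until the caller chooses to overwrite it.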

Error loading Cindex from Server

If Collector encounters an error while trying to collect the Cindex from a server, it records the server in an error log file and counts how many times it has failed. Once this count exceeds 3, Collector stops collecting from the server on every run for a period; instead, it tries to get the Cindex only once in every 3 times it encounters that server.

If the Cindex is loaded successfully after some unsuccessful tries, the server's name is erased from the error log file.
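One possible reading of this retry rule is sketched below. The error log is modelled as an in-memory dictionary instead of a file, and the names should_try, record_result, and the "encounters" counter are hypothetical; the actual log format used by Collector is not given here.

```python
MAX_FAILURES = 3   # from the text: back off once the count exceeds 3
RETRY_EVERY = 3    # from the text: retry on every 3rd encounter


def should_try(error_log, server):
    """Decide whether Collector should contact server on this encounter."""
    record = error_log.get(server)
    if record is None or record["failures"] <= MAX_FAILURES:
        return True
    # The server is backed off: count this encounter, retry on every third.
    record["encounters"] += 1
    return record["encounters"] % RETRY_EVERY == 0


def record_result(error_log, server, success):
    """Update the error log after an attempt to load the Cindex."""
    if success:
        error_log.pop(server, None)  # erase the server from the log
    else:
        record = error_log.setdefault(
            server, {"failures": 0, "encounters": 0})
        record["failures"] += 1
```

Under this reading, a server that has failed four times is skipped on the next two encounters and retried on the third.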

Initially, the plan was to modify a Web browser as part of the updating process for the Referring List: whenever the user followed a link to request a document from a remote server, the browser would invoke Collector to collect the Cindex from that server.

After some consideration, it was concluded that modifying a browser or writing a new one might take too much time, and that the result might not be used because of the popularity of other browsers such as Netscape. Modifying a popular browser, such as Netscape, is virtually impossible because its source code is not available. Therefore other options were considered.

Another idea was to take the history files that some Web browsers leave in users' home directories and use them to get a list of servers from which to load Cindex. That idea was abandoned because it would mean checking through the private files of users.

The current Rumour system is supported by a proxy server. A proxy server [1] is used mostly to cache information for a local network, or to act as a firewall machine (a machine that is part of the local network but also has access outside the firewall).

In the beginning I tried to write my own proxy server with wwwlib, but the result had trouble with POST requests, could not load all images, and could not load from NCSA servers, so that proxy server was abandoned. Currently, Rumour uses the log of the standard CERN proxy server to get the server names; this method is simple and effective.
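Extracting server names from a proxy log could look roughly like the sketch below. It assumes that the log uses the common log format and that proxied requests record the full URL in the request field; the function name servers_from_log and the regular expression are illustrative, not part of Rumour.

```python
import re
from urllib.parse import urlsplit

# Assumed line shape:
#   host ident user [date] "METHOD URL PROTOCOL" status bytes
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+)')


def servers_from_log(lines):
    """Return the distinct server names requested through the proxy,
    in order of first appearance."""
    seen = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # skip lines that do not look like requests
        host = urlsplit(match.group(1)).hostname
        if host and host not in seen:
            seen.append(host)
    return seen
```

The resulting list could then be written to the file of server names that Collector reads.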






Tommy Wing Yiu Tsui
Tue Nov 7 10:21:32 EST 1995