Information Retrieval in Distributed Hypertexts

Two approaches towards supporting information retrieval in distributed hypertexts have been used:

By building (and periodically updating) an index database for the whole hyperdocument, the first part of a query (finding candidate documents to be searched) can be supported. The database can deliver the addresses (URLs) of nodes that satisfy certain conditions, such as containing a given word in their title or header.
Searching can be done by navigation, meaning that nodes are retrieved by following links and are scanned for the required information. From the links embedded in these nodes, new nodes to retrieve are chosen, and the links leading to them are followed. Because this search mechanism consumes considerable time and network resources, a clever selection algorithm and a good starting point are important.
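The navigational approach can be illustrated with a minimal sketch. It is not the algorithm of any particular tool: the in-memory table `WEB` stands in for real network fetches, the URLs are hypothetical, and the selection rule (follow links from relevant nodes first, stop after `max_fetches` retrievals) is only one simple assumption about what a "clever selection algorithm" might do.

```python
from collections import deque

# Toy hyperdocument: URL -> (text, outgoing links). All URLs are hypothetical.
WEB = {
    "http://a.example/": ("home page about hypertext",
                          ["http://a.example/b", "http://a.example/c"]),
    "http://a.example/b": ("notes on cooking", ["http://a.example/d"]),
    "http://a.example/c": ("hypertext retrieval methods", ["http://a.example/d"]),
    "http://a.example/d": ("retrieval of distributed hypertext nodes", []),
}

def navigational_search(start, keyword, max_fetches=10):
    """Follow links from `start`; return URLs whose text contains `keyword`."""
    queue = deque([start])
    seen = {start}
    hits = []
    fetches = 0
    while queue and fetches < max_fetches:
        url = queue.popleft()
        text, links = WEB.get(url, ("", []))  # stands in for a network fetch
        fetches += 1
        relevant = keyword in text
        if relevant:
            hits.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumed selection rule: links found in relevant nodes
                # are explored before links found in irrelevant ones.
                if relevant:
                    queue.appendleft(link)
                else:
                    queue.append(link)
    return hits
```

The `max_fetches` budget makes the cost of the search explicit: a smaller budget saves time and network traffic, but nodes that lie beyond it are never scanned.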

Either way, for a distributed hyperdocument as large and as loosely connected as the World Wide Web, the answers to queries will most likely be incomplete. An index database will probably not contain all the information from all nodes, because the navigation algorithm that builds it cannot be certain of locating every node: parts of the Web may be disconnected, and some nodes may be hidden behind "clickable images" or forms. A navigational search will also be incomplete, because it does not have the time to scan the whole hyperdocument, and it too cannot find documents that are not reachable by navigation. A reasonable compromise is to start a navigational search from the answer given by a very large index database. For the World Wide Web, index databases such as Alta Vista exist, while a navigational search algorithm called the fish-search is available from the Eindhoven University of Technology.
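The compromise of seeding a navigational search from an index answer can be sketched as follows. Everything here is an illustrative assumption: the toy table `PAGES`, the hypothetical URLs, the title-word index, and the fixed depth limit. It is not the fish-search algorithm and does not reflect how any real index database works.

```python
# Toy hyperdocument: URL -> (title, body text, outgoing links). Hypothetical data.
PAGES = {
    "http://x.example/1": ("Caching", "caching in hypertext browsers",
                           ["http://x.example/2"]),
    "http://x.example/2": ("Misc", "replication and caching notes",
                           ["http://x.example/3"]),
    "http://x.example/3": ("Index", "building an index database", []),
    "http://y.example/1": ("Island", "caching on a disconnected site", []),
}

def build_title_index(pages, indexed_urls):
    """Map each lowercase title word to the indexed URLs whose title contains it."""
    index = {}
    for url in indexed_urls:
        title = pages[url][0]
        for word in title.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def hybrid_search(index, pages, keyword, max_depth=2):
    """Seed from the index, then follow links for up to `max_depth` hops."""
    frontier = sorted(index.get(keyword, set()))
    seen = set(frontier)
    hits = []
    for _ in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            _, body, links = pages[url]
            if keyword in body:
                hits.append(url)
            for link in links:
                if link not in seen and link in pages:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return hits

# The index deliberately covers only part of the hyperdocument.
index = build_title_index(PAGES, ["http://x.example/1"])
results = hybrid_search(index, PAGES, "caching")
```

In this toy setup the search finds the second page even though the index never saw it, while the page on the disconnected site stays out of reach of both mechanisms, which is the kind of incompleteness described above.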

As distributed hypertexts are usually read much more frequently than they are written, their performance benefits greatly from replication. Just as cache memory is used between a CPU and main memory, and between main memory and disk, a cache between a local hypertext browser and the actual (remote parts of the) hyperdocument can be used to improve performance and to reduce the network traffic caused by searching for information in a distributed hypertext.
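A minimal sketch of such a cache is shown below. The text does not prescribe a replacement policy, so the least-recently-used eviction, the class name `NodeCache`, and the stand-in `fake_fetch` function are all assumptions made for illustration. The sketch also ignores the question of how cached copies are refreshed when the remote node changes.

```python
from collections import OrderedDict

class NodeCache:
    """Small least-recently-used cache in front of a (slow) remote fetch."""

    def __init__(self, fetch, capacity=2):
        self.fetch = fetch            # callable: URL -> document text
        self.capacity = capacity
        self.store = OrderedDict()    # URL -> document, least recently used first
        self.remote_fetches = 0

    def get(self, url):
        if url in self.store:
            self.store.move_to_end(url)      # mark as most recently used
            return self.store[url]
        doc = self.fetch(url)                # cache miss: go to the network
        self.remote_fetches += 1
        self.store[url] = doc
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used node
        return doc

def fake_fetch(url):
    """Hypothetical stand-in for retrieving a node over the network."""
    return "contents of " + url

cache = NodeCache(fake_fetch, capacity=2)
for url in ["u1", "u2", "u1", "u3", "u1"]:
    cache.get(url)
```

Of the five requests in the usage loop, only three reach `fake_fetch`; the repeated requests for `u1` are answered locally, which is the saving in network traffic that a read-mostly hyperdocument makes possible.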