My Hypertext Encyclopedia System with Information Retrieval

As more structured hypertext documents, such as SGML or XML documents, become available on the Web, there is a growing demand for effective structured document retrieval that exploits both the content and the hierarchical structure of documents and returns document elements of appropriate granularity.

Traditional information retrieval treats the document as the smallest retrieval unit, but in many scenarios a user actually needs to search within parts of a document, with higher precision and finer granularity. Suppose a user who studies the history of military operations would like to find out “what military aircraft were used in Desert Storm”. He or she may retrieve the articles Military Aircrafts and Gulf War as two of the top-ranked results, each of which contains only part of the relevant content. The user then has to scan each (usually very long) document for the relevant information, a time-consuming process that reduces the effectiveness of information retrieval. Such information overload is very common in typical Web search applications.

Today, with the wide use of XML, there is an increasing demand for better structured document retrieval techniques. XML provides a standard and effective way for an author to express the structure of a document explicitly. We developed a hypertext encyclopedia system for retrieving and browsing documents from Encarta.

Previous work on partial retrieval of structured documents has had limited application because it requires structured queries and restricts how retrieval can move along the document structure in response to a query. We put forward a method for flexible element retrieval that retrieves relevant document elements of arbitrary granularity in response to natural language queries. The proposed techniques comprise a novel hierarchical index propagation and pruning mechanism and an algorithm for ranking document elements based on the hierarchical index.
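
To make the idea of propagation, pruning, and element ranking more concrete, the following is a minimal sketch and not the system's actual algorithm, which is not specified here. It assumes that term weights are raw counts, that a child's weights reach its parent scaled by a hypothetical decay factor, that terms falling below a hypothetical threshold are pruned from an element's index, and that an element's score is the sum of its weights for the query terms. The Element class, the function names, and the sample document are all illustrative.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Element:
    """One node of a document tree (hypothetical structure, for illustration)."""
    name: str
    text: str = ""
    children: list = field(default_factory=list)
    index: dict = field(default_factory=dict)


def build_index(node, decay=0.5, threshold=0.3):
    """Index a node bottom-up: own term counts plus decayed child weights."""
    weights = Counter(node.text.lower().split())
    for child in node.children:
        build_index(child, decay, threshold)
        for term, weight in child.index.items():
            weights[term] += decay * weight  # propagation (assumed decay rule)
    # Pruning (assumed rule): drop terms whose weight falls below the threshold,
    # so they are neither stored here nor propagated further up the tree.
    node.index = {t: w for t, w in weights.items() if w >= threshold}


def rank(root, query):
    """Score every element by the summed index weight of the query terms."""
    terms = query.lower().split()
    scored = []
    stack = [root]
    while stack:
        node = stack.pop()
        score = sum(node.index.get(t, 0.0) for t in terms)
        if score > 0:
            scored.append((score, node.name))
        stack.extend(node.children)
    return sorted(scored, key=lambda pair: -pair[0])


# Illustrative document; the section and paragraph names are made up.
doc = Element("article: Gulf War", children=[
    Element("section: Background", "oil invasion kuwait"),
    Element("section: Desert Storm", "aircraft aircraft desert storm", children=[
        Element("paragraph: Air campaign", "stealth aircraft desert storm"),
    ]),
])
build_index(doc)
results = rank(doc, "aircraft desert storm")
for score, name in results:
    print(f"{score:.2f}  {name}")
```

Under these assumptions the section scores highest, ahead of both its own paragraph and the whole article, which illustrates how a natural language query can select an element of intermediate granularity. A real ranking function would likely also normalize for element length, which this sketch omits.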

Flexible element retrieval may return elements of larger or smaller granularity than the query actually requires. A good user interface for browsing the results in the context of the original tree structure is therefore crucial to improving the user’s query process.
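
As a rough illustration of what browsing a result in its tree context could look like, the sketch below prints a document outline and marks the retrieved element, so the user can move up to a larger element or down to a smaller one. The (name, children) tuple representation, the render function, and the sample outline are assumptions made for this example and do not describe the system's actual interface.

```python
def render(node, hits, depth=0):
    """Return outline lines for a (name, children) tree, marking hit elements."""
    name, children = node
    marker = "* " if name in hits else "  "
    lines = [marker + "  " * depth + name]
    for child in children:
        lines.extend(render(child, hits, depth + 1))
    return lines


# Illustrative outline; the section names are made up.
tree = ("Gulf War", [
    ("Background", []),
    ("Desert Storm", [("Air campaign", [])]),
])
outline = render(tree, hits={"Air campaign"})
print("\n".join(outline))
```

The starred line is the retrieved element, and its indentation shows which section and article enclose it.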

Of course, this system has certain limitations, which should be addressed in future research.