Gerd Hoff
and
Martin Mundhenk
Universität Trier, FB IV - Informatik
D-54286 Trier (Germany)
hoffg@uni-trier.de
mundhenk@ti.uni-trier.de
The fast dissemination of new research results on the world-wide web poses a new challenge for search engines. In many research areas, scientists make their newest results available electronically on their web sites long before the results appear in conference proceedings or journals. A decade ago, the state of the art in a research area could be determined by reading conference proceedings and journals in the local library. Nowadays it is additionally necessary to find the newest related electronic publications on the web - in other words, to maintain a virtual library of not-yet-printed literature. Traditional search engines do not help with this task. For example, they do not index PostScript documents, the electronic format of many preprints appearing on the web. The few existing searchable indices for PostScript documents either cover fields that are too broad - all of computer science, for example - to be really helpful, or they depend on a submission procedure that delays the appearance of the documents on the web.
We present a new approach for constructing a virtual library of scientific papers that specializes in a relatively small research area and makes it possible to find the latest documents.
Unlike other approaches, we do not search
for web pages that contain certain keywords. Instead,
we search for web pages created by scientists
who are active in the research area under consideration.
For personal virtual bookshelves,
the list of scientists can, for example, be edited by hand.
For a larger virtual library, we prefer an automated approach
and obtain the scientists' names from
computer science bibliographies on the web,
namely from
Michael Ley's DBLP server
(http://dblp.uni-trier.de/).
This makes it possible to find the names of scientists who have published at
certain specialized conferences or in specialized journals,
and therefore
the names found can be regarded as "certified."
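The name-collection step can be illustrated with a minimal sketch. It assumes the bibliography entries have already been loaded into simple records. The record layout, the venue labels, the author names, and the function `certified_names` are all hypothetical and are not taken from DBLP or from HPSearch.

```python
from collections import defaultdict

# Hypothetical sample records; a real run would load them from a
# bibliography such as DBLP (the loading step is not shown here).
RECORDS = [
    {"venue": "ConfA", "year": 1998, "authors": ["A. Author", "B. Writer"]},
    {"venue": "ConfB", "year": 1999, "authors": ["B. Writer", "C. Scholar"]},
    {"venue": "ConfX", "year": 1999, "authors": ["D. Outsider"]},
]


def certified_names(records, venues, min_papers=1):
    """Return the sorted names of authors with at least `min_papers`
    publications in the selected venues."""
    counts = defaultdict(int)
    for record in records:
        if record["venue"] in venues:
            for author in record["authors"]:
                counts[author] += 1
    return sorted(name for name, n in counts.items() if n >= min_papers)


# Names of everyone who published at the two selected (hypothetical) venues.
names = certified_names(RECORDS, venues={"ConfA", "ConfB"})
```

Raising `min_papers` would restrict the list to authors who appear repeatedly in the selected venues, which is one possible way to tighten what counts as "certified."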
Using these names, our
HPSearch
system searches for the scientists' Home Pages.
Locating these Home Pages is difficult,
because there are no fixed rules for how such pages are constructed.
We identified about 500 characteristics
that control the search for the Home Pages.
Maintaining that information is a further primary task of HPSearch.
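How characteristics might steer such a search can be sketched as a weighted score over candidate pages. The abstract does not list the roughly 500 characteristics, so the five checks, their weights, and the function names below are illustrative assumptions only, not the rules used by HPSearch.

```python
import re


# Hypothetical characteristics with made-up weights; the abstract does not
# say which characteristics HPSearch actually uses.
def score_candidate(url, title, text, surname):
    """Score how likely a page is the Home Page of `surname` (higher is better)."""
    surname = surname.lower()
    url_l, title_l, text_l = url.lower(), title.lower(), text.lower()
    score = 0.0
    if surname in title_l:
        score += 3.0  # name appears in the page title
    if "/~" in url or surname in url_l:
        score += 2.0  # personal directory or name in the URL
    if re.search(r"home\s*page", title_l + " " + text_l):
        score += 1.0  # page calls itself a home page
    if re.search(r"\b(publications|papers|preprints)\b", text_l):
        score += 1.5  # page mentions a list of papers
    if re.search(r"\b(proceedings|call for papers)\b", text_l):
        score -= 2.0  # looks like a conference page instead
    return score


def best_home_page(candidates, surname):
    """Pick the highest-scoring candidate; each one is a (url, title, text) tuple."""
    return max(candidates, key=lambda c: score_candidate(*c, surname))
```

A usage example: calling `best_home_page` with a list of `(url, title, text)` tuples collected for one scientist returns the tuple whose page matches the most characteristics.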
The scientific papers themselves are handled by our search engine Mops. It creates an index of these papers and makes it accessible on a web server. Whereas the search index is administered on the Mops server, the scientific papers from which it is extracted remain on the servers of their owners. In this way, a virtual and distributed library is generated.
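The separation between a central index and remotely stored papers can be sketched with a small inverted index that keeps only terms and URLs. This is an assumption-laden illustration: the class `PaperIndex`, its methods, and the example URLs are hypothetical, and the abstract does not describe the internal data structures of Mops.

```python
import re
from collections import defaultdict


class PaperIndex:
    """Inverted index that stores only terms and URLs, never the papers."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of paper URLs

    def add(self, url, extracted_text):
        """Index the text extracted from the paper located at `url`."""
        for term in set(re.findall(r"[a-z0-9]+", extracted_text.lower())):
            self._postings[term].add(url)

    def search(self, query):
        """Return the sorted URLs of papers that contain every query term."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(hits)


# Example: the index holds URLs only; the papers stay on their owners' servers.
index = PaperIndex()
index.add("http://example.org/~a/p1.ps", "Lower bounds for binary decision diagrams")
index.add("http://example.org/~b/p2.ps", "Complexity of decision problems")
results = index.search("decision diagrams")
```

Because a query returns URLs rather than stored copies, a reader always fetches a paper from its author's own server.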
In this project, we developed and implemented
HPSearch
and
Mops.
We tested our approach by creating two example indices.
One index covers
complexity theory,
and the other covers
BDDs
(binary
decision diagrams, a data structure for VLSI design
and verification).
Both indices are well used in the respective research communities.
The whole software runs on standard PCs.
We conclude that such focused
crawling is very effective for building high-quality
virtual libraries
using ordinary desktop hardware.
A more detailed description of the system can be found at http://www.minet.uni-jena.de/www/fakultaet/mundhenk/papers/virt-lib/.