Collection Synthesis

Donna Bergmark
Cornell Digital Library Research Group
December 1, 2001

As part of a project to create a very large online library of digital objects in the field of Science, Math, Engineering and Technology, we wish to explore the possibility of automatically building collections on certain topics.

The idea of building such collections is not new. CORA is a web crawling system that creates domain-specific portals [], for example. What is different in this project is that we are not looking to classify everything, or to obtain a large search result on previously indexed documents. Rather, we crawl the web starting from no point in particular and build a collection by accretion - adding downloaded documents that appear to be relevant to our topics. The objective is to keep the crawl focussed so that precision can remain high.

We define a collection as a cluster with special characteristics. We are not trying to achieve a clustering of the web; rather we are trying to find clusters of ten or so good documents on well-defined topic areas. In synthesizing collections for educational purposes, as in the NSDL project, precision is far more important than recall. Thus there are two objectives for this research:

In this project, the words cluster and collection can be used interchangeably. A cluster is a group of objects that are closer to each other than to the rest of the domain. This in turn requires a ``distance'' metric. Some metrics that have been used in the past, when the objects are scholarly works, are: citation relationships [3], term similarity [2,8], and co-authorship [7]. Whatever measure is used, a cluster can be represented by its centroid, or average document.

In this project, the word ``collection'' might also be considered synonymous with ``portal'' (or porthole, as ? said). That is because a collection will really just be a collection of links to resources on the web.

A search engine indexes the documents on the web (or as much of it as is feasible) and, when a query comes in, returns the documents that contain the query terms or are otherwise similar to the query. Typically an inverted index on key terms is used to find relevant documents quickly.
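To make the inverted index concrete, the following is a minimal sketch, not a description of any particular search engine. The function names `build_inverted_index` and `search`, the whitespace tokenization, and the three toy documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One posting set per query term; a term never seen gives an empty set.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Toy documents (hypothetical), keyed by document id.
docs = {
    "d1": "linear algebra lecture notes",
    "d2": "notes on abstract algebra",
    "d3": "web crawler design",
}
index = build_inverted_index(docs)
result = search(index, "algebra notes")
```

Answering a query touches only the posting sets of its terms, which is why lookups are fast. The index can still only return documents that were crawled and indexed in the first place.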

Why not simply use a search engine to build our collections? Some search engines, such as teoma (<http://www.teoma.com>), claim to return ``authorities'' [5] on any query posed to them. The problem with search engines is that they index only part of the web [6]. Instead, we plan to crawl the web and ``accrete'' collections around a set of centroids. See Figure 1.


  
Figure 1: One way to build collections.

First, we assume that for any topic t, a virtual collection of online documents exists about that topic. We select a few authorities from this collection (using a search engine) and from them we construct a centroid. This is easy, because we are using a vector space model to represent the web. That is, each document, as well as the centroid, is represented as a weighted term vector. The centroid is computed by downloading the top k search results, extracting all the words (modulo a stop list) from the documents, and then computing a single weighted term vector which is the centroid for a virtual cluster.
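The centroid construction just described can be sketched as follows. This is an illustration under stated assumptions, not our production code: the text above does not fix a weighting scheme, so the sketch assumes length-normalized term frequencies, and the tiny stop list, the regular-expression tokenizer, and the names `term_vector` and `centroid` are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop list (assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def term_vector(text, stop_words=STOP_WORDS):
    """Return a unit-length term-frequency vector as a dict term -> weight."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in stop_words]
    counts = Counter(words)
    norm = sum(c * c for c in counts.values()) ** 0.5
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


def centroid(texts):
    """Average the term vectors of the given documents into one vector."""
    vectors = [term_vector(t) for t in texts]
    if not vectors:
        return {}
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: w / len(vectors) for term, w in total.items()}


# Stand-ins for the text of the top k search results (hypothetical data).
top_k = ["The geometry of triangles", "Triangles and circles in geometry"]
c = centroid(top_k)
```

Normalizing each document vector before averaging keeps one long document from dominating the centroid. Terms shared by several of the top k results, such as ``geometry'' here, end up with the largest weights.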

From the point of view of efficiency and selectivity, we build several collections at the same time. For example, using a topic hierarchy for math (<http://mathforum.org/library/toc.html>), we built 35 centroids relating to different math topics. A massive web crawl is then performed. The right part of Figure 1 illustrates what we might have after a partial crawl. Each downloaded document is matched against all 35 centroids by computing the cosine correlation between the document's weighted term vector and each centroid. As the figure illustrates, some documents fall close to a given centroid, most documents are not close to any of the math centroids, and at the end of the crawl, we hope to have 35 different collections.
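The matching step can be sketched as below. The cosine computation is standard. The cutoff `threshold=0.5`, the rule of assigning a document only to its single closest centroid, and the names `cosine` and `best_match` are assumptions made for this sketch, not settings reported here.

```python
import math


def cosine(u, v):
    """Cosine correlation between two sparse term vectors (dicts)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(doc_vec, centroids, threshold=0.5):
    """Return (topic, score) for the closest centroid.

    The topic is None when no centroid reaches the (hypothetical) threshold.
    """
    best_topic, best_score = None, 0.0
    for topic, centroid_vec in centroids.items():
        score = cosine(doc_vec, centroid_vec)
        if score > best_score:
            best_topic, best_score = topic, score
    if best_score < threshold:
        return None, best_score
    return best_topic, best_score


# Two toy centroids standing in for the 35 math centroids (hypothetical data).
centroids = {
    "geometry": {"triangle": 1.0, "angle": 1.0},
    "algebra": {"equation": 1.0, "variable": 1.0},
}
on_topic = {"triangle": 2.0, "angle": 1.0, "proof": 1.0}
off_topic = {"recipe": 1.0, "flour": 1.0}

topic, score = best_match(on_topic, centroids)
rejected = best_match(off_topic, centroids)
```

A document that shares no terms with any centroid scores zero everywhere and is discarded, which matches the expectation that most crawled documents are not close to any of the math centroids. Raising the threshold trades recall for precision.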

For the massive web crawl, we are using Compaq's web crawler, Mercator [4]. This crawler has three advantages: it is very fast, it is polite to servers, and it is extremely extensible. After development of various knobs and controls is complete, we expect to build some interesting collections on a number of topics.

Bibliography

1
W. Arms.
Automated digital libraries: How effectively can computers be used for the skilled tasks of professional librarianship.
D-Lib Magazine: The Magazine of Digital Library Research, July 2000.
<http://www.dlib.org/dlib/july00/arms/07arms.html>.

2
R. K. Belew.
Finding Out About.
Cambridge Press, 2001.

3
E. Garfield.
Mapping the structure of science, pages 98-147.
John Wiley & Sons, Inc. NY, 1979.
Available at <http://www.garfield.library.upenn.edu/ci/chapter8.pdf>.

4
A. Heydon and M. Najork.
Mercator: A scalable, extensible Web crawler.
World Wide Web, 2(4), Dec. 1999.

5
J. Kleinberg.
Authoritative sources in a hyperlinked environment.
Journal of the ACM, 46(5):604-632, 1999.

6
S. Lawrence and C. L. Giles.
Accessibility of information on the web.
Nature, 400(8), July 1999.

7
P. Mutschke.
Enhancing information retrieval in federated bibliographic data sources using author network based stratagems.
In Proceedings of the 5th European Conference ECDL, Darmstadt, Germany, pages 287-299, Sept. 2001.

8
G. Salton.
Automatic Information Organization and Retrieval.
McGraw-Hill, New York, 1968.

9
D. Voss.
Better searching through science.
Science, 293(5537):2024, 2001.
Online at <http://www.sciencemag.org/cgi/content/full/293/5537/2024>.