Donna Bergmark
Cornell Digital Library Research Group
December 1, 2001
The idea of building such collections is not new. CORA is a web crawling system that creates domain-specific portals [], for example. What is different in this project is that we are not looking to classify everything, or to obtain a large search result on previously indexed documents. Rather, we crawl the web starting from no point in particular and build a collection by accretion - adding downloaded documents that appear to be relevant to our topics. The objective is to keep the crawl focussed so that precision can remain high.
We define a collection as a cluster with special characteristics. We are not trying to achieve a clustering of the web; rather we are trying to find clusters of ten or so good documents on well-defined topic areas. In synthesizing collections for educational purposes, as in the NSDL project, precision is far more important than recall. Thus there are two objectives for this research:
In this project, the words cluster and collection can be used interchangeably. A cluster is a group of objects that are closer to each other than to the rest of the domain. This in turn requires a ``distance'' metric. Some metrics that have been used in the past, when the objects are scholarly works, are: citation relationships [3], term similarity [2,8], and co-authorship [7]. Whatever measure is used, a cluster can be represented by its centroid, or average document.
In this project, the word ``collection'' might also be considered synonymous
with ``portal'' (or porthole
A search engine indexes the documents on the web (or as much of it as is feasible) and then, when a query comes in, returns the documents that contain the query terms or are otherwise similar to the query. Typically an inverted index on key terms is used to quickly find relevant documents.
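To make the inverted index idea concrete, the following is a minimal sketch in Python. It is not taken from any particular search engine; the function names (`tokenize`, `build_inverted_index`, `search`), the sample documents, and the choice to require every query term to match are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical toy documents, for illustration only.
docs = {
    "d1": "Linear algebra and matrix theory",
    "d2": "Matrix methods in graph theory",
    "d3": "Cooking with seasonal vegetables",
}
index = build_inverted_index(docs)
matches = search(index, "matrix theory")
```

A lookup touches only the posting sets of the query terms rather than every document, which is why this structure answers queries quickly once the index has been built.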
Why not simply use a search engine to build our collections? Some search engines, such as Teoma (<http://www.teoma.com>), claim to return ``authorities'' [5] on any query posed to them. The problem with search engines is that they index only part of the web [6]. Instead, we plan to crawl the web and ``accrete'' collections around a set of centroids. See Figure 1.
First, we assume that for any topic t, a virtual collection of online documents exists about that topic. We select a few authorities from this collection (using a search engine) and from them we construct a centroid. This is easy, because we are using a vector space model to represent the web. That is, each document, as well as the centroid, is represented as a weighted term vector. The centroid is computed by downloading the top k search results, extracting all the words (modulo a stop list) from the documents, and then computing a single weighted term vector which is the centroid for a virtual cluster.
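The centroid construction described above might be sketched as follows. The text does not specify the term weighting scheme, so this sketch assumes simple normalized term frequency and averages the resulting vectors; the stop list, the function names `term_vector` and `centroid`, and the sample texts are likewise hypothetical.

```python
import re
from collections import Counter

# Illustrative stop list only; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def term_vector(text, stop_words=STOP_WORDS):
    """Return a weighted term vector (term -> normalized frequency)."""
    terms = [t for t in re.findall(r"[a-z]+", text.lower())
             if t not in stop_words]
    counts = Counter(terms)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def centroid(texts):
    """Average the term vectors of the given documents."""
    vectors = [term_vector(text) for text in texts]
    if not vectors:
        return {}
    summed = Counter()
    for vector in vectors:
        summed.update(vector)  # adds the weights term by term
    return {term: weight / len(vectors) for term, weight in summed.items()}


# Stand-ins for the text of the top k search results on one topic.
top_k_texts = ["the geometry of triangles", "triangles and circles"]
geometry_centroid = centroid(top_k_texts)
```

Terms that occur in several of the seed documents, such as `triangles` here, receive a higher weight in the centroid than terms that occur in only one.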
For efficiency and selectivity, we build several collections at the same time. For example, using a topic hierarchy for math (<http://mathforum.org/library/toc.html>), we built 35 centroids relating to different math topics. A massive web crawl is then performed. The right part of Figure 1 illustrates what we might have after a partial crawl. Each downloaded document is matched against all 35 centroids by computing the cosine correlation between the document's weighted term vector and the centroid. As the figure illustrates, some documents fall close to a given centroid, while most documents are not close to any of the math centroids. At the end of the crawl, we hope to have 35 different collections.
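The matching step can be illustrated with a short sketch of the cosine correlation between sparse term vectors. The acceptance threshold of 0.3, the rule of assigning a document only to its single best centroid, and the names `cosine` and `best_collection` are assumptions for illustration; the text does not state how close a document must be to join a collection.

```python
import math


def cosine(u, v):
    """Cosine correlation between two sparse term vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_collection(doc_vector, centroids, threshold=0.3):
    """Return (topic, score) for the closest centroid.

    The topic is None when no centroid reaches the (assumed) threshold.
    """
    best_topic, best_score = None, 0.0
    for topic, centroid_vector in centroids.items():
        score = cosine(doc_vector, centroid_vector)
        if score > best_score:
            best_topic, best_score = topic, score
    if best_score >= threshold:
        return best_topic, best_score
    return None, best_score


# Two hypothetical centroids standing in for the 35 math centroids.
centroids = {
    "geometry": {"triangle": 0.6, "circle": 0.4},
    "algebra": {"matrix": 0.7, "vector": 0.3},
}
on_topic = best_collection({"triangle": 1.0}, centroids)
off_topic = best_collection({"recipe": 1.0}, centroids)
```

Here `on_topic` selects the geometry centroid, while `off_topic` shares no terms with either centroid and is assigned to no collection, which mirrors the observation that most crawled documents are not close to any centroid.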
For the massive web crawl, we are using Compaq's web crawler, Mercator [4]. This crawler has three advantages: it is very fast, it is polite to servers, and it is extremely extensible. After development of various knobs and controls is complete, we expect to build some interesting collections on a number of topics.