Mercator: A Scalable, Extensible Web Crawler

Allan Heydon and Marc Najork
Compaq Systems Research Center
130 Lytton Ave.
Palo Alto, CA 94301
{heydon,najork}@pa.dec.com

Abstract

This paper describes Mercator, a scalable, extensible web crawler written entirely in Java. Scalable web crawlers are an important component of many web services, but their design is not well-documented in the literature. We enumerate the major components of any scalable web crawler, comment on alternatives and tradeoffs in their design, and describe the particular components used in Mercator. We also describe Mercator's support for extensibility and customizability. Finally, we comment on Mercator's performance, which we have found to be comparable to that of other crawlers for which performance numbers have been published.

1 Introduction

Designing a scalable web crawler comparable to the ones used by the major search engines is a complex endeavor. However, due to the competitive nature of the search engine business, there are few papers in the literature describing the challenges and tradeoffs inherent in web crawler design. This paper's main contribution is to fill that gap. It describes Mercator, a scalable, extensible web crawler written entirely in Java.

By scalable, we mean that Mercator is designed to scale up to the entire web, and has been used to fetch tens of millions of web documents. We achieve scalability by implementing our data structures so that they use a bounded amount of memory, regardless of the size of the crawl. Hence, the vast majority of our data structures are stored on disk, and small parts of them are stored in memory for efficiency.

By extensible, we mean that Mercator is designed in a modular way, with the expectation that new functionality will be added by third parties. In practice, it has been used to create a snapshot of the web pages on our corporate intranet, to collect a variety of statistics about the web, and to perform a series of random walks of the web [12].

One of the initial motivations of this work was to collect statistics about the web. There are many statistics that might be of interest, such as the size and the evolution of the URL space, the distribution of web servers over top-level domains, the lifetime and change rate of documents, and so on. However, it is hard to know a priori exactly which statistics are interesting, and topics of interest may change over time. Mercator makes it easy to collect new statistics, or more generally to be configured for different crawling tasks, by allowing users to provide their own modules for processing downloaded documents. For example, when we designed Mercator, we did not anticipate the possibility of using it for the random walk application cited above. Despite the differences between random walking and traditional crawling, we were able to reconfigure Mercator as a random walker without modifying the crawler's core, merely by plugging in modules totaling 360 lines of Java source code.

The remainder of the paper is structured as follows. The next section surveys related work. Section 3 describes the main components of a scalable web crawler, the alternatives and tradeoffs in their design, and the particular choices we made in Mercator. Section 4 describes Mercator's support for extensibility. Section 5 describes some of the web crawling problems that arise due to the inherent anarchy of the web. Section 6 reports on performance measurements and web statistics collected during a recent extended crawl. Finally, Section 7 offers our conclusions.
2 Related Work

Web crawlers (also known as robots, spiders, worms, walkers, and wanderers) are almost as old as the web itself [15]. The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic [22]. Several papers about web crawling were presented at the first two World Wide Web conferences [9, 16, 18]. However, at the time, the web was two to three orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in a crawl of today's web.

Obviously, all of the popular search engines use crawlers that must scale up to substantial portions of the web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler. Unfortunately, the descriptions of these crawlers in the literature are too terse to enable reproducibility.

The Google search engine is a distributed system that uses multiple machines for crawling [4, 11]. The crawler consists of five functional components running in different processes. A URL server process reads URLs out of a file and forwards them to multiple crawler processes. Each crawler process runs on a different machine, is single-threaded, and uses asynchronous I/O to fetch data from up to 300 web servers in parallel. The crawlers transmit downloaded pages to a single StoreServer process, which compresses the pages and stores them to disk. The pages are then read back from disk by an indexer process, which extracts links from HTML pages and saves them to a different disk file. A URL resolver process reads the link file, derelativizes the URLs contained therein, and saves the absolute URLs to the disk file that is read by the URL server. Typically, three to four crawler machines are used, so the entire system requires between four and eight machines.

The Internet Archive also uses multiple machines to crawl the web [6, 14]. Each crawler process is assigned up to 64 sites to crawl, and no site is assigned to more than one crawler. Each single-threaded crawler process reads a list of seed URLs for its assigned sites from disk into per-site queues, and then uses asynchronous I/O to fetch pages from these queues in parallel. Once a page is downloaded, the crawler extracts the links contained in it. If a link refers to the site of the page it was contained in, it is added to the appropriate site queue; otherwise it is logged to disk. Periodically, a batch process merges these logged "cross-site" URLs into the site-specific seed sets, filtering out duplicates in the process.

In the area of extensible web crawlers, Miller and Bharat's SPHINX system [17] provides some of the same customizability features as Mercator. In particular, it provides a mechanism for limiting which pages are crawled, and it allows customized document processing code to be written. However, SPHINX is targeted towards site-specific crawling, and therefore is not designed to be scalable.

3 Architecture of a Scalable Web Crawler

The basic algorithm executed by any web crawler takes a list of seed URLs as its input and repeatedly executes the following steps. Remove a URL from the URL list, determine the IP address of its host name, download the corresponding document, and extract any links contained in it. For each of the extracted links, ensure that it is an absolute URL (derelativizing it if necessary), and add it to the list of URLs to download, provided it has not been encountered before. If desired, process the downloaded document in other ways (e.g., index its content). This basic algorithm requires a number of functional components (a sketch of the loop itself follows the component list below):

- a component (called the URL frontier) for storing the list of URLs to download;
- a component for resolving host names into IP addresses;
- a component for downloading documents using the HTTP protocol;
- a component for extracting links from HTML documents; and
- a component for determining whether a URL has been encountered before.
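The following Java sketch renders this basic loop under simplifying assumptions: it is single-threaded, the class and method names (BasicCrawler, fetch, extractLinks, process) are illustrative rather than Mercator's API, and the in-memory queue and set merely stand in for the disk-backed frontier and URL-seen structures described in the remainder of this section.

    import java.net.URI;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    // Deliberately simplified, single-threaded rendering of the basic crawl loop.
    // Mercator replaces the in-memory queue and set below with the disk-backed
    // URL frontier and URL-seen structures described later in this section.
    public class BasicCrawler {
        private final Queue<URI> frontier = new ArrayDeque<>(); // URLs still to download
        private final Set<URI> seen = new HashSet<>();          // URLs encountered so far

        public void crawl(List<URI> seeds) {
            for (URI seed : seeds) {
                if (seen.add(seed)) {
                    frontier.add(seed);
                }
            }
            while (!frontier.isEmpty()) {
                URI url = frontier.remove();                    // take a URL from the frontier
                byte[] document = fetch(url);                   // resolve the host, download the document
                if (document == null) {
                    continue;                                   // download failed; move on
                }
                for (URI link : extractLinks(url, document)) {  // links, derelativized against the base URL
                    if (seen.add(link)) {                       // keep only URLs not seen before
                        frontier.add(link);
                    }
                }
                process(url, document);                         // optional further processing (e.g., indexing)
            }
        }

        // Placeholders for the functional components enumerated above.
        private byte[] fetch(URI url) { return null; }                             // DNS + HTTP download
        private List<URI> extractLinks(URI base, byte[] doc) { return List.of(); } // HTML link extraction
        private void process(URI url, byte[] doc) { }                              // e.g., index the content
    }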
This section describes how Mercator refines this basic algorithm, and the particular implementations we chose for the various components. Where appropriate, we comment on design alternatives and the tradeoffs between them.

3.1 Mercator's Components

Figure 1 shows Mercator's main components.

[Figure 1: Mercator's main components. The figure depicts the flow from the INTERNET through the DNS Resolver, the protocol modules (HTTP, FTP, Gopher), the RIS, the content-seen test (Doc FPs), the processing modules (Link Extractor, Tag Counter, GIF Stats), the URL Filter (Log), the URL-seen test (URL Set), and the URL Frontier (Queue Files), with the steps numbered 1 through 8.]

Crawling is performed by multiple worker threads, typically numbering in the hundreds. Each worker repeatedly performs the steps needed to download and process a document. The first step of this loop (step 1 in Figure 1) is to remove an absolute URL from the shared URL frontier for downloading.

An absolute URL begins with a scheme (e.g., "http"), which identifies the network protocol that should be used to download it. In Mercator, these network protocols are implemented by protocol modules. The protocol modules to be used in a crawl are specified in a user-supplied configuration file, and are dynamically loaded at the start of the crawl. The default configuration includes protocol modules for HTTP, FTP, and Gopher. As suggested by the tiling of the protocol modules in Figure 1, there is a separate instance of each protocol module per thread, which allows each thread to access local data without any synchronization.

Based on the URL's scheme, the worker selects the appropriate protocol module for downloading the document. It then invokes the protocol module's fetch method, which downloads the document from the Internet (step 2) into a per-thread RewindInputStream, or RIS for short (step 3). A RIS is an I/O abstraction that is initialized from an arbitrary input stream, and that subsequently allows that stream's contents to be re-read multiple times.

Once the document has been written to the RIS, the worker thread invokes the content-seen test to determine whether this document (associated with a different URL) has been seen before (step 4). If so, the document is not processed any further, and the worker thread removes the next URL from the frontier.

Every downloaded document has an associated MIME type. In addition to associating schemes with protocol modules, a Mercator configuration file also associates MIME types with one or more processing modules. A processing module is an abstraction for processing downloaded documents, for instance extracting links from HTML pages, counting the tags found in HTML pages, or collecting statistics about GIF images. Like protocol modules, there is a separate instance of each processing module per thread. In general, processing modules may have side-effects on the state of the crawler, as well as on their own internal state.
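Neither the RIS class nor the processing-module interface is spelled out in the text above, so the following Java sketch only illustrates the described behavior: a rewindable stream that buffers its source so the contents can be re-read, a hypothetical ProcessingModule interface with a single process method, and one module instance per worker thread (via ThreadLocal) so that modules can keep local state without synchronization. Apart from the names RewindInputStream and Tag Counter, which come from the paper, all signatures and implementation details here are assumptions.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Minimal in-memory sketch of the RIS abstraction: it drains an arbitrary
    // input stream once, then allows the buffered contents to be re-read multiple
    // times. Mercator's actual implementation (and its method names) may differ;
    // in particular, a production RIS would bound its memory use.
    class RewindInputStream extends InputStream {
        private final byte[] contents;
        private ByteArrayInputStream cursor;

        RewindInputStream(InputStream source) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            source.transferTo(buffer);              // read the source exactly once
            this.contents = buffer.toByteArray();
            rewind();
        }

        // Reposition the stream at its first byte so it can be read again.
        void rewind() {
            cursor = new ByteArrayInputStream(contents);
        }

        @Override
        public int read() {
            return cursor.read();
        }
    }

    // Hypothetical shape of a processing module: one call per downloaded document.
    interface ProcessingModule {
        void process(String url, RewindInputStream document) throws IOException;
    }

    // Illustrative Tag Counter module; its state is purely instance-local.
    class TagCounter implements ProcessingModule {
        private long tagsSeen = 0;

        @Override
        public void process(String url, RewindInputStream document) throws IOException {
            document.rewind();                      // re-read the already-downloaded content
            for (int c = document.read(); c != -1; c = document.read()) {
                if (c == '<') {
                    tagsSeen++;                     // crude tag count, for illustration only
                }
            }
        }
    }

    // One module instance per worker thread, so process() can update instance
    // fields without any synchronization.
    class PerThreadModules {
        static final ThreadLocal<ProcessingModule> TAG_COUNTER =
                ThreadLocal.withInitial(TagCounter::new);
    }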
Based on the downloaded document's MIME type, the worker invokes the process method of each processing module associated with that MIME type (step 5). For example, the Link Extractor and Tag Counter processing modules in Figure 1 are used for text/html documents, and the GIF Stats module is used for image/gif documents.

By default, a processing module for extracting links is associated with the MIME type text/html. The process method of this module extracts all links from an HTML page. Each link is converted into an absolute URL, and tested against a user-supplied URL filter to determine if it should be downloaded (step 6). If the URL passes the filter, the worker performs the URL-seen test (step 7), which checks if the URL has been seen before, namely, if it is in the URL frontier or has already been downloaded. If the URL is new, it is added to the frontier (step 8).

The above description omits several important implementation details. Designing data structures that can efficiently handle hundreds of millions of entries poses many engineering challenges. Central to these concerns is the need to balance memory use and performance. In the remainder of this section, we describe how we address this time-space tradeoff in several of Mercator's main data structures.

3.2 The URL Frontier

The URL frontier is the data structure that contains all the URLs that remain to be downloaded. Most crawlers work by performing a breadth-first traversal of the web, starting from the pages in the seed set. Such traversals are easily implemented by using a FIFO queue.

In a standard FIFO queue, elements are dequeued in the order they were enqueued. In the context of web crawling, however, matters are complicated by the fact that it is considered socially unacceptable to have multiple HTTP requests pending to the same server. If multiple requests are to be made in parallel, the queue's remove operation should not simply return the head of the queue, but rather a URL close to the head whose host has no outstanding request.

To implement this politeness constraint, the default version of Mercator's URL frontier is actually implemented by a collection of distinct FIFO subqueues. There are two important aspects to how URLs are added to and removed from these queues. First, there is one FIFO subqueue per worker thread. That is, each worker thread removes URLs from exactly one of the FIFO subqueues. Second, when a new URL is added, the FIFO subqueue in which it is placed is determined by the URL's canonical host name. Together, these two points imply that at most one worker thread will download documents from a given web server. This design prevents Mercator from overloading a web server, which could otherwise become a bottleneck of the crawl.

In actual world wide web crawls, the size of the crawl's frontier numbers in the hundreds of millions of URLs. Hence, the majority of the URLs must be stored on disk. To amortize the cost of reading from and writing to disk, our FIFO subqueue implementation maintains fixed-size enqueue and dequeue buffers in memory; our current implementation uses buffers that can hold 600 URLs each.

As described in Section 4 below, the URL frontier is one of Mercator's "pluggable" components, meaning that it can easily be replaced by other implementations. For example, one could implement a URL frontier that uses a ranking of URL importance (such as the PageRank metric [4]) to order URLs in the frontier set. Cho, Garcia-Molina, and Page have performed simulated crawls showing that such ordering can improve crawling effectiveness [7].
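To make the per-thread subqueue scheme concrete, here is a small in-memory Java sketch. The hashing of the host name, the use of java.util.concurrent queues, and all class and method names are assumptions of this sketch; Mercator's actual frontier keeps most of each subqueue on disk behind the 600-URL enqueue and dequeue buffers described above, which is omitted here.

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // In-memory sketch of the polite frontier described above: one FIFO subqueue
    // per worker thread, and all URLs for a given host land in the same subqueue,
    // so at most one thread ever downloads from that host. The disk-backed
    // storage and fixed-size buffers of Mercator's real frontier are omitted.
    class PoliteFrontier {
        private final List<BlockingQueue<URI>> subqueues;

        PoliteFrontier(int workerThreads) {
            subqueues = new ArrayList<>(workerThreads);
            for (int i = 0; i < workerThreads; i++) {
                subqueues.add(new LinkedBlockingQueue<>());
            }
        }

        // Choose the subqueue from the URL's (crudely canonicalized) host name,
        // so that each host maps to exactly one subqueue and hence one thread.
        void add(URI url) throws InterruptedException {
            String host = url.getHost() == null ? "" : url.getHost().toLowerCase();
            int index = Math.floorMod(host.hashCode(), subqueues.size());
            subqueues.get(index).put(url);
        }

        // Worker thread i removes URLs only from its own subqueue.
        URI remove(int workerIndex) throws InterruptedException {
            return subqueues.get(workerIndex).take();
        }
    }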
3.3 The HTTP Protocol Module

The purpose of a protocol module is to fetch the document corresponding to a given URL using the appropriate network protocol. Network protocols supported by Mercator include HTTP, FTP, and Gopher.

Courteous web crawlers implement the Robots Exclusion Protocol, which allows web masters to declare parts of their sites off limits to crawlers [20]. The Robots Exclusion Protocol requires a web crawler to fetch a special document containing these declarations from a web site before downloading any real content from it. To avoid downloading this file on every request, Mercator's HTTP protocol module caches the robots exclusion rules it has fetched for each host.
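As an illustration of how a protocol module might avoid refetching this file, the sketch below keeps a per-host cache of disallowed path prefixes and consults it before every download. The class name, the unbounded map, and the deliberately naive robots.txt parsing (only Disallow lines, no user-agent handling) are all assumptions of this sketch; it is not Mercator's actual module.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URI;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of a per-host robots-exclusion cache consulted before each fetch.
    // The parsing honors only "Disallow:" lines and ignores user-agent sections;
    // the cache is unbounded. Both simplifications would need fixing in a real crawler.
    class RobotsCache {
        private final Map<String, List<String>> disallowedByHost = new ConcurrentHashMap<>();

        // Returns true if the host's robots.txt (fetched at most once) permits this URL.
        boolean allowed(URI url) {
            String host = url.getHost();
            if (host == null) {
                return true;
            }
            List<String> disallowed = disallowedByHost.computeIfAbsent(host, this::fetchRules);
            String path = (url.getPath() == null || url.getPath().isEmpty()) ? "/" : url.getPath();
            return disallowed.stream().noneMatch(path::startsWith);
        }

        // Download and parse http://host/robots.txt once; later lookups hit the cache.
        private List<String> fetchRules(String host) {
            List<String> rules = new ArrayList<>();
            try (InputStream in = new URL("http", host, "/robots.txt").openStream()) {
                String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                for (String line : body.split("\r?\n")) {
                    String trimmed = line.trim();
                    if (trimmed.regionMatches(true, 0, "Disallow:", 0, 9)) {
                        String prefix = trimmed.substring(9).trim();
                        if (!prefix.isEmpty()) {
                            rules.add(prefix);
                        }
                    }
                }
            } catch (IOException e) {
                // No robots.txt, or it could not be read: treat everything as allowed.
            }
            return rules;
        }
    }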