Do concurrent web crawlers usually store visited URLs in a concurrent map or use synchronization to avoid crawling the same pages twice?


I'm playing around with writing a simple multi-threaded web crawler. I see a lot of sources describe web crawlers as obviously parallel because you can start crawling from different URLs, but I never see them discuss how crawlers handle URLs they've already seen. It seems that some sort of global map would be essential to avoid re-crawling the same pages over and over, but how would the critical section be structured? How fine-grained can the locks be to maximize performance? I just want to see a good example that's not too dense and not too simplistic.


If you insist on doing it using only the Java concurrency framework, then ConcurrentHashMap may be the way to go. The interesting method here is ConcurrentHashMap.putIfAbsent; it will give you very good efficiency, and the idea of how to use it is:

You will have some "multithreaded source of incoming URL addresses" from crawled pages - you can use a concurrent queue to store them, or just create an ExecutorService with an (unbounded?) queue in which you place Runnables that will crawl the URLs, as sketched below.
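For illustration, a minimal sketch of that setup might look like the following (the pool size of 8 and the seed URLs are arbitrary choices for the example; Executors.newFixedThreadPool backs its pool with an unbounded LinkedBlockingQueue, so extra tasks simply wait their turn):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Fixed-size pool with an unbounded work queue; submitted Crawler
// runnables (defined further down) queue up until a thread is free.
ExecutorService crawlerPool = Executors.newFixedThreadPool(8);

// Seed the crawl with a few starting URLs; each Runnable checks the
// shared ConcurrentHashMap before doing any actual work.
crawlerPool.submit(new Crawler("http://example.com/"));
crawlerPool.submit(new Crawler("http://example.org/"));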

Inside the crawling Runnables you should have a reference to this shared ConcurrentHashMap of already-crawled pages, and at the very beginning of the run method do:

private final ConcurrentHashMap<String, Long> crawledPages = new ConcurrentHashMap<String, Long>();
...

private class Crawler implements Runnable {
  private final String urlToBeCrawled;

  public Crawler(String urlToBeCrawled) { // constructor: no return type
    this.urlToBeCrawled = urlToBeCrawled;
  }

  @Override
  public void run() {
    // putIfAbsent returns null only for the first thread to claim this URL,
    // so at most one Crawler ever processes a given page.
    if (crawledPages.putIfAbsent(urlToBeCrawled, System.currentTimeMillis()) == null) {
      doCrawlPage(urlToBeCrawled);
    }
  }
}

If crawledPages.putIfAbsent(urlToBeCrawled, timestamp) returns null, then you know this page has not been crawled by anyone yet; since the method puts the value atomically, you can proceed with crawling this page - you're the lucky thread. If it returns a non-null value, then you know someone else has already taken care of this URL, so your Runnable should finish, and the thread goes back to the pool to be used by the next Runnable.
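To close the loop, the links discovered while crawling a page can be submitted back to the same executor; a rough sketch under those assumptions (extractLinks and the crawlerPool field are placeholders for your own fetching/parsing code, not a specific library API):

private void doCrawlPage(String url) {
  // Download and parse the page, then feed every discovered link back
  // into the pool; the putIfAbsent check in run() filters out any URL
  // that has already been claimed by another thread.
  for (String link : extractLinks(url)) { // extractLinks: hypothetical helper
    crawlerPool.submit(new Crawler(link));
  }
}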