The most reliable way to get content from a web page?


What is the most reliable way, in Java, to get the content of a web page given its URL as an input parameter?

Things I have tried:

  1. Jsoup
  2. HtmlUnit
  3. URL and URLConnection

The problem with 1 and 2 is that they sometimes throw a SocketTimeoutException or behave unpredictably, even when the contents of the page can be fetched (and robots.txt allows it).

Using 3, the only way I can measure the load time is by subtracting millisecond timestamps, which is my biggest problem because it yields inaccurate results. Also, to get the content I have to use streams and read it line by line.
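A simplified sketch of approach 3 (the URL and class names are placeholders, not my exact code) — note the load time is just a timestamp subtraction around the whole request:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class UrlConnectionFetch {

    // Holds the result of one fetch: body, content type and elapsed time.
    static class FetchResult {
        final String body;
        final String contentType;
        final long loadTimeMillis;
        FetchResult(String body, String contentType, long loadTimeMillis) {
            this.body = body;
            this.contentType = contentType;
            this.loadTimeMillis = loadTimeMillis;
        }
    }

    // Load time here is a plain subtraction of timestamps around the whole
    // request, which is exactly the imprecision described above.
    static FetchResult fetch(String url) throws Exception {
        long start = System.nanoTime();
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(9_000);   // the 9-second budget mentioned below
        conn.setReadTimeout(9_000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');   // reading line by line
            }
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new FetchResult(body.toString(), conn.getContentType(), elapsedMillis);
    }

    // Only HTML responses are worth parsing for links; everything else
    // (images, CSS, JS) just needs a status/content-type check.
    static boolean isHtml(String contentType) {
        return contentType != null && contentType.toLowerCase().startsWith("text/html");
    }
}
```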

My current implementation uses approach 2, which gives me both the load time and the content type.

For every web page's content I need the content type, load time, etc.

Basically, this is for a link-validator project that validates an entire website: CSS background-images, images, JS, HTML, etc. Based on the content type, I filter and parse only HTML content.

P.S. Raising the timeout above 9 seconds would make the link validation too slow, so my current timeout is 9 seconds.

I need help, as I want to make my link validation tool as reliable as possible.

It sounds like your problem is divided into two parts:

  1. How do I get the content from the remote server

  2. How do I then parse the content to do my link validation

Your question is really about part 1, but you are tackling both parts at the same time, which may be part of your issue.

The real issue is reading the remote content. All three approaches end up reading the content through the exact same underlying API, namely the JRE's built-in URLConnection-based machinery. URLConnection is OK, but not really what you would want to use on a real network.

There are a couple of better libraries that you can use for getting the content of remote resources over the HTTP protocol...

  1. Netty from JBoss
  2. HttpComponents from Apache
  3. AsyncHttpClient from Jean-Francois

I find that AsyncHttpClient is far and away the best one to use, even in blocking mode. It has a very nice API for getting the pages, and it works well with multi-threading. You should find it easy to get the total load time, and more importantly you should be able to make a lot of the work happen in parallel.
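A minimal blocking-mode sketch of fetching a page and timing it with AsyncHttpClient. This assumes the `org.asynchttpclient` 2.x artifact (class and method names differ slightly between versions), and the URL is a placeholder:

```java
import java.util.concurrent.TimeUnit;

import org.asynchttpclient.AsyncHttpClient;
import org.asynchttpclient.DefaultAsyncHttpClient;
import org.asynchttpclient.DefaultAsyncHttpClientConfig;
import org.asynchttpclient.Response;

public class AhcFetch {

    // The 9-second budget from the question, applied to the whole request.
    static final int TIMEOUT_MILLIS = 9_000;

    // Blocking mode: execute() returns a future, get() waits for it.
    static Response fetch(AsyncHttpClient client, String url) throws Exception {
        return client.prepareGet(url).execute().get(TIMEOUT_MILLIS, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        DefaultAsyncHttpClientConfig config = new DefaultAsyncHttpClientConfig.Builder()
                .setRequestTimeout(TIMEOUT_MILLIS)
                .build();
        try (AsyncHttpClient client = new DefaultAsyncHttpClient(config)) {
            long start = System.nanoTime();
            Response response = fetch(client, "http://example.com/"); // placeholder URL
            long loadTimeMillis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(response.getContentType() + " in " + loadTimeMillis + " ms");
        }
    }
}
```

Because the client is non-blocking underneath, you can fire many `execute()` calls at once and collect the futures, which is how the parallelism mentioned above falls out naturally.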

In essence, you will use AsyncHttpClient to load the content and then pass that content on to JSoup (or whatever you prefer... JSoup is the one I would recommend) and do the parsing there.

The mistake is not in using JSoup or HtmlUnit, but rather in trying to use them to do everything. These are tools designed to do one thing and do that one thing well... you need to do two things, so use two tools, each optimized for the task at hand.
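To illustrate the JSoup half of that split: once the content is already in hand, pulling out the candidate links to validate is a few lines. The selector list here is an illustration, not exhaustive (it does not cover CSS `background-image` URLs, for example):

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {

    // Parse already-fetched HTML and return absolute URLs for the common
    // link-bearing elements. The base URL resolves relative references.
    static List<String> extractLinks(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        List<String> links = new ArrayList<>();
        for (Element el : doc.select("a[href], link[href], img[src], script[src]")) {
            String attr = el.hasAttr("href") ? "href" : "src";
            String abs = el.absUrl(attr); // empty string if the URL is malformed
            if (!abs.isEmpty()) {
                links.add(abs);
            }
        }
        return links;
    }
}
```

Each extracted URL then goes back through the HTTP client; only responses whose content type is HTML get parsed again, which is exactly the content-type filtering described in the question.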