Get Craigslist img src using JSoup

I am currently trying to parse a Craigslist page using JSoup for an Android application. Here is the URL to a page that I am trying to parse:

http://seattle.craigslist.org/search/sss?query=ford&sort=rel

When I inspect the elements using Chrome, I can see that the HTML structure for an ad is as follows:

<p class="row" data-pid="4711759405">
    <a href="/see/ctd/4711759405.html" class="i" data-id="0:00U0U_d4iR9oMNMBY">
        <img alt="" src="http://images.craigslist.org/00U0U_d4iR9oMNMBY_300x300.jpg">
    </a>
    <span class="txt">
        <span class="star v" title="save this post in your favorites list"></span>
        <span class="pl">
    ....

Using JSoup, I am able to parse everything EXCEPT for the img tag. Here is how I am making the HTTP request:

document = Jsoup.connect(url).get();
Elements images = document.select("img");

This finds only 2 images, neither of which is an ad image. I also replicated the HTTP GET request with the Chrome extension Postman, and the response contains no img tags for any of the ads. Why is this happening, and how can I retrieve the src URL of each ad image?

Note that I am able to retrieve everything except the img tags.


The ad images on the URL you gave are inserted by JavaScript after the page has loaded, which is why the initial HTML source does not contain any img tags for the ads.

However, there is a mapping between the data-id attribute of the a element in the HTML structure you posted and the src attribute of the generated img tag. For example, consider the following element:

<a href="/see/ctd/4711759405.html" class="i" data-id="0:00U0U_d4iR9oMNMBY">

Retrieve the data-id attribute from the a element, remove the part up to and including the colon, and append _300x300.jpg to get the name of the image file. The full URL then becomes:

http://images.craigslist.org/00U0U_d4iR9oMNMBY_300x300.jpg
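The rule above is plain string manipulation, so it can be sketched without any parsing library. This is a minimal sketch, not a definitive implementation: the class name CraigslistImageUrl and the method fromDataId are hypothetical, and the host and the _300x300.jpg suffix are assumed from the single example shown here, so they may not hold for every listing.

```java
public class CraigslistImageUrl {
    // Assumption: both values are taken from the one example URL in this answer.
    private static final String IMAGE_HOST = "http://images.craigslist.org/";
    private static final String THUMBNAIL_SUFFIX = "_300x300.jpg";

    /**
     * Builds a thumbnail URL from a data-id value such as "0:00U0U_d4iR9oMNMBY".
     * Returns null when the value is null or holds no image id.
     */
    public static String fromDataId(String dataId) {
        if (dataId == null) {
            return null;
        }
        // Keep only what follows the first colon. indexOf returns -1 when there
        // is no colon, so the whole string is kept in that case.
        String imageId = dataId.substring(dataId.indexOf(':') + 1).trim();
        if (imageId.isEmpty()) {
            return null;
        }
        return IMAGE_HOST + imageId + THUMBNAIL_SUFFIX;
    }

    public static void main(String[] args) {
        // prints http://images.craigslist.org/00U0U_d4iR9oMNMBY_300x300.jpg
        System.out.println(fromDataId("0:00U0U_d4iR9oMNMBY"));
    }
}
```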

So, instead of selecting img elements with JSoup, select the a elements and construct the image URLs from their data-id attributes.
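With JSoup, that selection could look roughly like the sketch below. The class name AdImageExtractor and the method imageUrls are hypothetical. The selector a[data-id] assumes that only ad links carry a data-id attribute, and the host and suffix are assumed from the single example URL in this answer. The inline HTML stands in for the document returned by Jsoup.connect(url).get(), so the sketch runs without a network request.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AdImageExtractor {

    /** Collects a thumbnail URL for every anchor that carries a data-id attribute. */
    public static List<String> imageUrls(Document document) {
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[data-id]")) {
            String dataId = link.attr("data-id");
            // Drop the prefix up to and including the first colon.
            String imageId = dataId.substring(dataId.indexOf(':') + 1);
            if (!imageId.isEmpty()) {
                // Assumption: host and size suffix follow the example URL.
                urls.add("http://images.craigslist.org/" + imageId + "_300x300.jpg");
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Stand-in for: Document document = Jsoup.connect(url).get();
        String html = "<p class=\"row\" data-pid=\"4711759405\">"
                + "<a href=\"/see/ctd/4711759405.html\" class=\"i\" "
                + "data-id=\"0:00U0U_d4iR9oMNMBY\"></a>"
                + "</p>";
        Document document = Jsoup.parse(html);
        for (String url : imageUrls(document)) {
            System.out.println(url);
        }
    }
}
```

Anchors without a data-id attribute are skipped by the selector, which is one plausible way to handle ads that have no image.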

Another solution would be to load the page in a WebView so that the JavaScript gets executed, but I strongly discourage this for performance reasons.