I am currently trying to parse a Craigslist page using JSoup for an Android application. Here is the URL to a page that I am trying to parse:
When I inspect the elements using Chrome, I can see that the HTML structure for an ad is as follows:
<p class="row" data-pid="4711759405">
  <a href="/see/ctd/4711759405.html" class="i" data-id="0:00U0U_d4iR9oMNMBY">
    <img alt="" src="http://images.craigslist.org/00U0U_d4iR9oMNMBY_300x300.jpg">
  </a>
  <span class="txt">
    <span class="star v" title="save this post in your favorites list"></span>
    <span class="pl"> ....
Using JSoup, I am able to parse everything EXCEPT for the img tag. Here is how I am making the HTTP request:
document = Jsoup.connect(url).get();
Elements images = document.select("img");
This finds only 2 images, neither of which is an ad image. I also used the Chrome plugin Postman to replicate the HTTP GET request, and the response contains no img tags for any of the ads. Why is this happening, and how can I retrieve the src URL of each ad's img tag?
Note that I am able to retrieve everything else except the img tags.
However, there is a mapping between the data-id attribute of the <a> element in the HTML structure you posted and the src attribute of the generated <img> tag. For example, consider the following element:
<a href="/see/ctd/4711759405.html" class="i" data-id="0:00U0U_d4iR9oMNMBY">
Just retrieve the data-id attribute from the <a> element, remove the part before the colon (and the colon itself), and append _300x300.jpg to get the name of the image file. The full URL then becomes:
http://images.craigslist.org/00U0U_d4iR9oMNMBY_300x300.jpg
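As a rough sketch, the data-id to image URL mapping described above could be written in plain Java as follows. The class name AdImageUrls and the method fromDataId are hypothetical names chosen for illustration. The host and the _300x300.jpg suffix are taken from the single example in the question, so treat them as assumptions that may not hold for every ad.

```java
/**
 * Hypothetical helper for illustration only; it is not part of JSoup
 * or of any official Craigslist API.
 */
final class AdImageUrls {
    // Assumed from the one example in the question; other ads or sizes may differ.
    private static final String IMAGE_HOST = "http://images.craigslist.org/";
    private static final String SIZE_SUFFIX = "_300x300.jpg";

    private AdImageUrls() {
        // Static utility class, no instances.
    }

    /**
     * Builds the image URL for one data-id value such as "0:00U0U_d4iR9oMNMBY".
     * Returns null when the value is missing, has no colon, or has nothing
     * after the colon, since the format is then not the one shown in the post.
     */
    static String fromDataId(String dataId) {
        if (dataId == null) {
            return null;
        }
        int colon = dataId.indexOf(':');
        if (colon < 0) {
            return null;
        }
        // Drop everything up to and including the colon.
        String imageId = dataId.substring(colon + 1).trim();
        if (imageId.isEmpty()) {
            return null;
        }
        return IMAGE_HOST + imageId + SIZE_SUFFIX;
    }
}
```

With JSoup, the input values could be gathered by selecting the anchors, for example with document.select("a.i[data-id]"), and reading each one with element.attr("data-id") before passing it to the helper. Anchors for ads without a picture may lack a data-id, which is why the sketch returns null instead of throwing.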
So, instead of selecting <img> elements with JSoup, select <a> elements and construct the image URLs from their