Crawl a website and then download all files of a specific type based on the occurrence of a text string on a page


I believe I have a rather unusual question, or at least I was unable to find a solution to a similar problem.

I want to crawl a website and, on each page, search for a particular text string. If the string is found, I want to download all files of a specific type (PDF) that are linked from that page.

I would appreciate a complete answer, but if anybody could just point me toward the software or framework needed to accomplish this, it would be greatly appreciated.

There is no off-the-shelf software that does this in one go, unless you are the owner of Google or Yahoo, who can and do crawl websites on a regular basis.

Anyway, jokes aside, with a little bit of programming you can easily do that. There is no need for a framework or anything of the sort.

You will need:

  1. Any LAMP-style package, such as XAMPP or WAMP.

  2. cURL to fetch the pages.

  3. Regex to parse the pages (RegexBuddy).

  4. wget to download the files, or whatever else you want to download.

You can easily look up each of these with a simple Google search. cURL fetches the HTML of each page, which you can store as a string in a variable. Next, use PHP's preg_match function to look for the exact string (the older ereg function was deprecated in PHP 5.3 and removed in PHP 7), and if the string is present, make a system call to wget to download the files. The linked website offers software that will help you learn a lot about regular expressions (regex).
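
The flow described above (fetch a page, look for the string, collect the PDF links, follow the other links) can be sketched roughly as follows. This is only an illustration, and it is written in Python with the standard library rather than the PHP, cURL and wget stack suggested in the answer. All names here (`LinkParser`, `extract_links`, `is_pdf`, `find_pdfs`, `download_all`, the `max_pages` limit) are hypothetical, and it assumes a plain substring match on the raw HTML and a crawl restricted to the starting host.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen, urlretrieve


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs (fragments removed) for all links in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def is_pdf(url):
    """Assumption: a link counts as a PDF if its path ends in .pdf."""
    return urlparse(url).path.lower().endswith(".pdf")


def fetch(url):
    """Download a page and decode it as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def find_pdfs(start_url, needle, fetch_page=fetch, max_pages=50):
    """Crawl pages on the starting host, breadth first, and return the
    PDF links found on every page whose raw HTML contains needle."""
    host = urlparse(start_url).netloc
    queue, seen, pdfs = [start_url], {start_url}, []
    visited = 0
    while queue and visited < max_pages:
        url = queue.pop(0)
        visited += 1
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that cannot be fetched
        links = extract_links(html, url)
        if needle in html:
            for link in links:
                if is_pdf(link) and link not in pdfs:
                    pdfs.append(link)
        for link in links:
            same_host = urlparse(link).netloc == host
            if same_host and not is_pdf(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pdfs


def download_all(urls, dest_dir="."):
    """Save each URL into dest_dir under the file name from its path."""
    for url in urls:
        name = os.path.basename(urlparse(url).path)
        urlretrieve(url, os.path.join(dest_dir, name))
```

A call such as `download_all(find_pdfs("http://example.com/", "some text"))` would then fetch the matching files; the URL and search string are placeholders. Passing a different `fetch_page` function makes it possible to try the crawl logic on canned pages without touching the network.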