How do I wait through a wait page and then download a PDF, using Python?

The Problem

I'm trying to download PDF files from a website built on top of a crotchety old mainframe, and to cope with the traffic the website serves wait pages. The wait page renders, you spend a few seconds looking at it instead of the PDF you want, and then it disappears and you land where you wanted to go.

Here's my scenario:

  1. I go to the page.
  2. Maybe 33% of the time, I get the wait page. Here's the wait page code:

<div id="wrapper">
    <p><hr /></p>
    </p>
        <div id="waiting-main">
            <p style="text-align: center; margin: 6px 0 15px 0;"><img src="/ns_images2/doblogo_1.jpg" border="0" />
            </p>
            <p style="text-align: center; font-size: 30px; line-height: 34px;">Just a moment</p>
            <p style="text-align: left; color: #525252; font-size: 20px; line-height: 22px;">
            Your request is being processed.</br></br>

            Due to the high demand it may take a little longer. You will be directed to the page shortly. Please do not leave this page. Refreshing the page will delay the response time. We apologize for the delay.</br></br>

            ...[snipped for brevity]...

            </p>

        </div>

    </div>

</body></html>


  3. The wait page goes away and I load the following HTML:

<html><body marginwidth="0" marginheight="0" style="background-color: rgb(38,38,38)"><embed width="100%" height="100%" name="plugin" src="http://a810-bisweb.nyc.gov/bisweb/CofoDocumentContentServlet?passjobnumber=null&amp;cofomatadata1=cofo&amp;cofomatadata2=M&amp;cofomatadata3=000&amp;cofomatadata4=092000&amp;cofomatadata5=M000092531.PDF&amp;requestid=5" type="application/pdf"><div id="annotationContainer"><style>#annotationContainer {    overflow: hidden;     position: absolute;     pointer-events: none;     top: 0;     left: 0;     right: 0;     bottom: 0;     display: -webkit-box;     -webkit-box-align: center;     -webkit-box-pack: center; } .annotation {     position: absolute;     pointer-events: auto; } textarea.annotation {     resize: none; } input.annotation[type='password'] {     position: static;     width: 200px;     margin-top: 100px; } </style></div></body></html>


  4. I download the PDF document locally. The end! (A sketch of these steps as plain HTTP responses follows below.)
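
To make the three outcomes concrete, here is a minimal sketch of what each response looks like to plain requests, assuming the markup quoted above is representative (the URL and output file name are placeholders):

import re
import requests

req_str = "http://a810-bisweb.nyc.gov/bisweb/..."  # placeholder document URL

r = requests.get(req_str)
if "waiting-main" in r.text:
    # step 2: we drew the wait page instead of the document
    print("wait page; retry after a pause")
else:
    # step 3: the embed page; the PDF URL sits in the <embed src="..."> tag
    m = re.search(r'<embed[^>]*\ssrc="([^"]+)"', r.text)
    if m:
        pdf_url = m.group(1).replace("&amp;", "&")
        # step 4: fetch the PDF itself and save it locally
        with open("out.pdf", "wb") as f:
            f.write(requests.get(pdf_url).content)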

My attempted solution

I didn't know that selenium doesn't really support PDFs (or does it?), so this is my approach:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

_driver = webdriver.PhantomJS()

...
req_str = ...[a very long URL]...
_driver.get(req_str)
...

try:
    WebDriverWait(_driver, 10).until(
        # Cannot use:
        # lambda a: not a.presence_of_element_located((By.ID, "waiting-main"))
        # Because:
        # https://blog.mozilla.org/webqa/2012/07/12/how-to-webdriverwait/
        # Which suggests this working alternative.
        lambda s: len(s.find_elements(By.ID, "waiting-main")) == 0
    )
finally:
    _driver.save_screenshot("test.png") # Maybe?
    # How do I get the actual PDF code? :/
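
One idea I've considered but not verified: once the wait clears, read the src off the <embed> through the driver and fetch it with requests, carrying PhantomJS's cookies over in case the servlet checks the session:

# Untested sketch: pull the PDF URL from the <embed> and download it
# with requests, reusing the browser session's cookies.
import requests

pdf_url = _driver.find_element(By.TAG_NAME, "embed").get_attribute("src")

session = requests.Session()
for c in _driver.get_cookies():
    session.cookies.set(c["name"], c["value"])

with open("out.pdf", "wb") as f:
    f.write(session.get(pdf_url).content)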

The question

I can't see a way to do this with selenium. So my question is:

How can I load a page, wait through a wait page, and then download the PDF that comes afterwards using Python (2.7)?

Alternatively, if this is possible with selenium, how can I do it?

The example

The link on this page exemplifies my problem.

The workaround

For now I'm using:

import time
import requests

r = requests.get(req_str)
while "waiting-main" in r.text:
    time.sleep(5)
    r = requests.get(req_str)

No word yet on how well it works...
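
If the wait page can persist, that loop will spin forever; a capped variant of the same idea (sketch only, with an arbitrary retry budget):

import time
import requests

r = requests.get(req_str)
for _ in range(24):  # give up after roughly two minutes of polling
    if "waiting-main" not in r.text:
        break
    time.sleep(5)
    r = requests.get(req_str)
else:
    raise RuntimeError("wait page never cleared")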

The page

http://a810-bisweb.nyc.gov/bisweb/COsByLocationServlet?requestid=4&allbin=1006360

I can get the page source consistently using requests; the following will get the PDF link and save the file:

from bs4 import BeautifulSoup
import requests
from urlparse import urljoin

# the servlet the form posts to when you click the pdf link in your browser
post_url = "http://a810-bisweb.nyc.gov/bisweb/CofoJobDocumentServlet"
base = "http://a810-bisweb.nyc.gov/bisweb/"
r = requests.get("http://a810-bisweb.nyc.gov/bisweb/COsByLocationServlet?requestid=4&allbin=1006360")

soup = BeautifulSoup(r.content, "html.parser")
# parse the form key/value pairs
form_data = {inp["name"]: inp["value"] for inp in soup.select("form[action=CofoJobDocumentServlet] input")}
# post the form data
nr = requests.post(post_url, data=form_data)
soup = BeautifulSoup(nr.content, "html.parser")

# get the link to the pdf to download
pdf = urljoin(base, soup.select_one("iframe")["src"])

# save the pdf to file
with open("out.pdf", "wb") as out:
    out.write(requests.get(pdf).content)
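
If the PDFs run large, the last step can stream to disk instead of buffering the whole body in memory; a drop-in variant of the final two lines:

# stream the download in chunks rather than holding the whole PDF in memory
resp = requests.get(pdf, stream=True)
with open("out.pdf", "wb") as out:
    for chunk in resp.iter_content(8192):
        out.write(chunk)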

If you are hitting the wait-page issue, you can use selenium to wait until the form is present and then pass the page source to bs4:

from bs4 import BeautifulSoup
import requests
from urlparse import urljoin

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait(dr, x, t):
    # block until at least one element matching the xpath is present
    elements = WebDriverWait(dr, t).until(
        EC.presence_of_all_elements_located((By.XPATH, x))
    )
    return elements

dr = webdriver.PhantomJS()
dr.get("http://a810-bisweb.nyc.gov/bisweb/COsByLocationServlet?requestid=4&allbin=1006360")

# outlast the wait page: block until the document form shows up
wait(dr, "//form[@action='CofoJobDocumentServlet']", 30)

post_url = "http://a810-bisweb.nyc.gov/bisweb/CofoJobDocumentServlet"
base = "http://a810-bisweb.nyc.gov/bisweb/"

soup = BeautifulSoup(dr.page_source, "html.parser")

form_data = {inp["name"]: inp["value"] for inp in soup.select("form[action=CofoJobDocumentServlet] input")}

nr = requests.post(post_url, data=form_data)
soup = BeautifulSoup(nr.content, "html.parser")

pdf = urljoin(base, soup.select_one("iframe")["src"])

with open("out.pdf", "wb") as out:
    out.write(requests.get(pdf).content)
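
Two small additions may be worth folding in: shut PhantomJS down once the source is captured, and sanity-check that the final response is actually a PDF rather than another wait page before writing it (a hedged guard, not confirmed against this servlet):

dr.quit()  # shut the headless browser down once we have the page source

# hypothetical guard: make sure the servlet sent a PDF, not another wait page
resp = requests.get(pdf)
ctype = resp.headers.get("content-type", "")
if "pdf" not in ctype.lower():
    raise RuntimeError("expected a PDF, got %r" % ctype)
with open("out.pdf", "wb") as out:
    out.write(resp.content)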