How can I retrieve data from the Israeli Statistics Bureau Web Query Tool?

advertisements

The following url:

http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7

Gives a data generator of information from the Israeli government which limits the number of data points extracted to a maximum of 50 series at a time. I wonder, is it possible (and if so, how) to write a webscraper (in your favorite language/software) that can follow the clicks on each step to be able to get all of the series in a specific topic.

Thanks.


To submit the forms, you can use Python's mechanize module:

import mechanize
import pprint
import lxml.etree as ET
import lxml.html as lh
import urllib
import urllib2

browser=mechanize.Browser()
browser.open("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7")
browser.select_form(nr=0)

Here we peek at the options available:

pprint.pprint(browser.form.controls[-2].items)
# [<Item name='1' id=None selected='selected' contents='Volume of orders for the domestic market' value='1' label='Volume of orders for the domestic market'>,
#  <Item name='2' id=None contents='Orders for export' value='2' label='Orders for export'>,
#  <Item name='3' id=None contents='The volume of production' value='3' label='The volume of production'>,
#  <Item name='4' id=None contents='The volume of sales' value='4' label='The volume of sales'>,
#  <Item name='5' id=None contents='Stocks of finished goods' value='5' label='Stocks of finished goods'>,
#  <Item name='6' id=None contents='Access to credit for the company' value='6' label='Access to credit for the company'>,
#  <Item name='7' id=None contents='Change in the number of employees' value='7' label='Change in the number of employees'>]

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['1', '2', '3', '4', '5', '6', '7']

browser.form['name_tatser']=['2']
browser.submit()

We can repeat this for each of the subsequent forms:

browser.select_form(nr=1)

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['1576', '1581', '1594', '1595', '1596', '1598', '1597', '1593']

browser.form['name_ser']=['1576']
browser.submit()

browser.select_form(nr=2)

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['32', '33', '34', '35', '36', '37', '38', '39', '40', '41']

browser.form['data_kind']=['33']
browser.submit()

browser.select_form(nr=3)
browser.form['ybegin']=['2010']
browser.form['mbegin']=['1']
browser.form['yend']=['2011']
browser.form['mend']=['5']
browser.submit()

At this point you have three options:

  1. Parse the data from the HTML source
  2. Download an .xls file
  3. Download an XML file

I don't have any experience parsing .xls in Python, so I passed over this option.

Parsing the HTML is possible with BeautifulSoup or lxml. Perhaps this would have been the shortest solution, but finding the right XPaths for the HTML was not immediately clear to me, so I went for the XML:

To download the XML from the cbs.gov.il website, one clicks on a button that calls a javascript function. Uh oh -- mechanize can not execute javascript functions. Thankfully, the javascript merely assembles a new url. Pulling out the parameters with lxml is easy:

content=browser.response().read()
doc=lh.fromstring(content)
params=dict((elt.attrib['name'],elt.attrib['value']) for elt in doc.xpath('//input'))
params['king_format']=2
url='http://www.cbs.gov.il/ts/databank/data_ts_format_e.xml'
params=urllib.urlencode(dict((p,params[p]) for p in [
    'king_format',
    'tod',
    'time_unit_list',
    'mend',
    'yend',
    'co_code_list',
    'name_tatser_list',
    'ybegin',
    'mbegin',
    'code_list',
    'co_name_tatser_list',
    'level_1',
    'level_2',
    'level_3']))

browser.open(url+'?'+params)
content=browser.response().read()

Now we reach another stumbling block: the XML is encoded in iso-8859-8-i. Python does not recognize this encoding. Not knowing what to do, I simply replaced iso-8859-8-i with iso-8859-8. I don't know what bad side-effects this might cause.

# A hack, since I do not know how to deal with iso-8859-8-i
content=content.replace('iso-8859-8-i','iso-8859-8')
doc=ET.fromstring(content)

Once you get this far, parsing the XML is easy:

for series in doc.xpath('/series_ts/Data_Set/Series'):
    print(series.attrib)
    # {'calc_kind': 'Weighted',
    #  'name_ser': 'Number Of Companies That Answered',
    #  'get_time': '2011-06-21',
    #  'name_topic': "Business Tendency Survey - Distributions Of Businesses By Industry, Kind Of Questions And Answers  - Manufacturing - Company'S Experience Over The Past Three Months - Orders For Export",
    #  'time_unit': 'Month',
    #  'code_series': '22978',
    #  'data_kind': '5-10 Employed Persons',
    #  'decimals': '0',
    #  'unit_kind': 'Number'}

    for elt in series.xpath('obs'):
        print(elt.attrib)
        # {'time_period': ' 2010-12', 'value': '40'}
        # {'time_period': ' 2011-01', 'value': '38'}
        # {'time_period': ' 2011-02', 'value': '40'}
        # {'time_period': ' 2011-03', 'value': '36'}
        # {'time_period': ' 2011-04', 'value': '30'}
        # {'time_period': ' 2011-05', 'value': '33'}