How do I create a view on a table containing record fields?

We have a weekly backup process that exports our production Google App Engine Datastore to Google Cloud Storage, and from there into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables as of that day. Over time, we have accumulated many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a dataset named Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. That way, downstream reports can write their queries once and always retrieve the most recent data.

To do this, I have code that uses the BigQuery API to fetch the most recent dataset and the names of all the tables it contains. Then, for each of those tables, it fires a tables.insert call to create a view that is a SELECT * from the table I want to reference, roughly as sketched below.
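
For reference, a minimal sketch of that loop, assuming the google-api-python-client BigQuery v2 service; bq_service, get_latest_dataset_id, and create_latest_views are placeholder names of my own, not from the original script:

def create_latest_views(bq_service, project_id):
    # Find the most recent YYYY_MM_DD dataset (helper assumed to sort
    # dataset ids lexicographically, which works for this naming scheme).
    latest = get_latest_dataset_id(bq_service, project_id)
    tables = bq_service.tables().list(
            projectId=project_id,
            datasetId=latest
        ).execute().get('tables', [])

    for t in tables:
        table_id = t['tableReference']['tableId']
        # One view per production table, living in Latest_Production_Data.
        bq_service.tables().insert(
                projectId=project_id,
                datasetId='Latest_Production_Data',
                body={
                    'tableReference': {
                        'projectId': project_id,
                        'datasetId': 'Latest_Production_Data',
                        'tableId': table_id,
                    },
                    'view': {
                        'query': 'SELECT * FROM [%s.%s]' % (latest, table_id),
                    },
                }
            ).execute()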

This fails for tables that contain a RECORD field, due to what looks like a fairly benign column-naming rule.

For example, I have a table, AccountDeletionRequest, whose schema includes a Datastore key RECORD field with nested columns such as __key__.namespace. For it, I issue this API call:

{
  'tableReference': {
    'projectId': 'redacted',
    'tableId': u'AccountDeletionRequest',
    'datasetId': 'Latest_Production_Data'
  },
  'view': {
    'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
  },
}

This results in the following error:

HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long."

When I execute this query in the BigQuery web console, the offending columns are renamed automatically, translating the . to an _. I had expected the same thing to happen when I issued the create-view API call.

Is there an easy way to programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is with RECORD columns, but I anticipate a similar one for tables that have repeated fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?

Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.


Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:

def get_leaf_column_selectors(dataset, table):
    # table_service (the BigQuery tables() service) and BQ_PROJECT_ID are
    # assumed to be defined at module level.
    schema = table_service.get(
            projectId=BQ_PROJECT_ID,
            datasetId=dataset,
            tableId=table
        ).execute()['schema']

    return ",\n".join([
        _get_leaf_selectors("", top_field)
        for top_field in schema["fields"]
    ])

def _get_leaf_selectors(prefix, field):
    # Build "parent.child as parent_child" selectors, recursing into RECORDs.
    if prefix:
        fmt = prefix + ".%s"
    else:
        fmt = "%s"

    if 'fields' not in field:
        # Base case: a leaf column. Alias it so the dotted name becomes a
        # legal view column name, e.g. __key__.namespace as __key___namespace.
        actual_name = fmt % field["name"]
        safe_name = actual_name.replace(".", "_")
        return "%s as %s" % (actual_name, safe_name)
    else:
        # Recursive case: descend into the RECORD's sub-fields.
        return ",\n".join([
            _get_leaf_selectors(fmt % field["name"], sub_field)
            for sub_field in field["fields"]
        ])
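
Putting it together, the generated selector list gets dropped into the view definition in place of the SELECT *. A sketch, reusing the table_service handle and BQ_PROJECT_ID from the snippet above; create_view_for_table is a name of my own:

def create_view_for_table(dataset, table):
    # An explicit column list gives nested fields legal aliases, e.g.
    # __key__.namespace is selected as __key___namespace.
    selectors = get_leaf_column_selectors(dataset, table)
    query = "SELECT\n%s\nFROM [%s.%s]" % (selectors, dataset, table)
    table_service.insert(
            projectId=BQ_PROJECT_ID,
            datasetId='Latest_Production_Data',
            body={
                'tableReference': {
                    'projectId': BQ_PROJECT_ID,
                    'datasetId': 'Latest_Production_Data',
                    'tableId': table,
                },
                'view': {'query': query},
            }
        ).execute()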