SQLite swivel table, setwise like SQL should be


I have some data. 224,000 rows of it, in a SQLite database. I want to extract time series information from it to feed a data visualisation tool. Essentially, each row in the db is an event that has (among other things not strictly relevant) a time-date group in seconds since the epoch and a name responsible for it. I want to extract how many events each name has for every week in the db.

That's simple enough:

       strf("%W:%Y", time, "unixepoch")
  FROM events
 GROUP BY strf("%W:%Y", time, "unixepoch"), name
 ORDER BY time

and we get about six thousand rows of data.

count          name        week:year
23............ fudge.......23:2009

But I don't want a row for each name in each week - I want a row for each name, and a column for each week, like this:

Name      23:2009       24:2009    25:2009

Now, the monitoring process has been running for 69 weeks, and the count of unique names is 502. So clearly, I'm far from keen on any solution that involves hardcoding all the columns and still less the rows. I'm less unkeen on anything that involves iterating over the lot, say with python's executemany(), but I'm willing to accept it if necessary. SQL is meant to be set-wise, dammit.

A good approach in cases like this is not to push SQL to the point where it becomes convoluted and hard to understand and maintain. Let SQL do what it conveniently can and post-process the query results in Python.

Here's a cut-down version of a simple crosstab generator that I wrote. The full version delivers row/column/grand totals.

You'll note that it has built-in "group by" -- the original use-case was for summarising data obtained from Excel files using Python and xlrd.

The row_key and col_key that you supply don't need to be strings as in the example; they can be tuples -- e.g. (year, week) in your case -- or they could be integers -- e.g. you have a mapping of string column name to integer sort key.

import sys

class CrossTab(object):

    def __init__(
        missing=0, # what to return for an empty cell. Alternatives: '', 0.0, None, 'NULL'
        self.missing = missing
        self.col_key_set = set()
        self.cell_dict = {}
        self.headings_OK = False

    def add_item(self, row_key, col_key, value):
            self.cell_dict[row_key][col_key] += value
        except KeyError:
                self.cell_dict[row_key][col_key] = value
            except KeyError:
                self.cell_dict[row_key] = {col_key: value}

    def _process_headings(self):
        if self.headings_OK:
        self.row_headings = list(sorted(self.cell_dict.iterkeys()))
        self.col_headings = list(sorted(self.col_key_set))
        self.headings_OK = True

    def get_col_headings(self):
        return self.col_headings

    def generate_row_info(self):
        for row_key in self.row_headings:
            row_dict = self.cell_dict[row_key]
            row_vals = [row_dict.get(col_key, self.missing) for col_key in self.col_headings]
            yield row_key, row_vals

    def dump(self, f=None, header=None, footer='', ):
        if f is None:
            f = sys.stdout
        alist = self.__dict__.items()
        if header is not None:
            print >> f, header
        for attr, value in alist:
            print >> f, "%s: %r" % (attr, value)
        if footer is not None:
            print >> f, footer

if __name__ == "__main__":

    data = [
        ['Rob', 'Morn', 240],
        ['Rob', 'Aft',  300],
        ['Joe', 'Morn',  70],
        ['Joe', 'Aft',   80],
        ['Jill', 'Morn', 100],
        ['Jill', 'Aft',  150],
        ['Rob', 'Aft',   40],
        ['Rob', 'aft',    5],
        ['Dozy', 'Aft',   1],
        # Dozy doesn't show up till lunch-time
        ['Nemo', 'never', -1],
    NAME, TIME, AMOUNT = range(3)
    xlate_time = {'morn': "AM", "aft": "PM"}

    ctab = CrossTab(missing=None, )
    # ctab.dump(header='=== after init ===')
    for s in data:
            col_key= xlate_time.get(s[TIME].lower(), "XXXX"),
        # ctab.dump(header='=== after add_item ===')
    print ctab.get_col_headings()
    # ctab.dump(header='=== after get_col_headings ===')
    for x in ctab.generate_row_info():
        print x


['AM', 'PM', 'XXXX']
('Dozy', [None, 1, None])
('Jill', [100, 150, None])
('Joe', [70, 80, None])
('Nemo', [None, None, -1])
('Rob', [240, 345, None])