I have some data. 224,000 rows of it, in a SQLite database. I want to extract time series information from it to feed a data visualisation tool. Essentially, each row in the db is an event that has (among other things not strictly relevant) a time-date group in seconds since the epoch and a name responsible for it. I want to extract how many events each name has for every week in the db.
That's simple enough:
SELECT COUNT(*),
name,
strf("%W:%Y", time, "unixepoch")
FROM events
GROUP BY strf("%W:%Y", time, "unixepoch"), name
ORDER BY time
and we get about six thousand rows of data.
count name week:year
23............ fudge.......23:2009
etc...
But I don't want a row for each name in each week - I want a row for each name, and a column for each week, like this:
Name 23:2009 24:2009 25:2009
fudge........23............6............19
fish.........1.............0............12
etc...
Now, the monitoring process has been running for 69 weeks, and the count of unique names is 502. So clearly, I'm far from keen on any solution that involves hardcoding all the columns and still less the rows. I'm less unkeen on anything that involves iterating over the lot, say with python's executemany(), but I'm willing to accept it if necessary. SQL is meant to be set-wise, dammit.
A good approach in cases like this is not to push SQL to the point where it becomes convoluted and hard to understand and maintain. Let SQL do what it conveniently can and post-process the query results in Python.
Here's a cut-down version of a simple crosstab generator that I wrote. The full version delivers row/column/grand totals.
You'll note that it has built-in "group by" -- the original use-case was for summarising data obtained from Excel files using Python and xlrd.
The row_key
and col_key
that you supply don't need to be strings as in the example; they can be tuples -- e.g. (year, week)
in your case -- or they could be integers -- e.g. you have a mapping of string column name to integer sort key.
import sys
class CrossTab(object):
def __init__(
self,
missing=0, # what to return for an empty cell. Alternatives: '', 0.0, None, 'NULL'
):
self.missing = missing
self.col_key_set = set()
self.cell_dict = {}
self.headings_OK = False
def add_item(self, row_key, col_key, value):
self.col_key_set.add(col_key)
try:
self.cell_dict[row_key][col_key] += value
except KeyError:
try:
self.cell_dict[row_key][col_key] = value
except KeyError:
self.cell_dict[row_key] = {col_key: value}
def _process_headings(self):
if self.headings_OK:
return
self.row_headings = list(sorted(self.cell_dict.iterkeys()))
self.col_headings = list(sorted(self.col_key_set))
self.headings_OK = True
def get_col_headings(self):
self._process_headings()
return self.col_headings
def generate_row_info(self):
self._process_headings()
for row_key in self.row_headings:
row_dict = self.cell_dict[row_key]
row_vals = [row_dict.get(col_key, self.missing) for col_key in self.col_headings]
yield row_key, row_vals
def dump(self, f=None, header=None, footer='', ):
if f is None:
f = sys.stdout
alist = self.__dict__.items()
alist.sort()
if header is not None:
print >> f, header
for attr, value in alist:
print >> f, "%s: %r" % (attr, value)
if footer is not None:
print >> f, footer
if __name__ == "__main__":
data = [
['Rob', 'Morn', 240],
['Rob', 'Aft', 300],
['Joe', 'Morn', 70],
['Joe', 'Aft', 80],
['Jill', 'Morn', 100],
['Jill', 'Aft', 150],
['Rob', 'Aft', 40],
['Rob', 'aft', 5],
['Dozy', 'Aft', 1],
# Dozy doesn't show up till lunch-time
['Nemo', 'never', -1],
]
NAME, TIME, AMOUNT = range(3)
xlate_time = {'morn': "AM", "aft": "PM"}
print
ctab = CrossTab(missing=None, )
# ctab.dump(header='=== after init ===')
for s in data:
ctab.add_item(
row_key=s[NAME],
col_key= xlate_time.get(s[TIME].lower(), "XXXX"),
value=s[AMOUNT])
# ctab.dump(header='=== after add_item ===')
print ctab.get_col_headings()
# ctab.dump(header='=== after get_col_headings ===')
for x in ctab.generate_row_info():
print x
Output:
['AM', 'PM', 'XXXX']
('Dozy', [None, 1, None])
('Jill', [100, 150, None])
('Joe', [70, 80, None])
('Nemo', [None, None, -1])
('Rob', [240, 345, None])