Performance of large data structures in Python


I'm looking for some help understanding the performance characteristics of large lists, dicts, or arrays in Python. I have about 1M key-value pairs that I need to store temporarily (this will grow to maybe 10M over the next year). The keys are database IDs ranging from 0 to about 1.1M (with some gaps) and the values are floats.

I'm calculating pagerank, so my process is to initialize each ID with a value of 1, then look it up in memory and update it about ten times before saving it back to the database.

  1. I'm theorizing that lists or arrays will be fastest if I use the database ID as the index into the array/list. This creates a gappy data structure, but I don't have a feel for how fast lookups or updates will be, or whether there's a big gain to be had from using arrays instead of lists (see the sketches after this list).

  2. Using a dict for this is very natural, with key-value pairs, but I get the impression that building the dict the first time would be slow and memory-intensive as it grows to accommodate all the entries.

  3. I also read that SQLite might be a good solution for this using a :memory: database, but I haven't dug into that too much yet (a minimal sketch follows below).
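For what it's worth, here is a rough, hypothetical sketch of options 1 and 2 that could be timed directly. The size, the stand-in set of IDs, and the toy update are assumptions, not the real pagerank step:

```python
import time
from array import array

N = 1_100_000          # roughly the highest database ID (assumption)
ids = range(0, N, 3)   # stand-in for the real, gappy set of IDs

# Option 1a: a plain list indexed directly by ID; gaps simply keep
# their initial value. Indexing and assignment are O(1).
t0 = time.perf_counter()
ranks_list = [1.0] * N
for i in ids:
    ranks_list[i] += 0.5
print("list:", time.perf_counter() - t0)

# Option 1b: array.array stores unboxed C doubles, so it typically
# uses less memory than a list of distinct Python floats; access
# speed is similar.
t0 = time.perf_counter()
ranks_arr = array("d", [1.0] * N)
for i in ids:
    ranks_arr[i] += 0.5
print("array:", time.perf_counter() - t0)

# Option 2: a dict keyed by ID. Lookups and updates are also O(1)
# on average, and there are no wasted slots for gaps, but each
# entry carries more overhead than a list slot.
t0 = time.perf_counter()
ranks_dict = dict.fromkeys(ids, 1.0)
for i in ids:
    ranks_dict[i] += 0.5
print("dict:", time.perf_counter() - t0)
```

And a minimal sqlite3 sketch for option 3, where ":memory:" is the special database path that keeps everything in RAM (table and column names here are made up):

```python
import sqlite3

# ":memory:" keeps the whole database in RAM instead of on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ranks (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO ranks VALUES (?, ?)",
    ((i, 1.0) for i in range(0, 1_100_000, 3)),
)
# reading or updating a single value means a SQL round trip each
# time, which is generally slower than a plain dict lookup
(value,) = conn.execute("SELECT value FROM ranks WHERE id = ?", (300,)).fetchone()
conn.execute("UPDATE ranks SET value = ? WHERE id = ?", (value + 0.5, 300))
conn.commit()
```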

Anyway, just looking for some guidance here. Any thoughts would be much appreciated as I'm digging in.


Just start with a dictionary. Even if you are running on WinXP, 10 million keys shouldn't be a problem. But I hope for your sake that you aren't. :)

A dictionary will be easier to code and probably faster to build and update, especially if you are updating the values in random order.

It's often best to start by coding a prototype and use it to identify performance issues. Your bottleneck will most likely be wherever you are requesting the data from, not inserting it into or retrieving it from a dictionary.
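Something along these lines would be enough to prototype with. The IDs and the update rule below are placeholders, not the actual pagerank formula:

```python
# Sketch of the dict-based approach suggested above.
ranks = {doc_id: 1.0 for doc_id in range(1_100_000)}   # init every ID to 1

for _ in range(10):              # ~10 update passes
    for doc_id in ranks:
        # placeholder update; a real pagerank step would sum the
        # ranks of the documents linking to doc_id
        ranks[doc_id] = 0.15 + 0.85 * ranks[doc_id]

# afterwards, write ranks.items() back to the database in batches
```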