Retrieve values from a dictionary: range matching on keys


I created a dictionary myDict holding 10 million entries. Each entry has the form {(id, age): code}:

>>> myDict = {('1039', '68.0864'): '42731,42781,V4501',
              ('1039', '68.1704'): '4770,4778,V071',
              ('0845', '60.4476'): '2724,27800,4019',
              ('0983', '63.3936'): '41401,4168,4240,V1582,V7281'}

A constant ageOffset is defined with the value 0.1.

Given an (id,age) tuple, how can I fetch all values from myDict which have key (id, X) where:

age <= X <= age+ageOffset

I need to perform this fetch operation 20 billion times.

myTup = ('1039', '68.0')
Answer: '42731,42781,V4501'

myTup = ('0845', '60.0')
Answer: no value returned

Edit: Could I create a sub-dictionary based on a partial match on the first element of the key? That is, collect all entries whose key has the same first element (the id) into one sub-dictionary. According to my data, a sub-dictionary won't hold more than a couple of hundred entries. I could then do a linear range search over it, comparing the second element of each key (the age) and collecting the corresponding values.

To do this operation 20 billion(!) times, you will have to preprocess your data a bit.

First, I would group by id:

def preprocess(data):
    from collections import defaultdict # Python 2.5+ only
    preprocessed = defaultdict(list)
    # group by id, converting age from string to float
    for (id, age), value in data.iteritems():
        preprocessed[id].append((float(age), value))
    # sort each list by age for binary search, see edit
    for key, value in preprocessed.iteritems():
        value.sort()
    return preprocessed

Result should look like this:

>>> preprocess(myDict)
defaultdict(<type 'list'>, {
    '0845': [(60.4476, '2724,27800,4019')],
    '0983': [(63.3936, '41401,4168,4240,V1582,V7281')],
    '1039': [(68.0864, '42731,42781,V4501'), (68.1704, '4770,4778,V071')]})

If relatively few items share the same id, thus resulting in short lists, you might get away with filtering the list.

def lookup(data, id, age, age_offset=0.1):
    if id in data:
        # linear scan over all (age, value) pairs stored for this id
        return [value for x, value in data[id] if age <= x <= age+age_offset]
    return None

preprocessed = preprocess(myDict)
lookup(preprocessed, '1039', 68.0) # Note that I use floats for age

However, if many items share the same id, you will have to traverse long lists, making the lookup relatively slow. In this case, you will have to apply further optimizations.

Edit: as suggested by @Andrey Petrov, since each list is sorted by age, a binary search can find the start of the range:

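from bisect import bisect_left
from itertools import islice, takewhile
def optimized_lookup(data, id, age, age_offset=0.1):
    if id in data:
        l = data[id]  # list of (age, value) tuples, sorted by age
        # (age,) sorts before any (age, value), so idx is the first entry with x >= age
        idx = bisect_left(l, (age,))
        upper = age + age_offset
        # collect values (not ages) until the age exceeds the upper bound
        return [v for a, v in takewhile(lambda item: item[0] <= upper, islice(l, idx, None))]
    return None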
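
For reference, here is a minimal self-contained sketch of the same idea in Python 3 syntax. It is a variation, not the code above: it assumes you keep two parallel lists per id (sorted ages and their codes), so both ends of the range can be found with bisect and the result is a plain slice. The names build_index and range_lookup are made up for illustration, and it returns an empty list instead of None when nothing matches.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(data):
    """Group {(id, age): code} by id into parallel, age-sorted lists."""
    grouped = defaultdict(list)
    for (id_, age), code in data.items():
        grouped[id_].append((float(age), code))
    index = {}
    for id_, pairs in grouped.items():
        pairs.sort()  # sort by age (ties broken by code)
        ages = [a for a, _ in pairs]
        codes = [c for _, c in pairs]
        index[id_] = (ages, codes)
    return index


def range_lookup(index, id_, age, age_offset=0.1):
    """Return all codes for id_ whose age X satisfies age <= X <= age + age_offset."""
    entry = index.get(id_)
    if entry is None:
        return []  # assumption: unknown ids yield an empty list
    ages, codes = entry
    lo = bisect_left(ages, age)                # index of first X >= age
    hi = bisect_right(ages, age + age_offset)  # one past last X <= age + age_offset
    return codes[lo:hi]


my_dict = {('1039', '68.0864'): '42731,42781,V4501',
           ('1039', '68.1704'): '4770,4778,V071',
           ('0845', '60.4476'): '2724,27800,4019',
           ('0983', '63.3936'): '41401,4168,4240,V1582,V7281'}

index = build_index(my_dict)
print(range_lookup(index, '1039', 68.0))
print(range_lookup(index, '0845', 60.0))
```

Each lookup costs two binary searches plus the size of the slice, so it stays cheap even if some ids have long lists. Note that the upper bound is computed in floating point, so an age that sits exactly on age + age_offset may or may not be included.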