Is there an analog for reduce for a pandas Series?
For example, the analog for map is pd.Series.apply, but I can't find any analog for reduce.
In my application, I have a pandas Series of lists:
>>> business["categories"].head()
0                      ['Doctors', 'Health & Medical']
1                                        ['Nightlife']
2                 ['Active Life', 'Mini Golf', 'Golf']
3    ['Shopping', 'Home Services', 'Internet Servic...
4    ['Bars', 'American (New)', 'Nightlife', 'Loung...
Name: categories, dtype: object
I'd like to merge the Series of lists together using
reduce, like so:
categories = reduce(lambda l1, l2: l1 + l2, categories)
but this takes a horrifically long time, because concatenating two lists is O(n) in Python, so the repeated concatenation inside reduce is quadratic overall. I'm hoping that pd.Series has a vectorized way to perform this faster.
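To illustrate, here is a toy benchmark (arbitrary sizes, not my real data); if the cost is quadratic, doubling the number of lists should roughly quadruple the time:

from functools import reduce
import timeit
import pandas as pd

for n in (1000, 2000, 4000):
    s = pd.Series([['a', 'b']] * n)
    # reduce builds a brand-new list on every step, so the total work
    # grows with the sum of all intermediate list lengths
    t = timeit.timeit(lambda: reduce(lambda l1, l2: l1 + l2, s), number=3)
    print(f"{n} lists: {t:.3f} s")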
itertools.chain() on the values
This could be faster:
from itertools import chain
categories = list(chain.from_iterable(categories.values))
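For instance, on a toy Series mimicking the data above:

import pandas as pd
from itertools import chain

s = pd.Series([['Doctors', 'Health & Medical'], ['Nightlife']])
print(list(chain.from_iterable(s.values)))
# ['Doctors', 'Health & Medical', 'Nightlife']

Benchmarking both chain variants against reduce: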
from functools import reduce
from itertools import chain
import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop

%timeit list(chain(*categories.values.flat))
1000 loops, best of 3: 237 µs per loop

%timeit reduce(lambda l1, l2: l1 + l2, categories)
100 loops, best of 3: 15.8 ms per loop
For this data set, chaining is about 68x faster.
Vectorization works when you have native NumPy data types (pandas uses NumPy for its data, after all). Since the Series already holds Python lists and we want a plain list as the result, it is rather unlikely that vectorization will speed things up: the conversion between standard Python objects and pandas/NumPy data types would likely eat up any performance gained from vectorizing. I made one attempt to vectorize the algorithm in another answer.
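For reference, here is a minimal sketch of what such an attempt could look like; this assumes np.concatenate as the approach and is not the code from that other answer:

import numpy as np
import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# np.concatenate first converts each Python list into a NumPy string
# array before joining them, and .tolist() converts the result back
# into a plain Python list of strings
flat = np.concatenate(categories.values).tolist()

That round trip through NumPy string arrays is exactly the conversion cost described above, which is why the plain chain.from_iterable approach tends to win on this kind of data.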