An efficient way of computing statistics for a large amount of numeric data


I have over 65 million numeric values stored in a text file. I need to compute the maximum, minimum, average, standard deviation, and the 25th, 50th, and 75th percentiles.

Normally I would use the code below, but I need a more efficient way to compute these metrics because I cannot store every value in a list. How can I calculate these values more efficiently in Python?

import numpy as np

# obj would have to hold all 65 million values in memory:
np.percentile(obj, 25)
np.percentile(obj, 50)
np.percentile(obj, 75)

maxx = float('-inf')
minx = float('inf')
sumz = 0.0
count = 0
for line in open("foo.txt", "r"):
    p = float(line)          # parse each line only once
    maxx = max(maxx, p)
    minx = min(minx, p)
    sumz += p
    count += 1
my_max = maxx
my_min = minx
my_avg = sumz / count
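If the standard deviation is also wanted in the same single pass, a running-variance update (Welford's online algorithm, which is not in the question's code) avoids storing the values. A minimal sketch, using a tiny stand-in file in place of the real 65M-line foo.txt:

```python
import math

# Stand-in data; in the question this would be the 65M-line foo.txt.
with open("foo.txt", "w") as f:
    f.write("\n".join(["1.0", "2.0", "3.0", "4.0"]))

# Single pass: running min/max plus Welford's update for mean/variance.
count, mean, m2 = 0, 0.0, 0.0
lo, hi = float("inf"), float("-inf")
for line in open("foo.txt"):
    x = float(line)
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)   # second term uses the *updated* mean
    lo, hi = min(lo, x), max(hi, x)

std = math.sqrt(m2 / count)    # population standard deviation
```

Exact percentiles, however, cannot be produced by a single streaming pass like this; that is where the memory-mapping suggestion in the answer below the question comes in.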

Use a binary file instead. Then you can memory-map it with numpy.memmap and run all sorts of algorithms on it, even if the dataset is larger than RAM.
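A sketch of that idea, assuming the values have already been converted to a raw float64 binary file (foo.bin is a hypothetical name; a tiny generated file stands in for the real data here):

```python
import numpy as np

# Stand-in for the real dataset: write 10 float64 values as raw binary.
np.arange(10, dtype=np.float64).tofile("foo.bin")

# Memory-map the file: elements are paged in on demand, so the whole
# dataset never has to fit in RAM at once.
data = np.memmap("foo.bin", dtype=np.float64, mode="r")

print(data.min(), data.max(), data.mean(), data.std())
print(np.percentile(data, [25, 50, 75]))
```

Note that np.percentile still needs to sort (a copy of) the data, so it is the most expensive of these calls; the elementwise reductions stream through the mapping cheaply.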

You can even use numpy.memmap to create a memory-mapped array, read your data into it from the text file, and work on it; when you are done, you also have the data in binary format for future runs.
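One way that conversion could look, assuming the line count is obtained with a cheap first pass (filenames here are hypothetical, and a tiny generated file stands in for the real one):

```python
import numpy as np

# Tiny stand-in for the real 65M-line text file.
with open("foo.txt", "w") as f:
    f.write("\n".join(str(float(x)) for x in range(5)))

# First pass: count the lines so the mapped array can be sized.
n = sum(1 for _ in open("foo.txt"))

# Create a writable memory-mapped array and fill it line by line.
arr = np.memmap("foo.dat", dtype=np.float64, mode="w+", shape=(n,))
for i, line in enumerate(open("foo.txt")):
    arr[i] = float(line)
arr.flush()  # the binary copy now persists in foo.dat for later runs
```

After this one-time conversion, subsequent runs can open foo.dat with mode="r" and skip the text parsing entirely.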