I have over 65 million numeric values stored in a text file. I need to compute the maximum, minimum, average, standard deviation, and the 25th, 50th, and 75th percentiles.
Normally I would use the code below, but I need a more efficient way to compute these metrics because I cannot store all the values in a list. How can I calculate these values more effectively in Python?
```python
import numpy as np

# What I would normally do: load everything into a list, then ask NumPy.
mylist = [float(line) for line in open("foo.txt")]
print(np.min(mylist), np.max(mylist), np.average(mylist), np.std(mylist))
print(np.percentile(mylist, 25), np.percentile(mylist, 50), np.percentile(mylist, 75))

# Streaming version I have so far: max, min, and average in one pass,
# but no standard deviation or percentiles.
maxx = float('-inf')
minx = float('inf')
sumz = 0.0
for index, p in enumerate(open("foo.txt")):
    maxx = max(maxx, float(p))
    minx = min(minx, float(p))
    sumz += float(p)
index += 1  # enumerate starts at 0, so the count is the last index + 1
my_max = maxx
my_min = minx
my_avg = sumz / index
```
Use a binary file. Then you can use `numpy.memmap` to map it into memory and run all the usual NumPy operations on it, even if the dataset is larger than RAM.

You can even use `numpy.memmap` to create the memory-mapped array in the first place and read your data into it from the text file; when you are done working with it, you also have the data in binary format for next time.