numpy.loadtxt raises a conversion error to int when a field contains a decimal point

I'm having trouble trying to load a txt file into a structured array.

Here's a simple example showing the problem.

This works fine:

import numpy as np
from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO

in1 = StringIO("123 456 789\n231 543 876")
a = np.loadtxt(in1, dtype=[('x', "int"), ('y', "int"), ('z', "int")])

####output
array([(123, 456, 789), (231, 543, 876)],
      dtype=[('x', '<i8'), ('y', '<i8'), ('z', '<i8')])

But when one of the fields contains a decimal I get an error trying to convert it to an int:

in2 = StringIO("123 456 789\n231 543.0 876")
a = np.loadtxt(in2, dtype=[('x', "int"), ('y', "int"), ('z', "int")])

####error
ValueError: invalid literal for long() with base 10: '543.0'

I want Python to be able to convert a number like "543.0" into 543 without throwing an error.

If it were just a single number I could use something like

int(float("543.0"))

But can I do this in combination with numpy's loadtxt?

In practice, the file I'm trying to read is about 2 GB and has a complicated dtype with 37 fields: a mixture of floats, strings, and ints.

I've tried numpy.genfromtxt, which seems to work for smaller files, but it uses too much memory on the 2 GB file.

Another option I've considered is to truncate all the numbers that end in ".0" with sed, which will work, but is more of a hack than a real solution.

Is there a more pythonic approach?

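Answered (thanks Zhenya): read the integer fields as floats first, then cast the result to the intended dtype (dtype1 below is the target structured dtype).

dtypeTmp = np.dtype([(d[0], "<f8") if d[1] == "<i8" else d for d in dtype1.descr])
events = np.loadtxt("file.txt", dtype=dtypeTmp)
events = events.astype(dtype1)  # astype returns a new array, so keep the result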
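
Here is a self-contained sketch of the same idea that can be tried on a small input first. The sample data, the field names, and the use of np.issubdtype to pick out the integer fields (instead of comparing against "<i8", which depends on the platform) are assumptions for illustration, not part of the snippet above.

```python
import numpy as np
from io import StringIO  # Python 3

# Hypothetical target dtype and sample data, for illustration only.
dtype1 = np.dtype([('x', 'int64'), ('y', 'int64'), ('z', 'int64'), ('r', 'float64')])
data = StringIO("123 456 789 0.95\n231 543.0 876 0.87")

# Temporary dtype: every integer field is read as float64 instead.
dtype_tmp = np.dtype([
    (name, np.float64) if np.issubdtype(dtype1[name], np.integer) else (name, dtype1[name])
    for name in dtype1.names
])

events = np.loadtxt(data, dtype=dtype_tmp)
# astype returns a new array; each float field is cast to the matching int field.
events = events.astype(dtype1)
```

The cast briefly holds both arrays in memory, which may matter for a 2 GB file. Also, float64 represents integers exactly only up to 2**53, so very large IDs could lose precision on the way through.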


For the fields that should be integers, you could use a converter that does int(float(fieldval)). The following shows one way you could create the loadtxt converters argument programmatically, based on the dtype:

In [77]: in3 = StringIO("123.0 456 789 0.95\n231 543.0 876 0.87")

In [78]: dt = np.dtype([('x', "int"), ('y', "int"), ('z', "int"), ('r', "float")])

In [79]: converters = dict((k, lambda s: int(float(s))) for k in range(len(dt)) if np.issubdtype(dt[k], np.integer))

In [80]: converters
Out[80]:
{0: <function __main__.<lambda>>,
 1: <function __main__.<lambda>>,
 2: <function __main__.<lambda>>}

In [81]: a = np.loadtxt(in3, dtype=dt, converters=converters)

In [82]: a
Out[82]:
array([(123, 456, 789, 0.95), (231, 543, 876, 0.87)],
      dtype=[('x', '<i8'), ('y', '<i8'), ('z', '<i8'), ('r', '<f8')])
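
One caveat: int(float(s)) silently truncates a value such as "543.7" to 543. If that should be an error instead, a stricter converter could look like the sketch below. The name to_int_strict is hypothetical; it is meant to be used in place of the lambda in the converters dict.

```python
def to_int_strict(s):
    """Convert text like '543.0' to 543, but reject non-integral values."""
    value = float(s)  # float() accepts str as well as bytes
    if not value.is_integer():
        raise ValueError("expected an integral value, got %r" % (s,))
    return int(value)

print(to_int_strict("543.0"))  # 543
print(to_int_strict("876"))    # 876
```

With this, a stray "543.7" in an integer column raises a ValueError instead of being loaded as 543.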

Even with this, you might still run into performance or memory problems when using loadtxt on a 2 GB file. Have you looked into pandas? Its CSV reader (pandas.read_csv) is much faster than the text readers in numpy.
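
A minimal sketch of the pandas route, assuming whitespace-separated columns and no header row. The column names and the list of integer columns are made up for the example. Columns that contain values like "543.0" are parsed as floats and then cast to int64.

```python
import pandas as pd
from io import StringIO

data = StringIO("123.0 456 789 0.95\n231 543.0 876 0.87")

names = ["x", "y", "z", "r"]   # hypothetical column names
int_cols = ["x", "y", "z"]     # columns that should end up as integers

# Split on runs of whitespace; the file has no header line.
df = pd.read_csv(data, sep=r"\s+", header=None, names=names)
# Cast the integer columns; astype returns a new DataFrame.
df = df.astype({c: "int64" for c in int_cols})

# Convert to a NumPy record array if one is needed downstream.
events = df.to_records(index=False)
```

For a file that does not fit comfortably in memory, read_csv also accepts a chunksize argument that yields the data in pieces.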