NumPy array with different feature types for the scikit-learn dataset API in Python

I am trying to perform machine learning with scikit-learn on a dataset parsed from a JSON file. To use the dataset API in scikit-learn I need a NumPy array of shape (n_samples, n_features).

I have this data encoded as a nested Python list: the outer list has length 'X' (some large number of samples) and each element has the form [int, float, int] (3 features).

Ex: [ [int, float, int] , [int, float, int] , ... ]

I need to convert this into a NumPy array that will work properly with the scikit-learn dataset API, but I cannot seem to create a NumPy array that supports a different type for each column.

NumPy arrays are generally homogeneous, but I find it hard to believe that mixed feature/column types are a limitation of this API, and I have seen examples where different types of features are used.

The documentation on loading your own dataset is poor: http://scikit-learn.org/stable/tutorial/basic/tutorial.html. Any help creating the NumPy array and/or using the dataset API would be greatly appreciated.

My code is posted below (although the problem is what to do next):

import json

with open('bc_mp_at_blockchain.json') as data_:
    mp_json = json.load(data_)

with open('bc_tv_at_blockchain.json') as data:
    tv_json = json.load(data)

# each file is a dictionary of length 1 whose 'values' key holds the list of data points
list_of_mpdata = mp_json['values']
list_of_tvdata = tv_json['values']

# ensure both sets of data start on the same day
assert list_of_mpdata[0]['x'] == list_of_tvdata[0]['x']

# concatenate lists as necessary
combined_list = []
for mp_dict, tv_dict in zip(list_of_mpdata, list_of_tvdata):
    combined_list.append([mp_dict['x'], mp_dict['y'], tv_dict['y']])

# combined_list is now a list of [int, float, int] lists


If you have a list of lists, you can convert it to a NumPy array with np.array(combined_list). The length of the outer list becomes the first dimension (the rows), e.g.

>>> import numpy as np
>>> a = np.array([[1,2,3],[1,2,3]])
>>> a.shape
(2, 3)

If I understand correctly, that should be the (n_samples, n_features) order scikit-learn expects, but if not you can transpose the array using:

>>> a = a.T
>>> a.shape
(3, 2)
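
On the mixed-type concern in the question: when every feature is numeric, NumPy does not need a separate type per column. It promotes the int and float values to one common dtype. The sketch below uses made-up sample rows in the question's [int, float, int] layout (the values are hypothetical, not taken from the JSON files) to show what np.array returns for such a list.

```python
import numpy as np

# Hypothetical sample rows in the question's [int, float, int] layout.
combined_list = [
    [1325376000, 5.25, 120000],
    [1325462400, 5.5, 98000],
    [1325548800, 4.75, 143000],
]

# NumPy chooses a single dtype that can hold every value, so a mix of
# ints and floats is upcast to float64.
X = np.array(combined_list)

print(X.shape)  # (3, 3)
print(X.dtype)  # float64
```

The result is a single homogeneous float array with one row per sample, which as far as I know is the numeric layout scikit-learn estimators work with. Integers of this size are represented exactly in float64, so nothing is lost in the conversion here.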