I am trying to perform machine learning using scikit-learn on a dataset parsed from a JSON file. To use the dataset API in scikit-learn I need a NumPy array of shape (n_samples, n_features).
I have this data encoded as a nested Python list where the outer list has length 'X' (some large number of samples) and each element is of type [int, float, int] (3 features).
Ex: [ [int, float, int] , [int, float, int] , ... ]
I need to convert this into a NumPy array that will function properly with the scikit-learn dataset API, but I cannot seem to create a NumPy array that supports a different type for each column.
NumPy arrays are generally homogeneous, but I find it hard to believe that having different types of features/columns in a dataset is a limitation of this API, and I have seen examples where different types of features are used.
The documentation on loading your own dataset is poor: http://scikit-learn.org/stable/tutorial/basic/tutorial.html. Any help creating the NumPy array and/or using the dataset API would be greatly appreciated.
My code is posted below (although the problem is what to do next) :
    import json

    with open('bc_mp_at_blockchain.json') as data_:
        mp_json = json.load(data_)
    with open('bc_tv_at_blockchain.json') as data:
        tv_json = json.load(data)

    # each file is a dictionary of length 1 whose 'values' key holds a list of dicts
    list_of_mpdata = mp_json['values']
    list_of_tvdata = tv_json['values']

    # ensure both sets of data start on the same day (compare the first entries)
    assert list_of_mpdata[0]['x'] == list_of_tvdata[0]['x']

    # concatenate lists as necessary
    combined_list = []
    for mp_dict, tv_dict in zip(list_of_mpdata, list_of_tvdata):
        combined_list.append([mp_dict['x'], mp_dict['y'], tv_dict['y']])

    # combined_list is now a list of [int, float, int] lists
If you have a list of lists, you can convert it to a NumPy array with np.array(combined_list). The length of the outer list becomes the first dimension (rows), e.g.
    >>> a = np.array([[1,2,3],[1,2,3]])
    >>> a.shape
    (2, 3)
If I understand correctly, that should already be the (n_samples, n_features) layout that scikit-learn expects, but if not you can transpose the array using:
    >>> a = a.T
    >>> a.shape
    (3, 2)
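On the mixed-type concern from the question: as far as I know you do not need a per-column dtype here. When rows mix ints and floats, np.array picks a single common dtype, which for this case is float64, and a plain numeric 2-D array like that is the usual input form. A minimal sketch, assuming rows shaped like the question's [int, float, int] (the values below are made up for illustration, not taken from the real JSON files):

```python
import numpy as np

# Hypothetical sample rows in the [int, float, int] layout from the question.
combined_list = [
    [1417392000, 379.25, 58000],
    [1417478400, 381.72, 61500],
    [1417564800, 375.01, 49750],
]

# np.array chooses one common dtype for the whole array;
# mixing Python ints and floats upcasts everything to float64.
X = np.array(combined_list)
print(X.dtype)   # float64
print(X.shape)   # (3, 3) -> (n_samples, n_features)

# Being explicit about the dtype makes the intent clearer.
X = np.array(combined_list, dtype=np.float64)
```

The integer columns are stored as floats after the conversion, which is lossless for integers of this size (anything below 2**53).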