How to read the csv file correctly if each line contains a different number of fields (large enough number)?

advertisements

I have a text file from amazon, containing the following info:

 #      user        item     time   rating     review text (the header is added by me for explanation, not in the text file
  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use
  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant

As you see, the data is separated by space and there are different number of columns in each row. However, so it is the text content. Here is the code I have tried:

pd.read_csv(filename, sep = " ", header = None, names = ["user","item","time","rating", "review"], usecols = ["user", "item", "rating"])#I'd like to skip the text review part

And such an error occurs:

ValueError: Passed header names mismatches usecols

When I tried to read all the columns:

pd.read_csv(filename, sep = " ", header = None)

And the error this time is:

Error tokenizing data. C error: Expected 229 fields in line 3, saw 320

And given the review text is so long in many rows , the method of adding header names for each column in this question can not work.

I wonder how to read the csv file if I want to keep the review text and skip them respectively. Thank you in advance!

EDIT:

The problem has been solved by Martin Evans perfectly. But now I am playing with another data set with similar but different format. Now the order of the data is converse:

     # review text                          user        item     time   rating      (the header is added by me for explanation, not in the text file
   I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5
  Suck restaurant                           hjf2329ccc    TGjsk123     14423321    3

Do you have any idea to read it properly? It would be appreciated for any help!


As suggested, DictReader could also be used as follows to create a list of rows. This could then be imported as a frame in pandas:

import pandas as pd
import csv

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

with open('input.csv', 'rb') as f_input:
    for row in csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1], restkey=csv_header[-1], skipinitialspace=True):
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError, e:
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print frame

This would display the following:

         user      item rating                                  review
0  disjiad123  TYh23hs9      5  I love this phone as it is easy to use
1  hjf2329ccc  TGjsk123      3                         Suck restaurant


If the review appears at the start of the row, then one approach would be to parse the line in reverse as follows:

import pandas as pd
import csv

rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

with open('input.csv', 'rb') as f_input:
    for row in f_input:
        cols = [col[::-1] for col in row[::-1][2:].split(' ') if len(col)]
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print frame

This would display:

  rating      time      item        user  \
0      5  13160032  TYh23hs9   isjiad123
1      3  14423321  TGjsk123  hjf2329ccc   

                                    review
0  I love this phone as it is easy to used
1                          Suck restaurant

row[::-1] is used to reverse the text of the whole line, the [2:] skips over the line ending which is now at the start of the line. Each line is then split on spaces. A list comprehension then re-reverses each split entry. Finally rows is appended to first by taking the fixed 5 column entries (now at the start). The remaining entries are then joined back together with a space and added as the final column.

The benefit of this approach is that it does not rely on your input data being in an exactly fixed width format, and you don't have to worry if the column widths being used change over time.