How can I read a CSV file if only non-empty fields are enclosed in double quotation marks?

I'm trying to read a CSV file in a Bash script. I achieved that successfully using gawk and specifying FPAT like:

gawk -v LOGFILE="${LOGFILE}" 'BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
    }
NR == 1{
    # doing some logic with header
}
NR >= 2{
    # doing some logic with fields
}' <filename>

The problem here is, the file contains data like:

"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"

Now, with this data I'm getting wrong results because the empty fields between consecutive commas are ignored, which gives me the wrong position numbers for the extracted data. For example, it reports "7865431234" at the 3rd position whereas it is actually the 6th field.

Can anyone suggest the changes to get the correct position of fields?


Your FPAT requires each field to contain at least one character, but you want to recognize empty fields with zero characters. Add an alternative to FPAT that allows zero characters:

gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("[%s]", $i); print "" }'

Note the extra | at the end of FPAT — it matches an empty field. The action simply prints the record number and the number of fields, and surrounds each field's value with square brackets.

When your data string is provided to that script, the output is:

1:8:["RAM"]["31st street, Bengaluru, India"][][][]["7865431234"][]["VALID"]

That shows the four empty fields quite clearly.

Now all you have to do is deal with:

"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,"",,,"INVALID"

where there are double quotes inside the quoted value. That's not dreadfully hard to manage:

gawk 'BEGIN { FPAT = "([^,]+)|(\"([^\"]|\"\")*\")[^,]*|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("%d[%s]", i, $i); print "" }' "$@"

The FPAT says that a field is:

  • a sequence of non-commas,
  • or it is a field started with a double quote, containing zero or more instances of either:

    • a non-quote, or
    • two double quotes

    followed by a double quote and optional non-comma data

  • or it is empty

Note that the 'optional non-comma data' will be empty in well-formed CSV; it only matches anything in malformed CSV data.

Given input data:

"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,,,,"INVALID"
"Some","","Empty","",Fields "" Wrapped,"",in quotes
"Malformed" CSV,Data,"Note it has data after" a close quote,"and before a comma,",,"INVALID"

This produces:

1:8:1["RAM"]2["31st street, Bengaluru, India"]3[]4[]5[]6["7865431234"]7[]8["VALID"]
2:8:1["Mr ""Manipulator"", the Artisan"]2["29th Street, Delhi, India"]3[]4[]5[]6[]7[]8["INVALID"]
3:7:1["Some"]2[""]3["Empty"]4[""]5[Fields "" Wrapped]6[""]7[in quotes]
4:6:1["Malformed" CSV]2[Data]3["Note it has data after" a close quote]4["and before a comma,"]5[]6["INVALID"]

Note that the field numbers are included as a prefix to the bracketed data (so I tweaked the print format slightly).

About the only format this doesn't handle is one where newlines can be embedded in the data for a field — by the nature of the line-based input, it assumes that no field is split over multiple lines. (It also means it won't properly recognize a field that starts with a double quote and doesn't have a matching double quote before the end of the line. I suppose you could add an alternative to recognize that. It would be better just to make the data right.)


Note the advice in Sobrique's answer to use a tool designed for CSV when you're handling CSV. That is generally a good idea, and the more complex the set of variations you have to deal with, the better an idea it is. This is about as complicated a regex as you should consider using. Also note that although RFC 4180 defines a version of CSV formally and rigorously, there are multiple programs (including MS Office) that handle different but related formats.