Extract values from a CSV string containing empty fields with PCRE regex

I am trying to capture column values, including(!) empty columns, from simple CSV-like strings whose columns are separated by semicolons. I know that regex isn't the best approach for this and that a dedicated CSV parser would do a much better job, but in this case I have no choice other than to use PCRE regex to build HTML table <td> groups from this data.

The worst-case example, which should (still) work, looks like this:

;testvalue;"testvalue";"test "val"ue";test value;

... which should be literally interpreted like this:

empty | testvalue | testvalue | test "val"ue | test value | empty

... which finally gets rendered into this (not part of the question):

<td>empty</td>
<td>testvalue</td>
<td>testvalue</td>
<td>test "val"ue</td>
<td>test value</td>
<td>empty</td>

(UPDATE, as requested by @anubhava)

Sadly, there is another catch: the system where this will be implemented handles strings in a fixed way. It ONLY recognizes and ONLY alters captured groups of the string. Any part of the string that is not captured gets printed as-is along with the rest. That means we need to capture the semicolons in the regex even though we don't want them printed, so that we can remove them by ignoring their matching group.

Usually, it would be enough to print only the captured group, but that doesn't work here. Capturing ONLY the values would produce this output:

;;;;;
<td>empty</td>
<td>testvalue</td>
<td>testvalue</td>
<td>test "val"ue</td>
<td>test value</td>
<td>empty</td>

Maybe we need to capture the whole string in another group first, or capture the semicolons in another group so they can be thrown away in the output? ...


You can use this much simpler regex, which uses a lookbehind and captures the semicolons in a 3rd capture group:

$str = ';testvalue;"testvalue";"test "val"ue";test value;';
preg_match_all('/(?<=;|^)("?)([^;]*)\1(;|$)/', $str, $matches);

print_r($matches[2]);

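(?<=;|^) is a positive lookbehind that makes sure we match [^;]* only at the start of the string or right after a ;. ("?) and the backreference \1 match an optional pair of surrounding double quotes, ([^;]*) captures the value in group 2, and (;|$) captures the trailing semicolon (or the end of the string) in group 3.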

Output:

Array
(
    [0] =>
    [1] => testvalue
    [2] => testvalue
    [3] => test "val"ue
    [4] => test value
    [5] =>
)
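
To make the roles of the three groups easier to follow, here is the same pattern written with the x (extended) flag so each piece can carry a comment. This is only an illustrative sketch; the ~ delimiter and the variable name $pattern are my own choices, not part of the pattern above.

```php
<?php
// Same pattern as above, spelled out with the x (extended) flag.
// Whitespace and "#" comments inside the pattern are ignored by PCRE.
$pattern = '~
    (?<=;|^)   # lookbehind: a field starts at the string start or right after a ";"
    ("?)       # group 1: optional opening double quote
    ([^;]*)    # group 2: the field value, any run of non-semicolon characters
    \1         # the same quote again if group 1 matched one, otherwise nothing
    (;|$)      # group 3: the delimiter ending the field, or the end of the string
~x';

$str = ';testvalue;"testvalue";"test "val"ue";test value;';
preg_match_all($pattern, $str, $matches);

// $matches[0] holds the full matches (value plus its delimiter),
// $matches[2] holds only the unquoted values.
print_r($matches[0]);
print_r($matches[2]);
```

Because group 2 is [^;]*, a quoted field that itself contains a semicolon would be split in two. The sample data in the question has no such field, so this may not matter here.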

And to get the required HTML:

echo "<td>" . implode("</td>\n<td>", $matches[2]) . "</td>\n";

<td></td>
<td>testvalue</td>
<td>testvalue</td>
<td>test "val"ue</td>
<td>test value</td>
<td></td>
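
If the target system only rewrites matched parts of the string and prints everything else as-is, the point of group 3 is that every character of the input ends up inside some match. I don't know how that system works internally, so the following is just a sketch that assumes preg_replace_callback is a fair stand-in for it: each whole match (value plus delimiter) is replaced by a <td> cell, and groups 1 and 3 are simply dropped.

```php
<?php
$str = ';testvalue;"testvalue";"test "val"ue";test value;';

// Each match spans one field plus its trailing ";" (or the string end),
// so replacing whole matches leaves no stray semicolons behind.
$html = preg_replace_callback(
    '/(?<=;|^)("?)([^;]*)\1(;|$)/',
    function (array $m): string {
        // $m[2] is the unquoted value; $m[1] (quote) and $m[3] (delimiter)
        // are deliberately discarded.
        return '<td>' . $m[2] . "</td>\n";
    },
    $str
);

echo $html;
```

With the sample string, this should print the same six <td> lines as the implode() version, with no leftover semicolons.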