How to parse a string without losing the plus sign in PHP?


I am parsing HTML strings in PHP to extract values and write them to a database. Here is an example string:

<b>Adress:</b> 22 Examplary road, Nowhere <br>
<b>Phone:</b>  +371 12345678, +371 23456789<br>
<b>E-mail: </b>[email protected]<br>

The string can be formatted in arbitrary ways. It can contain additional keys that I am not parsing, and it can contain duplicate keys. It may also contain only some of the keys I am interested in, or be completely empty. The HTML can also be broken (example tag: <br). I have decided to assume that entries are separated by \n and have the form key: value plus some HTML.

First, I use this code to make the string parseable:

$parse = strip_tags($string);
$parse = str_replace(':', '=', $parse);
$parse = str_replace("\n", '&', $parse);
$parse = str_replace("\r", '', $parse);
$parse = str_replace("\t", '', $parse);

My string looks something like this now:

Adress= 22 Examplary road, Nowhere&Phone=  +371 12345678, +371 23456789&E-mail= [email protected]

Then I use parse_str() to get the values and then I take out the values if the needed keys are found:

        parse_str($parse, $values);

        $address = null;
        if (isset($values['Adress']))
            $address = trim($values['Adress']);

        $phone = null;
        if (isset($values['Phone']))
            $phone = trim($values['Phone']);

The problem is that I end up with $phone = '371 12345678, 371 23456789': the + signs are lost. How can I preserve them?
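For what it's worth, the + signs disappear because parse_str() treats its input as a URL query string, where + encodes a space. Below is a minimal sketch of a workaround that keeps the same steps, assuming it is enough to percent-encode + (and any literal %) before handing the string to parse_str(). The helper name parseKeyValueLines is hypothetical.

```php
<?php
// Hypothetical helper: same steps as above, but "+" is percent-encoded first
// so that parse_str() does not decode it as a space.
function parseKeyValueLines(string $html): array
{
    $parse = strip_tags($html);
    $parse = str_replace(["\r", "\t"], '', $parse);
    // Replacements run in order: "%" first, so existing percent signs
    // are not mistaken for escape sequences later on.
    $parse = str_replace(['%', '+'], ['%25', '%2B'], $parse);
    $parse = str_replace(':', '=', $parse);
    $parse = str_replace("\n", '&', $parse);

    parse_str($parse, $values);

    // Trim string values; anything else (e.g. nested arrays) is left alone.
    return array_map(fn($v) => is_string($v) ? trim($v) : $v, $values);
}

$string = "<b>Adress:</b> 22 Examplary road, Nowhere <br>\n"
        . "<b>Phone:</b>  +371 12345678, +371 23456789<br>\n";

$values = parseKeyValueLines($string);
echo $values['Phone'], "\n"; // +371 12345678, +371 23456789
```

Only these two characters are encoded, so : and \n can still be turned into = and & exactly as before.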

Also, if you have any hints on how to improve this procedure, I would be glad to hear them. Some entries have Website: example.com, others have Web Site example.com... I am pretty sure that it will not be possible to parse all of the information automatically, but I am looking for the best possible solution.

Solution

Using tips provided by WEBjuju I am now using this:

preg_match_all('/([^:]*):\s?(.*)\n/Usi', $string, $matches, PREG_SET_ORDER);

$values = [];
foreach ($matches as $match)
{
    $key = strip_tags($match[1]);
    $key = trim($key);
    $key = mb_strtolower($key);
    // "\s" is not a whitespace class inside a plain string, so use a regex
    $key = preg_replace('/\s+/', '', $key);
    $key = str_replace('-', '', $key);

    $value = strip_tags($match[2]);
    $value = trim($value);

    $values[$key] = $value;
}

This allows me to go from this input:

<b>Venue:</b> The Hall<br
<b>Adress:</b> 22 Examplary road, Nowhere <br>
<b>Phone:</b>  +371 12345678<br>
<b>E-mail: </b>[email protected]<br>
<b>Website:</b> <a href="http://example.com/" target="_blank">example.com</a><br>

To a nice PHP array with homogenized and hopefully recognizable keys:

[
    'venue' => 'The Hall',
    'adress' => '22 Examplary road, Nowhere',
    'phone' => '+371 12345678',
    'email' => '[email protected]',
    'website' => 'example.com',
];

It still doesn't account for the cases of missing colons, but I don't think I can solve that...
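One possible way to cover entries without a colon is to match each line against a list of labels known in advance. This is only a sketch under that assumption: the $knownLabels map, its patterns, and the function name parseKnownLabels are hypothetical and would need to be adapted to the labels that actually occur in the data.

```php
<?php
// Hypothetical label map: normalized key => regex fragment for its spellings.
$knownLabels = [
    'website' => 'web\s*site',
    'email'   => 'e-?mail',
    'phone'   => 'phone',
    'adress'  => 'add?ress',
];

function parseKnownLabels(string $html, array $labels): array
{
    $values = [];
    // Split into lines first, so a broken tag only affects its own line.
    foreach (preg_split('/\R/', $html) as $line) {
        $text = trim(strip_tags($line));
        foreach ($labels as $key => $pattern) {
            // Label at the start of the line, colon optional, value is the rest.
            $regex = '/^(?:' . $pattern . ')\b\s*:?\s*(.*)$/i';
            if (preg_match($regex, $text, $m)) {
                $values[$key] = trim($m[1]);
                break;
            }
        }
    }
    return $values;
}

$input = "<b>Web Site</b> example.com<br>\n"
       . "<b>Phone:</b>  +371 12345678<br\n"
       . "<b>Venue:</b> The Hall<br>\n";

$values = parseKnownLabels($input, $knownLabels);
print_r($values);
```

Lines whose label is not in the map (Venue here) are simply skipped, so this trades generality for tolerance of missing colons.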


Since your HTML is preformatted and follows a simple, regular structure, I can tell you that regular expression matching will be the best way to grab this data. Here is an example to get you on your way. I'm sure it doesn't solve everything, but it solves the issue in this post, where you are having trouble finding key/value matches.

// now go get those matches!
preg_match_all('/<b>([^:]*):\s?<\/b>(.*)<br>/Usi', $string, $matches, PREG_SET_ORDER);
die('<pre>'.print_r($matches,true));

That will output, for instance, something like this:

Array
(
  [0] => Array
    (
        [0] => <b>Adress:</b> 22 Examplary road, Nowhere <br>
        [1] => Adress
        [2] =>  22 Examplary road, Nowhere
    )

  [1] => Array
    (
        [0] => <b>Phone:</b>  +371 12345678, +371 23456789<br>
        [1] => Phone
        [2] =>   +371 12345678, +371 23456789
    )

  [2] => Array
    (
        [0] => <b>E-mail: </b>[email protected]<br>
        [1] => E-mail
        [2] => [email protected]
    )

)

And from there, I'd have to guess that you can putt that in for par.
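As a rough sketch of that last step, the matches can be folded into an associative array. Since the question mentions duplicate keys, this version collects every value under its key instead of overwriting. The sample $string and the grouping choice are assumptions for illustration.

```php
<?php
$string = "<b>Adress:</b> 22 Examplary road, Nowhere <br>\n"
        . "<b>Phone:</b>  +371 12345678<br>\n"
        . "<b>Phone:</b>  +371 23456789<br>\n";

preg_match_all('/<b>([^:]*):\s?<\/b>(.*)<br>/Usi', $string, $matches, PREG_SET_ORDER);

// Group values by key so that duplicate keys are kept rather than overwritten.
$values = [];
foreach ($matches as $match) {
    $key = trim($match[1]);
    $values[$key][] = trim(strip_tags($match[2]));
}

print_r($values);
```

Each key then maps to a list of values, so a single-valued key such as Adress ends up as a one-element array.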